AI · Bullish · arXiv – CS AI · 10h ago · 7/10
P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture
Researchers propose P-JEPA, a new video representation learning architecture that handles procedural videos longer than 30 minutes by reducing complexity through dense action prediction. The method achieves state-of-the-art results on multiple benchmarks while using significantly fewer parameters than LLM-based approaches, and it supports real-time inference.