Researchers introduce One-to-Many Temporal Grounding (OMTG), a new AI task for localizing multiple video segments that match a single text query. They establish the first OMTG benchmark with 56k samples and novel evaluation metrics, and their model reaches 43.65% Effective Temporal F1 (EtF1), outperforming advanced models such as Gemini 2.5 Pro by 15.85%.
This research addresses a fundamental limitation in current video understanding systems. Existing temporal grounding models excel at matching a single text query to a single video segment, but real-world applications frequently require identifying multiple relevant segments within the same video. The paper identifies a critical gap: state-of-the-art multimodal large language models fail dramatically on this task because they lack event cardinality perception, the ability to recognize and locate multiple instances of the same concept.
The work builds on advances in video-language understanding but extends the problem formulation to be more practically relevant. Traditional one-to-one temporal grounding has dominated research, leaving a substantial blind spot in model capabilities. The new Count Accuracy and Effective Temporal F1 (EtF1) metrics provide standardized evaluation for a setting that previously had no benchmark.
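To make the idea of multi-segment evaluation concrete, here is a minimal sketch of what a count check and a segment-level F1 could look like. The summary does not give the paper's actual definitions of Count Accuracy or EtF1, so everything below is an assumption for illustration: the names `count_match` and `segment_f1`, the greedy IoU matching, and the 0.5 threshold are hypothetical and may differ from the benchmark's metrics.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Illustrative count check: did the model predict the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment], thr: float = 0.5) -> float:
    """Illustrative segment-level F1 (NOT the paper's EtF1 definition).

    Predictions are matched one-to-one to ground-truth segments, greedily by
    highest IoU; a pair counts as a true positive if its IoU is at least `thr`.
    """
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs have even lower IoU
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# Three ground-truth segments, but the model finds only two of them.
gt_segments = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
pred_segments = [(1.0, 10.0), (21.0, 29.0)]
print(count_match(pred_segments, gt_segments))
print(segment_f1(pred_segments, gt_segments))
```

In this toy case both predictions are well localized, yet the missing third segment lowers recall and the count check fails, which is the kind of error a one-to-one grounding metric would not expose.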
The method uses policy optimization with specialized reward functions, in particular a caption reward that applies chain-of-thought reasoning over dense video captions. This approach explicitly guides models toward both precision (accurate temporal boundaries) and completeness (finding all relevant segments), addressing the core cardinality perception problem.
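The trade-off between accurate boundaries and finding every segment can be illustrated with a simple coverage-based reward. This is only a sketch under stated assumptions: it is not the paper's caption reward (which reasons over dense captions), and the function `grounding_reward` and its weights `w_precise` and `w_complete` are hypothetical names chosen for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def merge(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or touching segments into a sorted, disjoint list."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(segments: List[Segment]) -> float:
    return sum(end - start for start, end in segments)


def overlap(a: List[Segment], b: List[Segment]) -> float:
    """Total overlapping duration between two disjoint segment lists."""
    return sum(
        max(0.0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b
    )


def grounding_reward(
    pred: List[Segment],
    gt: List[Segment],
    w_precise: float = 0.5,   # assumed weight, not from the paper
    w_complete: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Illustrative reward: weighted sum of two coverage terms.

    The first term is the fraction of predicted time that is relevant
    (penalizes loose boundaries); the second is the fraction of relevant
    time that was found (penalizes missed segments).
    """
    p, g = merge(pred), merge(gt)
    inter = overlap(p, g)
    precise_term = inter / total_length(p) if total_length(p) > 0 else 0.0
    complete_term = inter / total_length(g) if total_length(g) > 0 else 0.0
    return w_precise * precise_term + w_complete * complete_term


gt_segments = [(0.0, 10.0), (20.0, 30.0)]
print(grounding_reward([(0.0, 10.0)], gt_segments))  # exact, but misses a segment
print(grounding_reward([(0.0, 30.0)], gt_segments))  # finds everything, but too loose
print(grounding_reward(gt_segments, gt_segments))    # exact and complete
```

A reward of this shape gives partial credit to a model that finds only one occurrence or that covers everything with one oversized span, and full credit only when both terms are satisfied, which is the behavior the paragraph above describes as the optimization target.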
The 56k-sample dataset represents substantial annotation effort and establishes a foundation for future research. The performance gap between the proposed model and industry leaders such as Gemini 2.5 Pro suggests that OMTG is a genuine algorithmic challenge that simple scaling does not solve. This work matters for applications in video search, content analysis, surveillance systems, and video editing tools, where identifying multiple occurrences of events or objects is essential. The standardized benchmark will likely drive rapid progress in the community.
- One-to-Many Temporal Grounding requires AI models to locate multiple disjoint video segments matching a single text query, a task on which current state-of-the-art models score near zero.
- The first OMTG benchmark includes 56k curated samples and introduces Count Accuracy and Effective Temporal F1 (EtF1) metrics specifically designed for multi-segment localization evaluation.
- The proposed model achieves 43.65% EtF1, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively, establishing new state-of-the-art performance.
- Novel caption reward functions using chain-of-thought reasoning over dense video captions guide optimization toward both temporal precision and segment completeness.
- Event cardinality perception, the ability to recognize and locate multiple instances of a concept, emerges as a critical capability gap in current video-language models.