Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Authors: Bo Feng°, Zhengfeng Lai*°, Shiyu Li, Zizhen Wang°, Simon Wang, Ping Huang, Meng Cao
This paper was accepted at the Evaluating the Evolving LLM Lifecycle Workshop at NeurIPS 2025.
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model’s temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The remaining questions are labeled Others. This categorization enables fine-grained evaluation of a video LLM's distinct capabilities. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
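As a rough illustration of how such a categorization pipeline could be organized, the sketch below probes a model under three input conditions (no video, shuffled frames, ordered frames) and assigns a label accordingly. The `model.answer` interface, the exact-match check, and the single-probe decision rule are assumptions for illustration, not the paper's actual pipeline, which may rely on judge models or score margins.

```python
import random

def categorize_question(model, question, frames, answer):
    """Assign a question to one VBenchComp-style category by probing the model
    under three input conditions: no video, shuffled frames, ordered frames."""
    # (1) Strong language prior: correct with no visual input at all.
    if model.answer(question, frames=None) == answer:
        return "LLM-Answerable"

    # (2) Shuffling invariance: correct even when temporal order is destroyed.
    shuffled = random.sample(frames, k=len(frames))
    if model.answer(question, frames=shuffled) == answer:
        return "Semantic"

    # (3) Temporal: only the correctly ordered frames yield the right answer.
    if model.answer(question, frames=frames) == answer:
        return "Temporal"

    # Everything else (e.g., the model fails under all conditions).
    return "Others"
```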
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
September 29, 2025 · Research areas: Computer Vision, Methods and Algorithms · Conference: NeurIPS
We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy,…
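The abstract only names the memory mechanism, so the following is a minimal sketch of what a memory buffer with a round-decayed compression strategy could look like: each dialogue round's frame features are stored, and older rounds are downsampled more aggressively. The class name, the geometric decay factor, and uniform subsampling are assumptions for illustration, not StreamBridge's implementation.

```python
from collections import deque

class RoundDecayedMemory:
    """Illustrative buffer: recent rounds keep full detail, older rounds shrink."""

    def __init__(self, max_rounds=8, decay=0.5):
        self.rounds = deque(maxlen=max_rounds)  # oldest round is dropped first
        self.decay = decay                      # fraction kept per step of age

    def add_round(self, frame_features):
        self.rounds.append(list(frame_features))

    def context(self):
        """Return compressed features in chronological order: the newest round
        keeps everything, each older round keeps a geometrically smaller subset."""
        merged = []
        for age, feats in enumerate(reversed(self.rounds)):  # age 0 = newest
            keep = max(1, int(len(feats) * (self.decay ** age)))
            stride = max(1, len(feats) // keep)
            merged = feats[::stride][:keep] + merged
        return merged
```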
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
June 30, 2025 · Research areas: Computer Vision, Methods and Algorithms · Conference: ICCV
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and…
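Since the abstract is truncated here, the sketch below shows only the general shape of a fine-grained question-generation-and-answering alignment metric: decompose the prompt into atomic questions, answer each against the generated video, and aggregate. The `question_generator` and `video_qa_model` callables and the yes/no aggregation are placeholders, not ETVA's actual components.

```python
def fine_grained_alignment_score(prompt, video, question_generator, video_qa_model):
    """Illustrative alignment score: fraction of prompt-derived yes/no
    questions that the video QA model answers affirmatively for the video."""
    questions = question_generator(prompt)        # e.g. "Is there a red car?"
    if not questions:
        return 0.0
    answers = [video_qa_model(video, q) for q in questions]
    return sum(a.strip().lower() == "yes" for a in answers) / len(questions)
```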