ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Authors: Kaisi Guan†, Zhengfeng Lai, Yuchong Sun†, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song†
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore only produce coarse-grained scores without fine-grained alignment details, and therefore fail to align with human preference. To address this limitation, we propose ETVA, a novel method for Evaluation of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All code and datasets will be publicly available soon.
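The abstract outlines the ETVA pipeline at a high level: scene-graph-based question generation, knowledge retrieval, and multi-stage question answering with a video LLM. The sketch below is a minimal illustration of that general idea using generic LLM callables; the function and class names (generate_atomic_questions, etva_score, AtomicQuestion) and the prompt wording are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of an ETVA-style evaluation loop, assuming generic callables
# for a text LLM and a video LLM. Everything here is illustrative, not the
# paper's actual API or prompts.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicQuestion:
    text: str       # a yes/no question about one atomic fact in the prompt
    category: str   # e.g. "object", "attribute", "action", "physics"


def generate_atomic_questions(prompt: str, llm: Callable[[str], str]) -> List[AtomicQuestion]:
    """Ask an LLM to parse the prompt into a scene graph and emit atomic yes/no questions."""
    instruction = (
        "Parse the following text-to-video prompt into a semantic scene graph "
        "(objects, attributes, relations, actions), then write one yes/no question "
        "per atomic fact, one per line, formatted as 'category: question'.\n\n"
        "Prompt: " + prompt
    )
    questions = []
    for line in llm(instruction).strip().splitlines():
        if ":" in line:
            category, text = line.split(":", 1)
            questions.append(AtomicQuestion(text.strip(), category.strip().lower()))
    return questions


def etva_score(
    prompt: str,
    video_path: str,
    llm: Callable[[str], str],
    video_llm: Callable[[str, str], str],
) -> float:
    """Return the fraction of atomic questions the video LLM answers with 'yes'."""
    questions = generate_atomic_questions(prompt, llm)
    yes_count = 0
    for q in questions:
        # Knowledge-augmented answering: retrieve common-sense knowledge relevant
        # to the question, then let the video LLM reason before a final yes/no verdict.
        knowledge = llm("List common-sense facts (e.g. physical laws) relevant to: " + q.text)
        verdict = video_llm(
            video_path,
            "Background knowledge:\n" + knowledge
            + "\n\nThink step by step, then answer strictly 'yes' or 'no': " + q.text,
        )
        if verdict.strip().lower().startswith("yes"):
            yes_count += 1
    return yes_count / max(len(questions), 1)
```

In this sketch the final score is simply the fraction of atomic questions answered affirmatively, averaged over questions; any calibration against human ratings or per-category breakdowns would sit on top of this loop.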
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
October 27, 2025. Research areas: Computer Vision, Methods and Algorithms. Workshop at NeurIPS.
This paper was accepted at the Evaluating the Evolving LLM Lifecycle Workshop at NeurIPS 2025.
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model’s temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger…
STIV: Scalable Text and Image Conditioned Video Generation
August 1, 2025. Research areas: Computer Vision, Methods and Algorithms.
The field of video generation has made remarkable advances, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text- and image-conditioned video generation method named STIV…