StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Authors: Haibo Wang‡‡, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge†, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. At the same time, it achieves competitive or superior performance on standard video understanding benchmarks.
† Fudan University
‡‡ Work done during Apple internship
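As a rough illustration of the round-decayed compression idea described in the abstract, here is a minimal Python sketch: recent dialogue rounds keep their visual tokens at full resolution, while older rounds are compressed progressively. The class name, the `decay` factor, and the uniform-subsampling compressor are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def compress_tokens(tokens, keep_ratio):
    """Uniformly subsample a round's visual tokens (a stand-in for
    whatever pooling/merging compressor the model actually uses)."""
    if keep_ratio >= 1.0:
        return tokens
    step = max(1, round(1 / keep_ratio))
    return tokens[::step]

class RoundDecayedBuffer:
    """Hypothetical memory buffer for multi-turn streaming: the newest
    rounds keep full token resolution; older rounds decay geometrically."""

    def __init__(self, max_rounds=8, decay=0.5):
        self.decay = decay                      # per-round compression factor (assumed)
        self.rounds = deque(maxlen=max_rounds)  # (visual_tokens, text_tokens) per round

    def add_round(self, visual_tokens, text_tokens):
        self.rounds.append((visual_tokens, text_tokens))

    def build_context(self):
        """Assemble the interleaved context, compressing by round age."""
        context = []
        for age, (vis, txt) in enumerate(reversed(self.rounds)):
            keep = self.decay ** age  # age 0 = most recent round, kept intact
            context.append((compress_tokens(vis, keep), txt))
        return list(reversed(context))  # restore chronological order
```

In this sketch, a round that is three turns old keeps only `0.5 ** 3 = 12.5%` of its visual tokens, which is one plausible way a fixed-size context can span many interaction rounds.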
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
August 22, 2025 · Research areas: Computer Vision, Methods and Algorithms · Conference: COLM
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B),…
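The two-stream SlowFast mechanism mentioned above can be sketched in a few lines of Python: a slow pathway samples few frames at full spatial detail, while a fast pathway covers every frame with heavy spatial pooling. The function name, strides, and pooling choice below are assumptions for illustration, not SF-LLaVA-1.5's actual design.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_stride=4, fast_pool=4):
    """Sketch of two-stream token selection.
    frame_feats: (T, H, W, C) per-frame visual features."""
    T, H, W, C = frame_feats.shape

    # Slow pathway: every `slow_stride`-th frame, all spatial tokens kept.
    slow = frame_feats[::slow_stride].reshape(-1, C)

    # Fast pathway: every frame, spatially average-pooled for cheap
    # temporal coverage.
    fast = (
        F.avg_pool2d(frame_feats.permute(0, 3, 1, 2),  # (T, C, H, W)
                     kernel_size=fast_pool)
        .flatten(2)          # (T, C, H'*W')
        .transpose(1, 2)     # (T, H'*W', C)
        .reshape(-1, C)
    )

    # Concatenate both streams into one token sequence for the LLM.
    return torch.cat([slow, fast], dim=0)
```

The token-efficiency argument is visible in the shapes: with the assumed defaults, the slow stream contributes `T/4 * H*W` tokens and the fast stream `T * (H*W)/16`, far fewer than the `T * H*W` of dense sampling.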
MM-Ego: Towards Building Egocentric Multimodal LLMs
April 11, 2025 · Research areas: Computer Vision, Speech and Natural Language Processing · Conference: ICLR
This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we automatically generate 7M high-quality QA samples for Ego4D egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is one of the largest egocentric QA datasets…
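A hypothetical sketch of the narration-to-QA conversion the abstract describes might look as follows; the prompt wording and the `llm_generate` callable are placeholders, not MM-Ego's actual pipeline.

```python
def narration_to_qa(narrations, llm_generate):
    """Convert timestamped human narrations (e.g. from Ego4D) into QA
    pairs with an off-the-shelf text generator. `llm_generate` is any
    prompt -> text callable; the prompt format is an assumption."""
    qa_pairs = []
    for start, end, text in narrations:
        prompt = (
            "Given this first-person activity description, write one "
            f"question and its answer about the video segment:\n{text}"
        )
        response = llm_generate(prompt)  # e.g. "Q: ... A: ..."
        qa_pairs.append({"start": start, "end": end, "qa": response})
    return qa_pairs
```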