OmniAgent: A Native Proactively Perceiving and Reasoning Universal Multimodal Understanding Agent

To address the limitations of passive models, whose computational cost scales linearly with video length, and the reliance of existing interactive frameworks on global pre-scanning, this paper presents OmniAgent, described as the first native universal multimodal agent based on partially observable Markov decision processes (POMDP). OmniAgent reformulates video understanding as an iterative "observe-think-act" loop, selectively extracting audio-visual cues via on-demand actions and storing them in persistent text memory, thereby decoupling reasoning complexity from raw video duration. For training, the paper introduces agentic supervised fine-tuning (Agentic SFT) and agentic reinforcement learning with a TAURA mechanism that leverages turn-level entropy for credit assignment. Experiments show that OmniAgent achieves state-of-the-art results across ten benchmarks, outperforming Qwen2.5-VL-72B (roughly 10× larger) on LVBench with only 7B parameters, and that it exhibits strong positive test-time scaling.

Background and Context

Long-form video understanding has historically been constrained by the computational inefficiency of passive multimodal architectures. Traditional models operate under a "receive-all" paradigm, in which the system uniformly processes every frame of a video regardless of the query's complexity or relevance. Computational cost therefore scales linearly with video duration, creating a significant bottleneck for deploying high-fidelity analysis in real-world scenarios where videos can span hours. Recent interactive frameworks have attempted to mitigate this by introducing user- or model-driven interaction, but these solutions often rely on global pre-scanning of the entire video. Consequently, the context window requirements and associated processing costs remain tightly coupled to the raw length of the media, and the fundamental tension between analytical precision and operational efficiency goes unresolved.
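To make the linear scaling concrete, here is a minimal sketch of a token-count model for a passive pipeline. The function name, the sampling rate, and the tokens-per-frame value are illustrative assumptions, not figures from the paper.

```python
def passive_visual_tokens(duration_s: float, fps: float = 1.0,
                          tokens_per_frame: int = 256) -> int:
    """Visual tokens a "receive-all" model must process for a whole video.

    fps and tokens_per_frame are hypothetical values chosen for illustration.
    """
    n_frames = int(duration_s * fps)
    return n_frames * tokens_per_frame


if __name__ == "__main__":
    # Doubling the duration doubles the token count, whatever the question is.
    for minutes in (1, 10, 60, 120):
        tokens = passive_visual_tokens(minutes * 60)
        print(f"{minutes:>4} min -> {tokens:,} visual tokens")
```

Under these assumed values, the token count depends only on the video's length, which is why a fixed context window becomes the limiting factor for hour-long inputs.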

To address these structural limitations, researchers have introduced OmniAgent, a framework that recasts video comprehension as the task of a native universal multimodal agent. OmniAgent is presented as the first system to formalize video understanding as a Partially Observable Markov Decision Process (POMDP). This shift moves the model away from passive data consumption toward active, goal-directed exploration. By adopting an iterative "observe-think-act" loop, OmniAgent mimics human perceptual strategies and proactively explores video content on demand. It selectively extracts critical audio-visual cues, which are then distilled and stored in a persistent text memory. Because reasoning operates over this memory rather than over the full video, the design decouples the complexity of reasoning from the raw duration of the video, enabling efficient deep understanding even in constrained computational environments.
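The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the action names ("observe", "answer"), the TextMemory class, and the run_agent, perceive, and policy names are all hypothetical, and the "video" is a list of pre-written segment descriptions standing in for real audio-visual perception.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# An action is ("observe", segment_index) or ("answer", text). Both action
# names are hypothetical; the paper's actual action space is not given here.
Action = Tuple[str, object]


@dataclass
class TextMemory:
    """Persistent text memory: only distilled notes are kept, never raw frames."""
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)


def run_agent(question: str,
              perceive: Callable[[int], str],
              policy: Callable[[str, List[str]], Action],
              max_turns: int = 8) -> Tuple[str, TextMemory]:
    """Run an observe-think-act loop until the policy answers or turns run out."""
    memory = TextMemory()
    for _ in range(max_turns):
        kind, arg = policy(question, memory.notes)  # think: choose next action
        if kind == "answer":
            return str(arg), memory
        cue = perceive(int(arg))                    # act: inspect one segment
        memory.write(f"segment {arg}: {cue}")       # observe: store distilled text
    return "unknown", memory


# Toy stand-in for a video: one text description per segment.
segments = ["a kitchen", "a dog barks", "a red car arrives", "credits roll"]


def toy_policy(question: str, notes: List[str]) -> Action:
    """Scan segments in order and answer once a note mentions a car."""
    for note in notes:
        if "car" in note:
            return ("answer", note)
    return ("observe", len(notes))


answer, memory = run_agent("When does the car arrive?",
                           lambda i: segments[i], toy_policy)
print(answer)
print(len(memory.notes), "segments inspected out of", len(segments))
```

In this toy run the agent stops after the third segment and never reads the fourth, which is the sense in which the work done depends on the question rather than on the length of the input.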

Deep Analysis

The technical efficacy of OmniAgent is underpinned by a sophisticated training regimen designed to instill active perception capabilities from the ground up. A cornerstone of this methodology is Agentic Supervised Fine-Tuning (Agentic SFT), which utilizes best-of-N trajectory synthesis combined with a rigorous two-stage quality control process. This approach provides the model with high-fidelity learning signals, enabling it to acquire the nuanced skills required for proactive exploration without relying on pre-existing global context. By training on optimized trajectories rather than raw, uncurated video streams, the model learns to prioritize information density over temporal completeness, fundamentally altering how it processes visual and auditory inputs.
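A minimal sketch of best-of-N selection with a two-stage filter follows. The source does not specify what the two quality-control stages check, so the stages here (a well-formedness check followed by a score threshold) are assumptions, as are all function and parameter names.

```python
import random
from typing import Callable, List, Optional

Trajectory = List[str]  # hypothetical: a trajectory as a list of action names


def best_of_n(sample: Callable[[random.Random], Trajectory],
              is_well_formed: Callable[[Trajectory], bool],
              score: Callable[[Trajectory], float],
              n: int = 8,
              min_score: float = 0.5,
              seed: int = 0) -> Optional[Trajectory]:
    """Sample n trajectories, filter in two stages, and keep the best survivor.

    Stage 1 (assumed): drop trajectories that fail a format check.
    Stage 2 (assumed): drop trajectories scoring below min_score.
    Returns None if nothing survives, so no low-quality sample is kept.
    """
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    valid = [t for t in candidates if is_well_formed(t)]   # stage 1
    scored = [(score(t), t) for t in valid]
    kept = [(s, t) for s, t in scored if s >= min_score]   # stage 2
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]


def ends_with_answer(t: Trajectory) -> bool:
    """Toy format check: a trajectory must terminate with an answer action."""
    return bool(t) and t[-1] == "answer"


def shorter_is_better(t: Trajectory) -> float:
    """Toy quality score that prefers trajectories with fewer turns."""
    return 1.0 / len(t)
```

A call such as `best_of_n(sampler, ends_with_answer, shorter_is_better, n=4)` returns one trajectory to use as a fine-tuning target, or `None` when every candidate is rejected.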

The agent's decision-making is further improved by Agentic Reinforcement Learning with the TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage) mechanism, which targets credit assignment in long-horizon tasks. TAURA uses turn-level entropy to quantify the model's uncertainty at each step of the interaction and directs rewards toward "pivotal discovery turns", the moments at which the agent successfully identifies and extracts key information. This fine-grained reward structure reinforces actions that genuinely contribute to understanding the video's narrative or technical details, rather than merely rewarding a larger number of inference steps. As a result, OmniAgent can adjust its focus dynamically, distilling information-dense textual representations while ignoring redundant or low-value segments.
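The exact TAURA formula is not given in this summary, so the following is only an illustrative sketch of entropy-based advantage rescaling. Both the direction of the weighting (higher-entropy turns receive a larger share of the trajectory-level advantage) and the normalization by mean entropy are assumptions made for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def turn_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Turn-level entropy: mean token entropy over the tokens of one turn."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def rescale_advantage(advantage: float,
                      turn_entropies: Sequence[float],
                      eps: float = 1e-8) -> List[float]:
    """Spread one trajectory-level advantage across turns.

    Assumed rule: each turn's weight is its entropy divided by the mean
    entropy, so the weights average to about 1 and uncertain turns get
    a larger share of the credit (or blame, if the advantage is negative).
    """
    mean_h = sum(turn_entropies) / len(turn_entropies)
    return [advantage * h / (mean_h + eps) for h in turn_entropies]


if __name__ == "__main__":
    # Two toy turns: one confident, one uncertain.
    confident = [[0.99, 0.01], [0.98, 0.02]]
    uncertain = [[0.5, 0.5], [0.6, 0.4]]
    entropies = [turn_entropy(confident), turn_entropy(uncertain)]
    print(rescale_advantage(1.0, entropies))
```

Whatever the real weighting rule is, the design idea is the same: a single end-of-episode reward is redistributed per turn so that not every turn in a long trajectory is updated equally.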

Industry Impact

The implications of OmniAgent extend beyond mere performance metrics, offering a new paradigm for resource-constrained multimodal applications. By demonstrating that active perception can decouple reasoning complexity from video length, the framework provides a viable pathway for deploying high-performance video analysis on edge devices or in environments with limited bandwidth and storage. This efficiency gain is particularly critical for industries such as surveillance, archival retrieval, and real-time broadcast monitoring, where processing hours of footage in near real-time is essential. The shift from passive processing to active exploration suggests that future multimodal systems need not scale linearly with data volume, potentially reducing the carbon footprint and hardware costs associated with large-scale video analytics.

Furthermore, OmniAgent's results challenge the prevailing assumption that larger parameter counts are synonymous with superior understanding. The model's ability to outperform a significantly larger architecture highlights the importance of algorithmic efficiency and training methodology over raw scale. This finding is likely to stimulate research interest in agentic frameworks and memory-augmented architectures across the broader AI community. It encourages developers to focus on how models interact with data dynamically, rather than on how much data they can ingest statically. The persistent text memory mechanism also opens new avenues for building efficient, searchable multimodal knowledge bases, in which long videos are compressed into concise, semantically rich summaries that aim to retain the critical factual details.

Outlook

Empirical evaluations position OmniAgent as a state-of-the-art open-source model for multimodal understanding. Tested across ten benchmarks, including VideoMME and the challenging LVBench, it consistently delivered top-tier performance. Most notably, on LVBench the 7-billion-parameter OmniAgent scored 50.5%, 3.2 percentage points above the 47.3% of Qwen2.5-VL-72B, a model with roughly ten times the parameter count. This result supports the effectiveness of the POMDP-based active perception framework and also points to a strong positive test-time scaling effect: as the number of inference rounds increases, OmniAgent's performance continues to improve, indicating that the agent can use additional exploration steps to gather further evidence from the video.

Looking forward, the combination of TAURA and Agentic SFT offers a template for training autonomous agents in complex, dynamic environments. The ability to adaptively manage uncertainty and credit assignment will likely influence the development of agents in other domains that require sequential decision-making, such as robotic manipulation and autonomous driving. As the community refines these mechanisms, we can expect more small, efficient models that reach strong performance through active reasoning rather than brute-force computation. OmniAgent is a notable step in this direction, suggesting that intelligent, selective attention can be more valuable than comprehensive, passive data ingestion.
