IAMFlow is a training-free memory framework for long-form video generation. It uses a Large Language Model to extract entity attributes and assign Global IDs so that entities can be tracked consistently.

Why does IAMFlow matter?

It overcomes the limitations of coarse-grained attention by explicitly tracking persistent entities, which reduces identity drift and attribute loss in complex narratives.

What are the next steps or applications?

Because it requires no additional training, IAMFlow can be applied directly to existing video generation models. The authors also introduce NarraStream-Bench to standardize the evaluation of narrative consistency in AI video.

IAMFlow: A Training-Free Identity-Aware Memory Framework for Narrative Long Video Generation

To address long-term consistency and memory degradation in autoregressive video generation, we propose IAMFlow, a training-free entity identity-aware memory framework. Conventional methods rely on preset strategies to compress historical frames or on coarse-grained attention to retrieve keyframes, and they struggle with identity drift and attribute loss caused by shifting entity references in prompts. IAMFlow uses an LLM to extract the visual attributes of entities and assign global IDs, and combines this with asynchronous visual verification via a VLM that validates the attributes of rendered frames, enabling explicit entity tracking. To maintain computational efficiency, the framework introduces acceleration strategies including asynchronous visual validation, adaptive prompt conversion, and model quantization. Furthermore, we present NarraStream-Bench, a new benchmark comprising 324 multi-prompt scripts and a three-dimensional evaluation protocol. Experiments demonstrate that IAMFlow surpasses the strongest baseline by 2.56 points on NarraStream-Bench and achieves a 1.39× speedup under 60-second multi-prompt settings, enhancing narrative coherence and generation efficiency in long-form video synthesis.

Background and Context

Autoregressive video generation has made remarkable strides in visual fidelity and interactivity, yet it still struggles to maintain long-term consistency and memory integrity when generating extended narrative sequences. As prompts evolve over time and entity references shift within the narrative, existing solutions frequently fail to preserve character identity, leading to identity drift, character duplication, and attribute loss. Conventional approaches typically rely on predefined strategies to compress historical frames or use coarse-grained implicit attention signals to retrieve keyframes. These methods handle changing entity references poorly, and inaccurate implicit matching often degrades generation quality.

To address these limitations, researchers have introduced IAMFlow, a training-free, identity-aware memory framework that explicitly models and tracks the identities of persistent entities throughout video generation. Rather than relying on implicit matching, IAMFlow keeps entities consistent across prompt transitions by giving each one an explicit, trackable identity. This offers a new technical pathway for long-form video synthesis in dynamic narrative scenarios and a reference point for future research on memory degradation and identity inconsistency in generative video.

Deep Analysis

IAMFlow builds its identity-aware memory from two cooperating models. First, a Large Language Model (LLM) parses each prompt to extract entities along with their visual attributes. The system then assigns a unique Global ID to each entity, which distinguishes characters and objects from one another even when they look similar. This explicit ID assignment avoids the misidentification that occurs in traditional methods when similar features are matched implicitly, and it gives each entity a clear, trackable lineage within the generated video.
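
The Global ID idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Entity` and `EntityRegistry` names, the attribute schema, and the alias-based lookup are all assumptions made for illustration, and the LLM extraction step is replaced by hand-written inputs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count


@dataclass
class Entity:
    """One persistent entity tracked across prompts (hypothetical schema)."""
    global_id: int
    aliases: set[str] = field(default_factory=set)
    attributes: dict[str, str] = field(default_factory=dict)


class EntityRegistry:
    """Maps entity mentions to stable global IDs by explicit alias lookup."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self._by_alias: dict[str, Entity] = {}

    def resolve(self, names: list[str], attributes: dict[str, str]) -> Entity:
        """Return the entity that any of `names` already refers to, else create one."""
        keys = [n.strip().lower() for n in names]
        # The first known alias wins; a real system would need conflict handling.
        entity = next((self._by_alias[k] for k in keys if k in self._by_alias), None)
        if entity is None:
            entity = Entity(global_id=next(self._next_id))
        for k in keys:
            self._by_alias[k] = entity
            entity.aliases.add(k)
        # Earlier attributes are kept; new keys are added and repeated keys are updated.
        entity.attributes.update(attributes)
        return entity


registry = EntityRegistry()
# Stand-in for LLM output on a first prompt: referring names plus visual attributes.
a = registry.resolve(["Mara", "the detective"], {"coat": "red", "hair": "short black"})
b = registry.resolve(["a stray dog"], {"fur": "brown"})
# A later prompt refers to the same character by a different phrase.
c = registry.resolve(["the detective"], {"hat": "grey fedora"})
print(c.global_id, sorted(c.attributes))
```

In this sketch the later mention "the detective" resolves to the same ID as "Mara", and the coat and hair attributes recorded earlier remain attached to that ID even though the later prompt does not repeat them.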

Complementing the LLM-based extraction, the framework integrates a Vision-Language Model (VLM) as an asynchronous verification module. The VLM validates the attributes of rendered video frames against the entity descriptions in the prompts so that deviations can be detected and corrected during generation. Because verification is asynchronous, rendering and attribute validation proceed in parallel rather than one blocking the other. The framework also incorporates adaptive prompt conversion and model quantization to reduce computational load and memory overhead. Together, these acceleration strategies keep precise identity tracking from imposing prohibitive latency or resource consumption.
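
A rough sketch of the asynchronous pattern follows, using a background worker thread so that checking one chunk does not block rendering of the next. The functions `render_chunk` and `verify_chunk` are hypothetical stand-ins for the video generator and the VLM, and the sketch only records mismatches; how IAMFlow acts on a failed check is not described here.

```python
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor


def render_chunk(index: int) -> dict:
    """Stand-in for the video generator; returns the attributes 'seen' in a chunk."""
    # Chunk 2 deliberately drops the coat colour to simulate attribute loss.
    observed = {"coat": "red"} if index != 2 else {}
    return {"index": index, "observed": observed}


def verify_chunk(chunk: dict, expected: dict) -> list[str]:
    """Stand-in for the VLM check; returns the attribute keys that do not match."""
    return [k for k, v in expected.items() if chunk["observed"].get(k) != v]


def generate(num_chunks: int, expected: dict) -> dict[int, list[str]]:
    """Render chunks in order while earlier chunks are verified in the background."""
    pending: dict[int, Future] = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(num_chunks):
            chunk = render_chunk(i)  # rendering never waits on a verification result
            pending[i] = pool.submit(verify_chunk, chunk, expected)
    # Leaving the `with` block waits for all submitted checks to finish.
    return {i: f.result() for i, f in pending.items()}


mismatches = generate(4, {"coat": "red"})
print(mismatches)
```

The design choice illustrated is simply that the render loop submits work and moves on; the verification results are collected later instead of gating each chunk.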

Industry Impact

To rigorously evaluate the performance of IAMFlow, the research team constructed NarraStream-Bench, a new benchmark specifically tailored for narrative streaming video generation tasks. This benchmark comprises 324 multi-prompt scripts covering six distinct narrative dimensions and utilizes a three-dimensional evaluation protocol. This protocol integrates traditional video generation metrics with multimodal large language model-based assessments, providing a comprehensive measure of both narrative coherence and visual quality. The establishment of NarraStream-Bench offers the academic community a standardized platform for evaluating progress in long-form video generation, fostering more consistent and comparable research outcomes.

Experimental results show that IAMFlow achieves state-of-the-art performance on NarraStream-Bench, surpassing the strongest baseline by 2.56 points. In 60-second multi-prompt generation settings, IAMFlow achieves a 1.39× speedup compared to the most efficient baseline methods. Ablation studies further highlight the role of asynchronous verification and explicit ID tracking in improving identity consistency, supporting the effectiveness of the proposed methods in mitigating memory degradation. Because IAMFlow is training-free, researchers can apply it directly to existing video generation models, which lowers technical barriers and computational costs.
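
To put the speedup figure in wall-clock terms, the short calculation below converts a speedup factor into the fraction of time saved. Only the 1.39 factor comes from the reported results; the helper name and the 100-second reference run are hypothetical values chosen for illustration.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved relative to the reference run."""
    return 1.0 - 1.0 / speedup


saved = time_saved_fraction(1.39)
reference_seconds = 100.0  # hypothetical reference run, not a measured value
print(f"time saved: {saved:.1%}")
print(f"accelerated run: {reference_seconds / 1.39:.1f} s")
```

A 1.39× speedup therefore corresponds to roughly 28% less generation time for the same output length.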

Outlook

IAMFlow is relevant to both the open-source community and industrial applications. Because entities are tracked explicitly, its behavior is interpretable, which makes it a candidate tool for fields that require high narrative coherence, such as film production and game development. Consistent long-form narratives are a prerequisite for using AI video generation in professional workflows. The framework's modular design and efficiency optimizations suggest that it could serve as a building block for future work in the field.

Looking ahead, as multimodal models continue to improve and computation becomes cheaper, IAMFlow could become a common component in content creators' toolkits. Its explicit entity tracking and memory management open the way to more complex and natural narrative forms in AI-generated content. By addressing long-term consistency, IAMFlow improves on current video generation and provides a reference point for future work on reliable AI storytelling.

Sources

arXiv