IAMFlow is a training-free memory framework for long-form video generation. It uses Large Language Models to extract entity attributes and assign global IDs so that entities can be tracked consistently.

Why does IAMFlow matter?

It addresses the limitations of coarse-grained attention retrieval by explicitly tracking persistent entities, which reduces identity drift and attribute loss in complex narratives.

What are the next steps or applications?

Because it requires no additional training, it integrates directly into existing models. The work also introduces NarraStream-Bench to standardize the evaluation of narrative consistency in AI-generated video.

IAMFlow: A Training-Free Identity-Aware Memory Framework for Narrative Long Video Generation

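To address long-term consistency and memory degradation in autoregressive video generation, this paper proposes IAMFlow, a training-free, entity-identity-aware memory framework. Conventional methods either compress historical frames with predefined strategies or retrieve key frames using coarse-grained attention, and they struggle with the identity drift and attribute loss that arise when prompts refer to the same entity in different ways. IAMFlow uses an LLM to extract each entity's visual attributes and assign it a global ID, and uses a VLM to asynchronously verify the attributes of rendered frames, which gives explicit entity tracking. To preserve computational efficiency, the framework introduces acceleration strategies including asynchronous visual verification, adaptive prompt transition, and model quantization. The paper also builds the NarraStream-Bench benchmark, which comprises 324 multi-prompt scripts and a three-dimensional evaluation protocol. Experiments show that IAMFlow outperforms the strongest baseline by 2.56 points on NarraStream-Bench and achieves a 1.39× speedup in the 60-second multi-prompt setting, significantly improving both narrative coherence and generation efficiency in long video generation.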
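The idea of a global-ID entity memory can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Entity`, `EntityMemory`, `observe`, and `describe` are hypothetical, and the `same_as` argument stands in for the coreference decision that the abstract attributes to an LLM. The VLM verification step is omitted. The sketch only shows how a changed mention of the same entity can resolve to one persistent ID while keeping its first-seen attributes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count


@dataclass
class Entity:
    """One tracked entity (hypothetical structure, for illustration only)."""
    global_id: int
    name: str
    attributes: dict[str, str] = field(default_factory=dict)
    aliases: set[str] = field(default_factory=set)


class EntityMemory:
    """Hypothetical registry mapping prompt mentions to persistent global IDs."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self._entities: dict[int, Entity] = {}
        self._alias_index: dict[str, int] = {}

    def observe(self, mention: str, attributes: dict[str, str],
                same_as: str | None = None) -> int:
        """Register a mention and return its global ID.

        `same_as` is an assumed stand-in for an LLM's judgment that this
        mention refers to an entity already seen under another name.
        """
        key = mention.strip().lower()
        gid = self._alias_index.get(key)
        if gid is None and same_as is not None:
            gid = self._alias_index.get(same_as.strip().lower())
        if gid is None:
            # Unknown entity: allocate a fresh global ID.
            gid = next(self._next_id)
            self._entities[gid] = Entity(gid, mention)
        entity = self._entities[gid]
        entity.aliases.add(key)
        self._alias_index[key] = gid
        # Keep first-seen values so a later prompt cannot silently overwrite them.
        for attr_name, value in attributes.items():
            entity.attributes.setdefault(attr_name, value)
        return gid

    def describe(self, gid: int) -> str:
        """Render the stored attributes as text that could be added to a prompt."""
        entity = self._entities[gid]
        attrs = ", ".join(f"{k}: {v}" for k, v in sorted(entity.attributes.items()))
        return f"{entity.name} [{attrs}]"


memory = EntityMemory()
a = memory.observe("a woman in a red coat", {"coat": "red", "hair": "short black"})
b = memory.observe("the detective", {"badge": "silver"}, same_as="a woman in a red coat")
c = memory.observe("a stray dog", {"fur": "brown"})
print(a == b, c, memory.describe(a))
```

In this sketch, "the detective" resolves to the same global ID as "a woman in a red coat", so the coat and hair attributes remain available when a later prompt refers to the character differently. A real system would additionally need to reconcile these stored attributes with what was actually rendered, which is the role the abstract assigns to VLM verification.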

Sources

arXiv