What is LoMo (Local Modality Replacement)?

A lightweight data curation method that dynamically selects text spans in a unimodal prompt, renders them as images, and interleaves them with the remaining text to teach cross-modal representation invariance.

Why do vision-language models degrade during modality substitution?

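Text and images play asymmetric roles in existing training data, which gives models a carrier-sensitivity bias. They fail to align representations of semantically equivalent content across modalities, so reasoning degrades when the same content arrives in a different carrier.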

What impact does LoMo have on AI development?

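Across 13 multimodal benchmarks, LoMo improves overall multimodal reasoning over standard supervised fine-tuning by 2.67 percentage points on LLaVA-OneVision-1.5-8B and 2.82 percentage points on Qwen3.5-9B. Its architecture-agnostic design allows low-cost integration, which may ease adoption in deployed systems.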

LoMo: Deeper Vision-Language Fusion via Local Modality Replacement

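This paper addresses the marked performance drop that vision-language models show under modality substitution and proposes a lightweight data curation paradigm called Local Modality Replacement (LoMo). The authors argue that the asymmetric roles of text and images in existing training data give models a carrier-sensitivity bias, leaving them unable to align cross-modal representations of semantically equivalent content. LoMo restructures unimodal prompts into seamlessly interleaved modality sequences: it dynamically selects target text spans and renders them as images, so that the resulting "text-visual-text" structure supplies a supervision signal for cross-modal representation invariance. Extensive experiments on 13 multimodal benchmarks show that LoMo significantly improves overall multimodal reasoning, with gains over standard supervised fine-tuning of 2.67 percentage points on LLaVA-OneVision-1.5-8B and 2.82 percentage points on Qwen3.5-9B.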
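
The restructuring step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and type names (`local_modality_replace`, `RenderedSpan`, `stub_render`) are hypothetical, the span is chosen uniformly at random because the summary does not say how spans are selected, and the renderer is a stub that only records the text it would draw. A real pipeline would rasterize the span to pixels at that point, for example with an image library.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class RenderedSpan:
    """Hypothetical stand-in for an image containing rendered text."""
    source_text: str


Segment = Union[str, RenderedSpan]


def stub_render(text: str) -> RenderedSpan:
    # Placeholder renderer: a real pipeline would rasterize `text` here.
    return RenderedSpan(source_text=text)


def local_modality_replace(
    prompt: str,
    span_len: int = 3,
    render: Callable[[str], RenderedSpan] = stub_render,
    rng: Optional[random.Random] = None,
) -> List[Segment]:
    """Swap one contiguous word span for its rendered form.

    Returns the interleaved sequence [text, rendered span, text],
    omitting a text part when the span touches either end of the prompt.
    Assumption: the span is picked uniformly at random.
    """
    rng = rng or random.Random()
    words = prompt.split()
    if not words:
        return []
    # Clamp the span length to the prompt length.
    span_len = max(1, min(span_len, len(words)))
    start = rng.randrange(len(words) - span_len + 1)
    end = start + span_len

    before = " ".join(words[:start])
    target = " ".join(words[start:end])
    after = " ".join(words[end:])

    segments: List[Segment] = []
    if before:
        segments.append(before)
    segments.append(render(target))
    if after:
        segments.append(after)
    return segments


def flatten(segments: List[Segment]) -> str:
    """Recover the wording to check that the content itself is unchanged."""
    return " ".join(
        s.source_text if isinstance(s, RenderedSpan) else s for s in segments
    )
```

Calling `local_modality_replace("What is the derivative of x squared plus one", rng=random.Random(0))` yields a short list in which exactly one element is a `RenderedSpan`. `flatten` on that list returns the original prompt, which reflects the property the method relies on: the semantic content stays fixed while only its carrier changes.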

Sources

arXiv