LoMo: Achieving Deeper Vision-Language Fusion via Local Modality Replacement
This paper addresses the significant performance degradation that vision-language models suffer under modality substitution by proposing Local Modality Replacement (LoMo), a lightweight data curation paradigm. The study identifies that the asymmetric roles of text and images in existing training data bias a model's representations toward a specific carrier, preventing it from aligning cross-modal representations of semantically equivalent content. LoMo counters this by reconstructing unimodal prompts into seamlessly interleaved modality sequences: it dynamically selects target text spans and converts them into rendered images, providing supervisory signals for cross-modal representation invariance within a text-visual-text structure. Extensive experiments across 13 multimodal benchmarks show that LoMo substantially enhances overall multimodal reasoning capabilities, yielding improvements of 2.67 and 2.82 percentage points over standard supervised fine-tuning on LLaVA-OneVision-1.5-8B and Qwen3.5-9B, respectively.
Background and Context
Vision-language models (VLMs) have achieved remarkable progress in multimodal understanding and reasoning tasks, yet a critical vulnerability remains largely overlooked: carrier sensitivity. Ideally, replacing a textual query with a semantically equivalent rendered image should not degrade model performance. However, empirical evidence demonstrates that such modality substitution leads to significant performance drops. This study attributes the issue to inherent biases within existing training corpora. In mainstream datasets—including image captioning, visual question answering, optical character recognition, and web-interleaved data—text typically serves as the primary language query, while images function merely as visual references. This asymmetric role allocation creates a preference disparity in how models acquire information across modalities.
The consequence of this data bias is an inability to align cross-modal representations of semantically equivalent content. When the input carrier shifts from text to image, the model's reasoning becomes fragile, which indicates a lack of robust cross-modal alignment. To address this, the authors introduce Local Modality Replacement (LoMo), a lightweight, architecture-agnostic data curation paradigm. LoMo is designed to provide supervisory signals for cross-modal representation invariance between semantically equivalent text and image carriers. By reconstructing unimodal prompts into seamlessly interleaved modality sequences, LoMo pushes the model to learn more robust alignment without altering the underlying neural architecture.
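As a rough illustration of how the carrier sensitivity described above could be quantified, the sketch below compares accuracy on the same questions when they are posed as text and when they are posed as rendered images. The function names and the toy predictions are hypothetical and are not taken from the paper, whose evaluation protocol may differ.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def carrier_sensitivity_gap(
    text_predictions: Sequence[str],
    image_predictions: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Accuracy with text carriers minus accuracy with image carriers,
    in percentage points. A carrier-invariant model would score near 0."""
    text_acc = accuracy(text_predictions, answers)
    image_acc = accuracy(image_predictions, answers)
    return 100.0 * (text_acc - image_acc)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    answers = ["a", "b", "c", "d"]
    from_text = ["a", "b", "c", "d"]    # same questions asked as plain text
    from_image = ["a", "b", "x", "y"]   # same questions asked as rendered images
    print(carrier_sensitivity_gap(from_text, from_image, answers))
```

Reporting the gap in percentage points keeps it directly comparable with the absolute gains quoted for LoMo.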
Deep Analysis
The core technical innovation of LoMo lies in its data generation strategy rather than in changes to the network structure. The method begins by selecting key text spans from existing unimodal prompts. Each selected span is then rendered as an image, and the rendered image takes the place of that span in the original sequence, creating an interleaved "original text, rendered image, following text" structure. This design preserves the original semantic content while introducing the visual modality as an intermediate bridge. Consequently, the model must read part of the prompt from the image while processing the surrounding text, which encourages it to treat both carriers as expressions of the same semantic content.
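The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the span is chosen uniformly at random over whitespace-separated words (the article does not specify the selection policy), and `RenderedImage` is a hypothetical placeholder that stands in for a real text-to-image rendering step.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class RenderedImage:
    """Hypothetical stand-in for an image rendered from `source_text`."""
    source_text: str


Segment = Union[str, RenderedImage]


def select_span(num_words: int, span_len: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a contiguous word span [start, end) uniformly at random.
    Assumed policy; the actual selection rule may be smarter."""
    span_len = max(1, min(span_len, num_words))
    start = rng.randint(0, num_words - span_len)  # randint is inclusive on both ends
    return start, start + span_len


def local_modality_replace(
    prompt: str,
    span_len: int = 4,
    seed: int = 0,
    render: Callable[[str], Segment] = RenderedImage,
) -> List[Segment]:
    """Turn a text-only prompt into a text / image / text sequence."""
    words = prompt.split()
    if not words:
        return []
    start, end = select_span(len(words), span_len, random.Random(seed))
    prefix = " ".join(words[:start])
    target = " ".join(words[start:end])
    suffix = " ".join(words[end:])

    segments: List[Segment] = []
    if prefix:
        segments.append(prefix)          # original text before the span
    segments.append(render(target))      # span replaced by its rendered image
    if suffix:
        segments.append(suffix)          # following text after the span
    return segments


if __name__ == "__main__":
    for seg in local_modality_replace("What is the capital city of France", span_len=3):
        print(repr(seg))
```

Passing the renderer in as a callable keeps the span selection and interleaving logic independent of any particular imaging library, so a real renderer can be swapped in without touching the rest of the sketch.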
This approach effectively mitigates representation misalignment caused by data bias. By exposing the model to diverse modality combinations during training, LoMo encourages the learning of more generalized cross-modal representations. The model reduces its dependency on specific modal carriers, thereby enhancing its generalization capabilities in complex multimodal scenarios. The "text-visual-text" structure provides rich supervisory signals for cross-modal representation invariance. This mechanism ensures that the model does not merely memorize text-image pairs but learns to recognize semantic equivalence regardless of the input format. The dynamic selection of target text spans allows for flexible and context-aware data augmentation, making the training process more efficient and effective.
Industry Impact
Extensive experiments conducted across 13 diverse multimodal benchmarks validate the effectiveness of LoMo. The results consistently show substantial improvements in overall multimodal reasoning performance. Specifically, on the LLaVA-OneVision-1.5-8B model, LoMo achieved a gain of 2.67 percentage points over standard supervised fine-tuning. Similarly, on the Qwen3.5-9B model, the improvement reached 2.82 percentage points. The gains were of similar size on both base models, suggesting that the method generalizes beyond a single model. Ablation studies further revealed the critical role of dynamic text span selection and image rendering strategies in driving these improvements.
From an industry perspective, LoMo offers a low-cost, high-efficiency optimization path for the development of large multimodal models. Its architecture-agnostic nature allows for easy integration into existing training workflows without requiring additional computational resources or complex engineering implementations. This is particularly valuable for the open-source community and industrial practitioners, enabling them to enhance model performance at a lower cost. Furthermore, LoMo highlights the importance of training data quality and diversity. It suggests that future research should focus more on data curation strategies to fully unlock the potential of multimodal models, rather than solely increasing data scale.
Outlook
The implications of LoMo extend beyond immediate performance metrics. It provides a new perspective on solving multimodal alignment problems through data curation rather than architectural modification. This shift in focus is crucial for advancing the field, as it addresses the root cause of carrier sensitivity rather than treating symptoms. The method's success in complex reasoning and fine-grained understanding tasks suggests that richer supervisory signals can significantly boost model robustness. As multimodal systems become increasingly integrated into critical applications such as autonomous driving, medical diagnosis, and intelligent assistants, the need for robust cross-modal alignment becomes paramount.
LoMo serves as a foundational step toward building more resilient and intelligent multimodal systems. By challenging the status quo of data bias and carrier dependency, it encourages a reevaluation of how multimodal learning is approached. Future work may explore extending LoMo to other modalities or integrating it with other advanced training techniques. The emphasis on data quality and diversity aligns with broader industry trends toward more efficient and sustainable AI development. Ultimately, LoMo represents not only a technical innovation but also a reflection on the nature of multimodal learning, paving the way for more capable and reliable AI systems.