Scale Can't Overcome Pragmatics: Why VLMs Fail at Spatial Reasoning Despite Web-Scale Data

VLMs consistently underperform on spatial reasoning, counting, temporal relations, and negation. This paper identifies a cause more fundamental than insufficient scale: reporting bias. Humans naturally omit obvious information when describing visual content, leaving training data severely lacking in these four categories.

An analysis of training corpora from OpenCLIP, LLaVA-1.5, and Molmo shows that these reasoning signals remain scarce even at billions of samples. Experiments confirm that scaling model size, scaling data size, and multilingual training all fail to produce emergence. The only effective approach is intentionally collecting annotations that make this tacit information explicit. Conclusion: deliberate data curation matters more than blindly pursuing scale.

Why do Vision-Language Models (VLMs) perform so poorly at spatial reasoning, counting, temporal relations, and negation? The common explanation is "not enough scale." This paper proposes a more fundamental cause: **reporting bias**.

What Is Reporting Bias?

When describing visual content, humans automatically omit "obvious" information. We caption a stadium photo "at the game today!" not "37 people standing behind a green field in bleachers." This omission is a fundamental feature of natural language, but it is fatal for VLM training: a model can never learn what is never annotated.

Data Analysis

The team analyzed training data from OpenCLIP, LLaVA-1.5, and Molmo:

  • **Spatial relations** (above/below/left/right): extremely rare in captions
  • **Counting information**: precise numbers almost never appear
  • **Temporal relations** (before/after): severely underrepresented
  • **Negation**: virtually absent

Even at billions of samples with synthetic data generation, these four categories remain scarce.
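The kind of corpus audit described above can be approximated with simple keyword matching. The sketch below is illustrative only: the keyword lists are hypothetical and far cruder than whatever taxonomy the paper actually uses, but it shows how one might estimate how rarely each category appears in a caption corpus.

```python
import re
from collections import Counter

# Hypothetical keyword lists for the four under-reported categories;
# the paper's actual category definitions may differ.
CATEGORIES = {
    "spatial": {"above", "below", "left", "right", "behind", "beneath"},
    "counting": {"one", "two", "three", "four", "five", "six", "seven"},
    "temporal": {"before", "after", "during", "then", "while"},
    "negation": {"no", "not", "without", "never", "none"},
}

def category_frequencies(captions):
    """Fraction of captions containing at least one keyword per category."""
    hits = Counter()
    for caption in captions:
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        for name, keywords in CATEGORIES.items():
            if tokens & keywords:
                hits[name] += 1
    total = len(captions)
    return {name: hits[name] / total for name in CATEGORIES}

captions = [
    "at the game today!",          # reporting bias: no spatial/count info
    "a dog on the grass",
    "two cats sitting left of a sofa",
]
print(category_frequencies(captions))
```

Running even a crude probe like this over web-scale caption sets is how the scarcity claim becomes measurable; lexicon-free methods (e.g., a classifier over captions) would tighten the estimate.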

Key Findings

1. VLMs consistently underperform on these four reasoning types

2. **Scaling model size doesn't help** — capabilities don't "emerge" at scale

3. **Scaling data size doesn't help** — web-scale data inherently lacks this information

4. **Multilingual training doesn't help** — reporting bias is universal across languages

5. **Intentional annotation works** — targeted collection of spatial/counting labels significantly improves performance

Implications

Don't count on scale to solve everything. The next VLM breakthrough may come from smarter data curation, not bigger models or more data.

Warning for Multimodal AI

This paper carries implications for the entire multimodal AI field. Training data for multimodal models comes primarily from internet image-text pairs, and reporting bias is an inherent feature of internet content. Simply scaling up web crawling won't solve it; dedicated data collection strategies are needed. For teams developing vision-language models, this makes training-data quality an essential consideration, not an afterthought.
