Agents match human accuracy, yet their navigation is statistically indistinguishable from random search.

Reward-hallucination link?

Higher rewards paradoxically increase hallucination.

Document Agents' Reasoning Is Overestimated — MADQA Shows Navigation Is Mostly Luck

Document Agents Navigate by Luck: MADQA's Uncomfortable Truth

MADQA, built with Classical Test Theory, reveals that top multimodal agents match human accuracy but that their navigation is statistically indistinguishable from random search. They find answers by searching enough locations, not by understanding document structure.

Implications: (1) Efficiency: agents consume far more tokens than necessary; (2) Reliability: random search fails on complex documents; (3) Evaluation: accuracy metrics mask reasoning quality deficits.

Also: IndexCache achieves 1.82x prefill speedup. And higher reward scores paradoxically increase hallucination, which is important for RLHF training strategies.
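To make "statistically indistinguishable from random search" concrete, here is a minimal sketch of one way such a comparison could be run. It is not MADQA's actual procedure, which the summary above does not describe. It assumes each document has exactly one answer page and that a random searcher opens pages in uniformly random order without repeats, so the number of pages it opens is uniform on 1..n. The function names and the example numbers are hypothetical.

```python
import random


def expected_random_visits(n_pages: int) -> float:
    """Expected pages opened by uniform random search (no repeats)
    before reaching the single answer page: (n + 1) / 2."""
    return (n_pages + 1) / 2


def random_baseline_p_value(doc_sizes, agent_visits, n_sim=10_000, seed=0):
    """One-sided Monte Carlo p-value for 'the agent beats random search'.

    doc_sizes    -- page count of each document (assumed: one answer page each)
    agent_visits -- pages the agent opened before finding each answer
    Returns the share of simulated random searches that needed no more
    total page visits than the agent did.
    """
    rng = random.Random(seed)
    observed = sum(agent_visits)
    hits = 0
    for _ in range(n_sim):
        # Position of the answer page in a random page order is uniform on 1..n.
        simulated = sum(rng.randint(1, n) for n in doc_sizes)
        if simulated <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    sizes = [20] * 30                      # hypothetical: 30 documents, 20 pages each
    lucky_agent = [10] * 30                # opens about half the pages each time
    structured_agent = [2] * 30            # goes almost straight to the answer
    print(random_baseline_p_value(sizes, lucky_agent))
    print(random_baseline_p_value(sizes, structured_agent))
```

Under these assumptions, a large p-value means the agent's visit counts give no evidence that it navigates better than chance, even if its final answers are correct. This is the gap between accuracy and reasoning quality that the implications above point to.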

In-Depth Analysis and Industry Outlook

From a broader perspective, this development reflects the accelerating trend of AI technology transitioning from laboratories to industrial applications. Industry analysts widely agree that 2026 will be a pivotal year for AI commercialization. On the technical front, large model inference efficiency continues to improve while deployment costs decline, enabling more SMEs to access advanced AI capabilities. On the market front, enterprise expectations for AI investment returns are shifting from long-term strategic value to short-term quantifiable gains.

However, the rapid proliferation of AI also brings new challenges: increasing complexity of data privacy protection, growing demands for AI decision transparency, and difficulties in cross-border AI governance coordination. Regulatory authorities across multiple countries are closely monitoring these developments, attempting to balance innovation promotion with risk prevention. For investors, identifying AI companies with truly sustainable competitive advantages has become increasingly critical as the market transitions from hype to value validation.

From a supply chain perspective, the upstream infrastructure layer is experiencing consolidation and restructuring, with leading companies expanding competitive barriers through vertical integration. The midstream platform layer sees a flourishing open-source ecosystem that lowers barriers to AI application development. The downstream application layer shows accelerating AI penetration across traditional industries including finance, healthcare, education, and manufacturing.

Additionally, talent competition has become a critical bottleneck for AI industry development. The global war for top AI researchers is intensifying, with governments worldwide introducing policies to attract AI talent. Industry-academia collaborative innovation models are being promoted globally, with the potential to accelerate the industrialization of AI technology.