What AI systems were evaluated in this study?

The study systematically evaluated six commercial AI chatbots, including Gemini and Grok, on their accuracy in processing multilingual breaking news.

Does over 90% accuracy mean these systems are fully reliable?

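Not necessarily. The 90%+ figure applies to the best models on multiple-choice questions; in free-response mode accuracy drops by 11% to 17%. High aggregate accuracy also masks regional bias and retrieval failures: over 70% of errors stem from failing to find the correct source rather than from faulty reasoning, and some models accept fabricated facts up to 64% of the time.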
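
One way to see how a headline accuracy number can hide these gaps is to break results down by language and by error cause. The sketch below is only illustrative: the record fields (`language`, `correct`, `error_cause`), the helper names, and the toy rows are all assumptions made for this example, not the study's data or schema.

```python
from collections import defaultdict

# Toy, hand-made records for illustration only; NOT the study's data.
# The field names (language, correct, error_cause) are hypothetical.
records = [
    {"language": "en", "correct": True,  "error_cause": None},
    {"language": "en", "correct": True,  "error_cause": None},
    {"language": "en", "correct": True,  "error_cause": None},
    {"language": "en", "correct": False, "error_cause": "reasoning"},
    {"language": "hi", "correct": True,  "error_cause": None},
    {"language": "hi", "correct": False, "error_cause": "retrieval"},
    {"language": "hi", "correct": False, "error_cause": "retrieval"},
    {"language": "hi", "correct": False, "error_cause": "retrieval"},
]


def accuracy_by_language(rows):
    """Return {language: fraction of rows answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in rows:
        totals[r["language"]] += 1
        hits[r["language"]] += int(r["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}


def retrieval_error_share(rows):
    """Return the fraction of wrong answers labeled as retrieval failures."""
    errors = [r for r in rows if not r["correct"]]
    if not errors:
        return 0.0
    return sum(r["error_cause"] == "retrieval" for r in errors) / len(errors)


overall = sum(r["correct"] for r in records) / len(records)
per_lang = accuracy_by_language(records)
share = retrieval_error_share(records)

print(f"overall accuracy: {overall:.2f}")
print(f"per-language accuracy: {per_lang}")
print(f"share of errors due to retrieval: {share:.2f}")
```

Reporting the per-language breakdown and the error-cause share alongside the aggregate is what makes a regional gap or a retrieval bottleneck visible; a single overall score would average them away.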

What technical directions should be prioritized for future improvements?

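Developers need to balance retrieval resources across languages, make interactions tolerant of imperfect user queries, and treat false-premise detection as a capability separate from answer generation, since the study found the two to be relatively independent.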

Commercial AI Chatbots as News Intermediaries: Accuracy Evaluation and Analysis of Limitations

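This study ran a 14-day systematic evaluation of six mainstream commercial AI chatbots (Gemini, Grok, Claude, the GPT series, and others) to measure their accuracy on breaking-news facts across multiple languages and regions. Based on 2,100 factual questions drawn from six of the BBC's global regional services, the study found that the best models exceeded 90% accuracy on multiple-choice questions, but accuracy dropped by 11% to 17% in free-response mode. The study identifies three major failure modes. First, a pronounced Anglocentric retrieval bias substantially lowers accuracy for non-English languages such as Hindi. Second, more than 70% of errors are attributable to retrieval failure (not finding the correct source) rather than to reasoning flaws. Third, models are extremely fragile when a question carries an implicit false premise, with some models accepting fabricated facts as much as 64% of the time. The study also found that premise detection and answer recovery are relatively independent capabilities. Taken together, these results suggest that high accuracy can mask regional inequality, over-reliance on retrieval infrastructure, and fragility in the face of imperfect user queries.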

Sources

arXiv