What challenges do LLMs face when processing historical documents?

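The paper proposes a diagnostic framework that decomposes the difficulty of historical text into four dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and context sensitivity. Measuring each dimension separately shows where models struggle with non-modern texts and where they do not.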
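To make the first dimension concrete, here is a minimal sketch of how a tokenization penalty could be computed as the relative increase in tokens per word over a modern baseline. The function names and the toy tokenizer are illustrative assumptions, not the paper's code; a real measurement would pass in the tokenizer of the model under test.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def tokens_per_word(text: str, tokenize: Tokenizer) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def tokenization_penalty(historical: str, modern: str, tokenize: Tokenizer) -> float:
    """Relative extra token cost of the historical text versus the modern baseline.

    A return value of 0.25 means the historical text needs 25% more tokens per word.
    """
    return tokens_per_word(historical, tokenize) / tokens_per_word(modern, tokenize) - 1.0


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]
```

Normalizing by word count keeps the comparison independent of passage length, so two passages of different sizes can still be compared.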

Can language models accurately represent the meaning of historical texts?

Largely, yes. Embedding similarity stays above 0.85, which indicates that models represent historical semantics accurately even when their next-token predictions are highly uncertain. The instability shows up in generation, not in representation.
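Embedding similarity of this kind is usually measured as cosine similarity between two vectors, for example the embedding of a historical passage and that of its modern counterpart. The sketch below assumes that metric and that pairing; the summary does not state which embedding model or comparison the paper used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A score near 1.0 means the two passages point in almost the same direction in embedding space, which is the property a semantic retrieval system relies on.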

How should digital libraries safely deploy LLMs for historical archives?

Libraries can safely use LLMs for semantic retrieval. Generative tasks such as translation need targeted adaptation. A simple temporal context prompt, which lowers surprisal by about 60%, is a low-cost first step that requires no model retraining.
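A temporal context prompt can be as simple as a one-line header that tells the model the language and period of the passage. The wording below is a hypothetical example; the summary does not give the exact prompt used in the paper, and the function name is an assumption.

```python
def with_temporal_context(passage: str, language: str, period: str) -> str:
    """Prefix a passage with a short statement of its language and period.

    The header wording is illustrative only, not the prompt from the paper.
    """
    header = f"The following passage is written in {language} from the {period}."
    return f"{header}\n\n{passage}"


# Usage: wrap the passage before sending it to the model.
prompt = with_temporal_context("<passage text here>", "Italian", "17th century")
```

Because the header is plain text added at inference time, it can be tested on an existing deployment without touching model weights.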

The Challenge of Historical Italian for Language Models: Tokenization Tax, Comprehension Tax, and Mitigation Strategies

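This paper addresses a blind spot in how large language models handle historical documents by proposing a diagnostic framework that decomposes historical-text difficulty into four dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and context sensitivity. The research team built an experimental benchmark comprising 17th-century Italian, 19th-century classic Italian, and an 18th-century Russian control group. The experiments show that although Russian and early modern Italian incur similar tokenization penalties (25-30%), the prediction uncertainty of 17th-century Italian is 2.4 times that of modern Italian, rising to 3.2 times for scholarly prose. Embedding similarity nevertheless stays above 0.85, indicating that models represent historical semantics accurately and that only generation is unstable. In addition, a simple temporal context prompt reduces surprisal by about 60%. The study concludes that digital libraries can safely deploy LLMs for semantic retrieval, but that generative applications require targeted adaptation.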
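The surprisal figures above (a 2.4 times ratio, a reduction of about 60%) are ratios of mean per-token surprisal. A minimal sketch of that arithmetic follows, assuming the model returns natural-log token probabilities and that surprisal is reported in bits. Both are assumptions, since the summary does not state the unit or the API, and the function names are illustrative.

```python
import math
from typing import Sequence


def mean_surprisal(token_logprobs: Sequence[float]) -> float:
    """Mean surprisal in bits per token, given natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


def surprisal_ratio(historical: Sequence[float], modern: Sequence[float]) -> float:
    """How many times more surprising the historical text is than the modern baseline."""
    return mean_surprisal(historical) / mean_surprisal(modern)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in surprisal, e.g. with and without a temporal context prompt."""
    if before <= 0.0:
        raise ValueError("baseline surprisal must be positive")
    return (before - after) / before
```

For instance, a passage whose tokens each receive probability 0.25 has a mean surprisal of 2 bits, twice that of a passage whose tokens each receive probability 0.5.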

Sources

arXiv