What does the study's diagnostic framework reveal about LLMs processing historical texts?

The framework decomposes difficulty into four dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and contextual sensitivity.
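Two of these dimensions can be made concrete with a small sketch. It assumes tokenization cost is measured as tokens per whitespace-delimited word and surprisal as the negative log probability of each token in bits; the summary does not give the study's exact definitions, and the function names below are hypothetical.

```python
import math


def tokens_per_word(token_count: int, text: str) -> float:
    """Tokens per whitespace-delimited word (assumed measure of tokenization cost)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return token_count / len(words)


def tokenization_penalty(historical_tpw: float, modern_tpw: float) -> float:
    """Relative extra token cost of historical text over a modern baseline."""
    return historical_tpw / modern_tpw - 1.0


def mean_surprisal_bits(token_probs: list[float]) -> float:
    """Average of -log2(p) over the probabilities a model assigned to each token."""
    if not token_probs:
        raise ValueError("no token probabilities given")
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)
```

Under these assumptions, a historical passage that needs 1.25 tokens per word against 1.0 for its modern counterpart carries a penalty of 0.25, that is, 25%. The token count and the per-token probabilities would come from whichever tokenizer and model are being evaluated.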

Do LLMs truly struggle to understand historical texts, and what does this mean for digital libraries?

Not at the level of meaning. Embedding similarity remains above 0.85 despite higher tokenization costs, which indicates that semantic representations stay stable. Digital libraries can therefore safely deploy LLMs for semantic search.
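A check of this kind might look like the sketch below. It assumes the similarity is cosine similarity between the embedding of a historical passage and the embedding of a modern rendering of it, which the summary does not state explicitly; `is_semantically_stable` and `SIMILARITY_FLOOR` are hypothetical names.

```python
import math

SIMILARITY_FLOOR = 0.85  # value reported in the summary; its use as a cutoff is an assumption


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def is_semantically_stable(historical_vec: list[float],
                           modern_vec: list[float],
                           floor: float = SIMILARITY_FLOOR) -> bool:
    """True when the two embeddings are at least as similar as the floor."""
    return cosine_similarity(historical_vec, modern_vec) >= floor
```

The vectors themselves would come from whatever embedding model a library already uses for search.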

How can generative applications better handle historical texts?

Simple temporal context prompting reduces surprisal by approximately 60%. Generative apps need targeted adaptation or fine-tuning for historical language domains.
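One way temporal context prompting could be set up is sketched below. The wording of the context line is an assumption rather than the study's actual prompt, and both function names are hypothetical.

```python
def add_temporal_context(passage: str, language: str, century: str) -> str:
    """Prefix a passage with a one-line note on its language and period (assumed wording)."""
    header = f"The following text is written in {language} from the {century} century."
    return f"{header}\n\n{passage}"


def relative_reduction(baseline: float, prompted: float) -> float:
    """Fraction by which surprisal drops once the temporal context is added."""
    if baseline <= 0:
        raise ValueError("baseline surprisal must be positive")
    return (baseline - prompted) / baseline
```

For a fair comparison, surprisal would presumably be measured only over the tokens of the passage itself, not over the added context line. With illustrative numbers, a drop from 10.0 to 4.0 bits per token corresponds to a relative reduction of 0.6, the order of magnitude the summary reports.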

Historical Italian as a Challenge for Large Models: Tokenization Tax, Comprehension Tax, and Mitigation Strategies

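This paper addresses blind spots in how large language models handle historical texts. It proposes a diagnostic framework that decomposes processing difficulty into four dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and contextual sensitivity. The research team built an evaluation dataset spanning three centuries: newly curated 17th-century Italian texts, 19th-century English classics as a control, and 18th-century Russian books for an orthogonal stress test. The experiments reveal a key finding: encoding cost and comprehension difficulty are clearly dissociated. Although Russian and Early Modern Italian both incur a tokenization penalty of 25% to 30%, the predictive surprisal of 17th-century Italian reaches 2.4 times that of the modern version (3.2 times for scholarly prose), far exceeding Russian. Embedding similarity nevertheless stays above 0.85 throughout, indicating that the models hold stable semantic representations of historical language. In addition, simple temporal context prompting reduces surprisal by about 60%. Together, these results suggest that digital libraries can safely deploy large language models for semantic retrieval, whereas generative applications need targeted adaptation.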

Sources

arXiv