Historical Italian vs. LLMs: Tokenization Tax, Comprehension Tax, and Mitigation Strategies
This paper examines a blind spot in how large language models (LLMs) process historical texts and introduces a diagnostic framework that decomposes processing difficulty into four independent dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and contextual sensitivity. The research team built an evaluation dataset spanning three centuries: newly annotated 17th-century Italian manuscripts, 19th-century Italian literary classics as a high-exposure control, and 18th-century Russian books for orthogonal stress testing. The key finding is a decoupling between encoding cost and comprehension difficulty. Both Russian and early modern Italian incur a 25–30% tokenization penalty, but the 17th-century Italian texts show a prediction surprisal 2.4 times higher than their modern counterparts (3.2 times for academic prose), far exceeding the increase seen for Russian. Embedding similarity nonetheless stays above 0.85, indicating that the models keep stable semantic representations of historical text. Simple temporal context prompting reduces surprisal by approximately 60%. These results suggest that digital libraries can deploy LLMs for semantic search with low risk, while generative applications require targeted adaptation.
Background and Context
As large language models (LLMs) spread through digital library workflows and cultural heritage archives, their capacity to process historical texts remains poorly understood. Traditional perspectives often treat the difficulty of historical language as a monolithic barrier, conflating orthographic variation, linguistic distance, and pre-training exposure into a single measure of complexity. This study addresses that ambiguity with a diagnostic framework that decomposes processing difficulty into four independent dimensions: tokenization cost, prediction uncertainty (surprisal), semantic robustness, and contextual sensitivity. The approach moves beyond generic performance scores to answer a fundamental question: when models encounter texts from centuries past, do they fail at the encoding stage because of vocabulary shifts, or does their deeper semantic understanding collapse? The distinction matters for assessing how well LLMs generalize to low-resource and long-tail language distributions, and it gives digital humanities projects a principled basis for deciding where these models can be applied.
The methodology replaces single-benchmark evaluation with a multidimensional assessment protocol. Tokenization cost is the ratio of token count to character count, which measures the encoding efficiency lost to orthographic variation. Prediction uncertainty is assessed via surprisal, the negative log-probability the model assigns to each observed token, which reflects how unexpected the model finds historical vocabulary and syntax. Semantic robustness is the cosine similarity, in embedding space, between a historical text and its modern standard counterpart; it indicates whether the model preserves meaning even when its token-level predictions are unstable. Finally, contextual sensitivity is tested by applying several temporal context prompting strategies. Comparing 17th-century Italian with 18th-century Russian serves as a control that separates the effect of linguistic distance from that of orthographic difference, which helps pinpoint the specific bottlenecks in historical text processing.
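The three measured quantities described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the inputs (token lists, per-token probabilities, embedding vectors) are assumed to come from whatever tokenizer and model are being evaluated, and the choice of base-2 logarithms and per-token averaging is an assumption, since the text does not state either.

```python
import math
from typing import Sequence


def tokens_per_char(tokens: Sequence[str], text: str) -> float:
    """Tokenization cost: number of tokens divided by number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokens) / len(text)


def mean_surprisal_bits(token_probs: Sequence[float]) -> float:
    """Mean surprisal in bits: average of -log2 p(token | preceding context).

    Assumption: base-2 logs and a per-token mean; the source does not
    specify the unit or the averaging scheme.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Semantic robustness proxy: cosine of the angle between two embeddings."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)
```

Keeping the three functions independent of any particular model mirrors the framework's premise that the dimensions are measured separately, so a text can score badly on one and well on another.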
Deep Analysis
The experimental dataset spans three centuries. It comprises newly annotated 17th-century Italian manuscripts (1610–1689) digitized from original page images, 19th-century Italian literary classics such as *I Promessi Sposi* as a high-exposure control, and 18th-century Russian civil-print books for orthogonal stress testing. The central finding is a decoupling between encoding cost and comprehension difficulty. Both Russian and early modern Italian incur a 25–30% tokenization penalty, which shows how inefficiently modern tokenizers handle historical orthography. The effect on prediction uncertainty, however, differs sharply between the two. The 17th-century Italian texts exhibit a prediction surprisal 2.4 times higher than their modern counterparts, rising to 3.2 times for academic prose. This far exceeds the mild increase observed in the Russian dataset, so a similar tokenization penalty does not imply a similar loss of predictability: the early modern Italian texts are markedly harder for current models to predict at the lexical and syntactic level.
Despite this high predictive cost, semantic representation stays stable. Embedding similarity remains consistently above 0.85 across all datasets, which indicates that LLMs keep robust semantic representations of historical text even when their token-level predictions are unreliable. The difficulty therefore appears to stem mainly from a shift in lexical distribution rather than a loss of semantic understanding. In effect, the model retains a representation of what the text means even when it struggles to predict the next token. In addition, simple temporal context prompts reduced surprisal by approximately 60%. A reduction of this size suggests that prompting alone can offset much of the model's bias toward modern usage by conditioning its predictions on the historical context of the input.
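To make the two headline comparisons concrete, the sketch below shows how a percentage tokenization penalty and a surprisal multiple would be derived from raw measurements. The function names and every numeric input are illustrative assumptions chosen to land inside the reported ranges; they are not figures from the study, which does not give the underlying values.

```python
def relative_penalty(historical: float, modern: float) -> float:
    """Fractional increase of a historical measurement over its modern baseline."""
    if modern <= 0:
        raise ValueError("modern baseline must be positive")
    return historical / modern - 1.0


def surprisal_ratio(historical: float, modern: float) -> float:
    """How many times higher historical surprisal is than the modern baseline."""
    if modern <= 0:
        raise ValueError("modern baseline must be positive")
    return historical / modern


# Illustrative inputs only (assumed, not taken from the study):
modern_tokens_per_char, historical_tokens_per_char = 0.25, 0.32
modern_bits, historical_bits = 2.5, 6.0

penalty = relative_penalty(historical_tokens_per_char, modern_tokens_per_char)
ratio = surprisal_ratio(historical_bits, modern_bits)

print(f"tokenization penalty: {penalty:.0%}")
print(f"surprisal ratio: {ratio:.1f}x")
```

Reporting the two quantities side by side is what exposes the decoupling: two corpora can share a similar penalty while their surprisal ratios differ widely.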
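One way to measure the effect of temporal context prompting is to score the same passage with and without a short period statement and compare the two surprisal values. The sketch below assumes a hypothetical `score(context, target)` callable that returns the mean surprisal of `target` given `context`; the prompt wording and all names are illustrative, since the text does not give the prompts that were used.

```python
from typing import Callable

# Hypothetical scorer: (context, target) -> mean surprisal of the target tokens
# given the context. Surprisal is measured on the target only, so the added
# prefix does not dilute the average.
Scorer = Callable[[str, str], float]


def temporal_prefix(era: str, language: str) -> str:
    """Build a one-sentence period statement (wording is an assumption)."""
    return f"The following passage is {language} text written in the {era}."


def surprisal_reduction(baseline: float, prompted: float) -> float:
    """Fractional drop in surprisal relative to the unprompted baseline."""
    if baseline <= 0:
        raise ValueError("baseline surprisal must be positive")
    return (baseline - prompted) / baseline


def measure_reduction(text: str, era: str, language: str, score: Scorer) -> float:
    """Score the passage with and without temporal context and compare."""
    baseline = score("", text)
    prompted = score(temporal_prefix(era, language), text)
    return surprisal_reduction(baseline, prompted)
```

Under this definition, a baseline of 5.0 and a prompted value of 2.0 correspond to a reduction of 0.6, which is the scale of the effect reported above.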
Industry Impact
These findings bear directly on the deployment of LLMs in digital libraries and cultural heritage digitization projects. Because semantic robustness remains high despite the tokenization penalty and the elevated prediction uncertainty, the study concludes that digital libraries can safely deploy LLMs for semantic search, classification, and summarization over historical archives. The risk of semantic misinterpretation is low, so automated indexing and retrieval systems can use these models to improve access to historical documents without introducing significant errors in meaning. Researchers can then query large archives of digitized manuscripts in natural language rather than relying on exact keyword matching.
The study also identifies clear limits for generative applications that depend on precise text production. For tasks such as automatic proofreading of historical texts, translation into the modern language, or creative rewriting, the high surprisal and the tokenization penalty pose substantial challenges. A model that predicts historical vocabulary poorly can hallucinate or produce stylistically inconsistent output. Applications that rely on generation therefore need targeted adaptation: temporal context prompting to ground the model in the correct era, or fine-tuning on period-specific corpora to reduce the encoding and prediction overhead. In practical terms, LLMs are ready for analytical and retrieval roles in the digital humanities, while generative roles require careful engineering to overcome the biases inherited from modern training data.
Outlook
The decoupling of encoding cost and semantic understanding reveals a nuanced landscape for the future of historical language processing. As the demand for digital access to global cultural heritage grows, the ability to efficiently process long-tail and historical languages becomes a competitive differentiator for AI providers. The current reliance on modern tokenizers creates a persistent tax on historical texts, inflating computational costs and reducing throughput. Future optimization efforts must focus on developing specialized tokenizers or adaptive encoding mechanisms that can handle orthographic variations more efficiently without sacrificing semantic fidelity. This could involve training models on mixed temporal corpora or implementing dynamic tokenization strategies that adjust based on the detected era of the input text.
Moreover, the effectiveness of simple temporal context prompting suggests that lightweight, cost-efficient interventions can yield significant performance gains. This points toward a future where prompt engineering becomes a standard component of historical NLP pipelines, rather than an ad-hoc solution. Researchers and practitioners should explore more sophisticated contextual cues, such as explicit era markers, author biographies, or contemporary event references, to further stabilize model predictions. Ultimately, the goal is to create systems that can seamlessly bridge the gap between historical and modern language, preserving the semantic richness of the past while leveraging the analytical power of modern AI. By addressing the specific challenges of tokenization and surprisal, the field can move closer to a truly inclusive digital humanities infrastructure that serves all eras of human history with equal precision and depth.
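The richer contextual cues mentioned above (era markers, author biographies, contemporary events) could be assembled into a prompt prefix along the following lines. This is a sketch under stated assumptions: the `ContextCues` fields, the labels, and the layout are all hypothetical, and the source names these cue types only as directions to explore, not as a tested template.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContextCues:
    """Hypothetical container for the cue types named in the text."""
    era: str
    author_note: Optional[str] = None
    events: List[str] = field(default_factory=list)


def build_context(cues: ContextCues) -> str:
    """Render only the cues that are present, one per line."""
    lines = [f"Period: {cues.era}"]
    if cues.author_note:
        lines.append(f"Author: {cues.author_note}")
    if cues.events:
        lines.append("Contemporary events: " + "; ".join(cues.events))
    return "\n".join(lines)


def build_prompt(cues: ContextCues, passage: str) -> str:
    """Place the context block before the passage to be processed."""
    return f"{build_context(cues)}\n\nPassage:\n{passage}"
```

Treating each cue as optional makes it straightforward to compare prompts cue by cue and see which ones actually stabilize predictions.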