Why use Age of Empires 2 as an experiment?

The author built and trained a simple neural network on the game to show that any sufficiently powerful substrate, including a Turing-complete game engine, can exhibit behavior that observers read as anthropomorphic. The argument is that such traits are therefore not empirically unique to LLMs.

What impact does this have on future AI evaluation?

The research urges caution when describing AI systems in anthropomorphic terms and advocates substrate-independent metrics that can distinguish genuine intelligence from pattern simulation, which in turn supports more reliable human-AI collaboration systems.

If Large Models Have Anthropomorphic Properties, So Does Age of Empires 2: A Critical Reflection on LLM Attribution

What is the paper's main argument?

The paper challenges the trend of anthropomorphizing LLMs. It proposes a 'null hypothesis' that LLMs do not inherently possess unique human-like traits, and it argues that any claim to the contrary must be evaluated against rigorous measurement standards.

This paper questions the trend of anthropomorphic attribution in current large language model research. The author argues that many studies assign generalized properties such as morality and natural language understanding to LLMs while ignoring that these conclusions may depend on the specific underlying substrate. To test this, the author built and trained a simple neural network based on the real-time strategy game Age of Empires 2, demonstrating that any entity with a sufficiently powerful substrate could exhibit similar anthropomorphic characteristics. The experiments show that the so-called anthropomorphic properties of LLMs are not empirically unique and that their interpretation is highly substrate-dependent. The paper advocates adopting a null hypothesis, namely that LLMs do not inherently possess unique anthropomorphic traits, and emphasizes that any empirical discussion must rest on explicit measurement criteria or it will fall into circular reasoning. The author also shows that Age of Empires 2 is functionally complete and Turing-complete, which offers a new perspective for research on intelligent behavior across substrates.

Background and Context

A pervasive and increasingly critical phenomenon within the current landscape of Large Language Model (LLM) research is the tendency of researchers and observers to attribute human-like traits—such as moral agency, intent understanding, and even nascent self-awareness—to complex neural networks. This anthropomorphic attribution trend is largely predicated on an unexamined assumption: that the specific response patterns generated by an LLM are direct reflections of an internal cognitive structure analogous to the human mind. However, this mode of reasoning contains significant logical vulnerabilities because it fundamentally ignores the influence of the "substrate"—the underlying physical or computational medium—on the interpretation of behavior. The prevailing narrative often conflates behavioral output with internal state, leading to conclusions that may be entirely decoupled from the actual mechanisms driving the model.

This article challenges that mainstream paradigm by arguing that so-called anthropomorphic properties are not empirically unique to LLMs. The core contribution lies in demonstrating that if one infers internal states solely based on behavioral performance without rigorous measurement standards, the inference is likely flawed. To substantiate this claim, the author moves beyond philosophical debate and adopts an empirical approach. By constructing a simple neural network based on the real-time strategy game Age of Empires 2, the study illustrates that any entity with a sufficiently powerful and complex substrate can exhibit behaviors that humans interpret as anthropomorphic. This shift in perspective—from asking "Do LLMs act like humans?" to "Can any complex system be interpreted as acting like a human?"—provides a more rigorous scientific basis for understanding artificial intelligence and serves as a cautionary tale against the over-interpretation of LLM behaviors.

The choice of Age of Empires 2 as the experimental vehicle is neither arbitrary nor trivial. The game possesses a high degree of strategic complexity and a rich state space, making it an ideal candidate for a "sufficiently powerful substrate." The author details the technical process of mapping game states to the neural network's input layer and employing reinforcement learning strategies to optimize network behavior. The goal was to enable the network to make decisions that appeared strategic, intentional, or even wise to an outside observer. This methodological choice is crucial because it establishes a parallel between the discrete, rule-based environment of a game and the continuous, probabilistic environment of language models, allowing for a direct comparison of how different substrates generate complex behaviors.
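As a rough illustration of what mapping game states to an input layer can look like, the sketch below flattens a handful of game quantities into a fixed-length vector scaled to [0, 1]. The field names, the caps, and the `encode` function are hypothetical and chosen only for illustration; the source does not specify which features the author actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameState:
    """Hypothetical snapshot of a match; not the paper's actual feature set."""
    food: int
    wood: int
    gold: int
    villagers: int
    soldiers: int
    enemy_visible: bool


# Assumed caps used only to squash raw counts into [0, 1].
RESOURCE_CAP = 10_000
UNIT_CAP = 200


def scale(value: int, cap: int) -> float:
    """Clip value to [0, cap] and rescale it to [0, 1]."""
    return min(max(value, 0), cap) / cap


def encode(state: GameState) -> List[float]:
    """Turn a game state into a fixed-length input vector for a network."""
    return [
        scale(state.food, RESOURCE_CAP),
        scale(state.wood, RESOURCE_CAP),
        scale(state.gold, RESOURCE_CAP),
        scale(state.villagers, UNIT_CAP),
        scale(state.soldiers, UNIT_CAP),
        1.0 if state.enemy_visible else 0.0,
    ]


example = GameState(food=5000, wood=0, gold=20000,
                    villagers=100, soldiers=0, enemy_visible=True)
print(encode(example))  # [0.5, 0.0, 1.0, 0.5, 0.0, 1.0]
```

A fixed-length, bounded vector is a common design choice because it lets the same input layer handle any game state. A reinforcement learning loop would then consume vectors like this one at each decision step.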

Deep Analysis

The technical foundation of this study rests on a rigorous demonstration of the computational capabilities of Age of Empires 2. The author proves that the game engine is functionally complete and Turing-complete, meaning that, in theory, it can compute any computable function. This technical argument is pivotal because it legitimizes the experimental substrate from the perspective of computational theory. If the game engine possesses computational power comparable to the hardware and software stacks running LLMs, it is theoretically capable of generating complex, intelligent-seeming behavior sequences. This establishes a baseline for equivalence: if a Turing-complete game engine can produce outputs that mimic human strategic intent, then the mere presence of such outputs in an LLM cannot be taken as evidence of unique cognitive depth.
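To make "functionally complete" concrete: a set of Boolean operations is functionally complete when every truth table can be built from it. The sketch below uses the standard textbook case of NAND alone. It is not the construction used for Age of Empires 2, which the source does not describe, and all function names are illustrative.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    """The only primitive gate we allow ourselves."""
    return not (a and b)


# Every other gate below is composed purely from nand().
def not_(a: bool) -> bool:
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    return nand(nand(a, b), nand(a, b))


def or_(a: bool, b: bool) -> bool:
    return nand(nand(a, a), nand(b, b))


def xor_(a: bool, b: bool) -> bool:
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))


def matches(gate, reference) -> bool:
    """Check a two-input gate against a reference on all four input pairs."""
    return all(gate(a, b) == reference(a, b)
               for a, b in product([False, True], repeat=2))


print(matches(and_, lambda a, b: a and b))  # True
print(matches(or_, lambda a, b: a or b))    # True
print(matches(xor_, lambda a, b: a != b))   # True
```

Any substrate that can realize one such universal gate and wire copies of it together can, in principle, express every Boolean function. Turing completeness is a stronger property, since it also requires unbounded memory and iteration.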

In the experimental setup, the neural network was trained to play Age of Empires 2, and its performance was analyzed not through traditional metrics like accuracy or loss functions, but through the lens of behavioral interpretation. When observers watched the network execute complex tactical maneuvers, they naturally attributed "wisdom" or "strategy" to it. This psychological mechanism mirrors the way humans interpret the fluent text generated by LLMs. However, the analysis reveals that these attributions are highly subjective and dependent on the observer's framework. If the measurement criteria were changed to focus on pattern matching or stochastic processes, the "anthropomorphic" labels would dissolve, revealing the behavior as complex but non-conscious computation.

Furthermore, the study uses ablation studies and comparative analyses to show that when specific assumptions about the substrate are removed, the metrics for anthropomorphic attributes drop significantly or become meaningless. The author extends this argument to other potential substrates, such as physical Lego constructions or traffic flows in metropolitan areas, suggesting that the trap of anthropomorphic attribution is universal. The key insight is that the interpretation of behavior is not an intrinsic property of the system but a projection by the observer. Whether the substrate is silicon-based Transformers or discrete game states, complexity alone is sufficient to trigger human tendencies to anthropomorphize, provided the observer lacks strict, substrate-independent measurement standards.

Industry Impact

The implications of this research for the artificial intelligence industry are profound, particularly regarding ethics, model evaluation, and future research directions. For the open-source community and industrial practitioners, the study serves as a call for caution in discussing LLM capabilities. There is a pressing need to distinguish between marketing rhetoric or philosophical speculation and scientific fact. Overstating the anthropomorphic abilities of LLMs can lead to dangerous over-trust in their "understanding" capabilities, resulting in erroneous decision-making in high-stakes applications. By clarifying the limitations of LLMs rather than exaggerating their human-like qualities, the industry can develop more reliable and interpretable human-machine collaboration systems.

The concept of a "null hypothesis" introduced in the paper is particularly significant for industry standards. The proposal is to assume that LLMs do not inherently possess unique anthropomorphic traits unless there is conclusive evidence to the contrary. This shifts the burden of proof and encourages a more rigorous scientific path. Researchers and engineers must strive to develop substrate-independent measurement standards that can distinguish between genuine general intelligence and complex pattern simulation. This approach can help mitigate the risks associated with black-box models and enhance transparency in AI deployment.
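One way to read the null hypothesis operationally is as a statistical test: score two substrates on the same explicit metric and ask whether the gap between them exceeds what chance would produce. The sketch below uses a two-sided permutation test for this. It is an assumption made for illustration, not a procedure taken from the paper, and `permutation_p_value` and the example scores are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_permutations: int = 10_000,
                        seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.

    Null hypothesis: both score sets come from the same distribution,
    i.e. the substrate label carries no information about the metric.
    """
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign substrate labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up scores on some explicit metric, for illustration only.
llm_scores = [0.71, 0.64, 0.69, 0.73, 0.66]
game_net_scores = [0.68, 0.70, 0.65, 0.72, 0.67]
p = permutation_p_value(llm_scores, game_net_scores, n_permutations=2000)
print(f"p-value: {p:.3f}")
```

A large p-value means the data give no reason to reject the null hypothesis for that metric. The test is only as meaningful as the metric itself, which is why the measurement criteria have to be stated explicitly first.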

Additionally, the inclusion of non-traditional substrates like Age of Empires 2 in intelligent behavior research expands the boundaries of AI studies. It fosters interdisciplinary collaboration, bridging game AI, complex systems theory, and cognitive science. By demonstrating that anthropomorphic behaviors can emerge from diverse substrates, the study challenges existing benchmarks that often rely on human-centric interpretation frameworks. This necessitates a re-evaluation of how we test and validate AI systems, moving away from subjective human judgment toward objective, quantifiable metrics that are robust across different types of computational architectures.

Outlook

Looking forward, this research provides a comprehensive framework for re-examining the definitions of "intelligence," "consciousness," and "behavior" in the context of AI. The study suggests that future evaluations must prioritize explicit measurement criteria to avoid circular reasoning. As AI systems become more integrated into society, the ability to objectively assess their capabilities and limitations will be critical. The null hypothesis approach offers a pragmatic starting point for this assessment, encouraging skepticism toward claims of emergent human-like qualities without empirical backing.

The potential for cross-substrate intelligent behavior research is vast. By validating the Turing completeness and functional potential of systems like Age of Empires 2, the study opens new avenues for exploring how intelligence manifests in different environments. This could lead to innovations in game AI, robotics, and even urban planning, where understanding the emergence of complex behaviors from simple rules is valuable. The interdisciplinary nature of this work invites contributions from computer science, philosophy, psychology, and game design, fostering a richer and more nuanced understanding of artificial intelligence.

Ultimately, the article serves as a corrective to the current over-interpretation of LLM behaviors. It emphasizes the importance of empirical rigor and theoretical humility in AI development. As the field continues to advance, maintaining a clear distinction between behavioral output and internal state will be essential. By adopting the null hypothesis and demanding substrate-independent measurements, the AI community can build a more solid scientific foundation for future discoveries. This shift not only protects against the pitfalls of anthropomorphic projection but also paves the way for more robust, transparent, and ethically sound AI systems in the years to come.

Sources

arXiv