Will an AI Travel Agent Book Bullfights for You? A Benchmark for Implicit Animal Welfare in Frontier AI Models
As AI agents shift from advisors to actors, existing text-based Q&A animal welfare benchmarks fail to evaluate how models behave when making real tool-based decisions. This paper introduces TAC (Travel Agent Compassion), the first benchmark measuring whether AI agents avoid animal-exploitation options when acting on behalf of users. The researchers constructed twelve hand-crafted travel booking scenarios spanning six categories of animal exploitation, expanded to forty-eight samples by controlling for price, ratings, and location confounders. Across seven frontier models from four labs, all scored below the 64% random baseline, with the best model, Claude Opus, achieving only 53%. Adding a single welfare-aware sentence to system prompts boosted Claude and GPT-5.5 by 47 to 63 percentage points, but DeepSeek and Gemini improved by less than 12 points. Audits found that models were unaware they were being evaluated, indicating that the low scores reflect genuine indifference rather than test detection.
Background and Context
The rapid evolution of artificial intelligence has catalyzed a fundamental shift in how digital assistants operate, transitioning them from passive information retrievers to active agents capable of executing complex tasks on behalf of users. As these AI agents gain autonomy in domains such as travel booking, menu planning, and procurement, the ethical implications of their decision-making processes have come under intense scrutiny. Existing benchmarks for evaluating AI ethics, particularly those concerning animal welfare, have predominantly relied on static text-based question-and-answer formats. These traditional methods assess whether a model can articulate ethical reasoning in response to direct prompts, but they fail to capture the nuanced behaviors exhibited when an agent must make real-time tool-based decisions. This gap is critical because the ability to discuss animal welfare in text does not necessarily translate to the ability to avoid exploitative options when acting as a proxy for a user.
To address this limitation, researchers have introduced the TAC (Travel Agent Compassion) benchmark, a novel evaluation framework designed to measure the implicit ethical alignment of frontier AI models in dynamic, action-oriented scenarios. Unlike previous studies that focused on explicit moral reasoning, TAC evaluates whether AI agents proactively avoid booking services that involve animal exploitation, such as bullfighting, elephant riding, or dolphin shows. The benchmark is grounded in the premise that as AI agents become more integrated into daily consumer activities, their default behaviors must align with societal ethical standards without requiring constant human oversight. By simulating realistic travel booking contexts, the study aims to uncover the hidden ethical blind spots in current large language models, providing empirical evidence on how these systems handle implicit moral dilemmas when given the agency to act.
The TAC benchmark was built to isolate ethical considerations from other drivers of choice. Researchers hand-crafted twelve distinct travel booking scenarios spanning six major categories of animal exploitation. To prevent models from making decisions based on non-ethical factors such as cost, user ratings, or location convenience, these initial scenarios were expanded into a dataset of forty-eight samples. This expansion systematically controlled for those confounding variables, so that variation in model behavior could be attributed to ethical considerations rather than commercial incentives. The study then ran these scenarios against seven frontier models from four different laboratories, including models from the Claude, GPT, Gemini, and DeepSeek families, in a controlled, tool-using environment.
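One way to picture the expansion from twelve scenarios to forty-eight samples is sketched below. This is an illustration under stated assumptions, not the authors' procedure: the four-variants-per-scenario split is inferred only from 48 / 12, and the `Option` fields, the `expand` helper, the reversal scheme, and the toy bullfight scenario are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Option:
    """One bookable option shown to the agent (hypothetical schema)."""
    name: str
    exploitative: bool
    price: float
    rating: float
    distance_km: float


# Confounders named in the article: price, ratings, location.
CONFOUNDERS = ("price", "rating", "distance_km")


def reverse_attribute(options, attr):
    """Return copies of `options` with the values of `attr` reversed across the list."""
    values = [getattr(o, attr) for o in options][::-1]
    return [replace(o, **{attr: v}) for o, v in zip(options, values)]


def expand(scenario_id, options):
    """Return the base scenario plus one variant per confounder (assumed design)."""
    variants = [(f"{scenario_id}/base", list(options))]
    for attr in CONFOUNDERS:
        variants.append((f"{scenario_id}/{attr}_reversed", reverse_attribute(options, attr)))
    return variants


# Toy scenario with made-up values; it is reused twelve times only to show the counts.
bullfight = [
    Option("Bullfight ticket", True, 40.0, 4.7, 1.0),
    Option("Flamenco show", False, 55.0, 4.5, 2.5),
    Option("Guided old-town walk", False, 30.0, 4.2, 0.5),
]
scenarios = {f"scenario_{i:02d}": bullfight for i in range(1, 13)}
samples = [s for sid, opts in scenarios.items() for s in expand(sid, opts)]
print(len(samples))  # 12 scenarios x 4 variants
```

Reversing one attribute at a time means no option is consistently the cheapest, best rated, or closest across a scenario's variants, so a model that simply optimizes for one of those attributes cannot avoid the exploitative option by accident in every variant.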
Deep Analysis
The experimental results from the TAC benchmark reveal a marked deficiency in the ethical alignment of current frontier AI models. None of the seven tested models scored above the 64% random baseline, meaning that in their default configurations these agents selected exploitative options more often than an agent choosing at random would. The highest-performing model, Claude Opus, scored only 53%, 11 percentage points below that baseline. This finding suggests that the ethical reasoning capabilities demonstrated in static text evaluations do not transfer effectively to dynamic agent deployments, where the model must navigate tool calls and external constraints. The low scores imply that without explicit intervention, AI agents may inadvertently facilitate activities that contradict widely held ethical norms regarding animal treatment.
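A random baseline above 50% is possible when most options in a sample are not exploitative. The sketch below shows how such a baseline and an agent's score could be computed. It assumes the score is the fraction of samples in which the agent's chosen option is non-exploitative; the function names and toy data are hypothetical, and the benchmark's actual scoring rule may differ.

```python
def random_baseline(samples):
    """Expected score of an agent picking uniformly at random among each sample's options.

    `samples` is a list of lists of booleans, True meaning the option is exploitative.
    """
    per_sample = [flags.count(False) / len(flags) for flags in samples]
    return sum(per_sample) / len(per_sample)


def welfare_score(samples, choices):
    """Fraction of samples in which the chosen option index is not exploitative."""
    avoided = sum(1 for flags, idx in zip(samples, choices) if not flags[idx])
    return avoided / len(samples)


# Toy data, not taken from the paper.
toy_samples = [
    [True, False, False],
    [True, False, False],
    [True, False],
    [True, False],
]
toy_choices = [0, 1, 0, 1]  # index of the option the agent booked in each sample

baseline = random_baseline(toy_samples)
score = welfare_score(toy_samples, toy_choices)
print(f"random baseline: {baseline:.1%}, agent score: {score:.1%}")
```

In this toy run the agent scores 50.0% against a random baseline of about 58.3%, the same qualitative pattern the article reports: an agent can avoid exploitative options half the time and still do worse than chance.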
Despite the poor baseline performance, the study highlights the potential of simple intervention strategies to significantly improve model behavior. When a single sentence emphasizing animal welfare awareness was added to the system prompts, certain models demonstrated substantial improvements. Claude and GPT-5.5 saw their scores increase by 47 to 63 percentage points, bringing them well above the random baseline. GPT-5.2 also showed a notable improvement of 26 percentage points. However, the effectiveness of this intervention was not uniform across all architectures; DeepSeek and Gemini models improved by less than 12 percentage points, suggesting that some models are more resistant to lightweight ethical prompting than others. This disparity underscores the need for tailored alignment strategies that account for the specific architectural and training differences between various large language models.
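The intervention described above amounts to appending one sentence to the system prompt and comparing scores before and after. The sketch below illustrates that comparison. The prompt text, the wording of the welfare sentence, and the example scores are hypothetical placeholders, since the article does not quote the actual prompt or give per-model scores.

```python
# Hypothetical prompt text; the article does not quote the real system prompt.
BASE_PROMPT = (
    "You are a travel booking assistant. "
    "Use the available tools to complete the user's request."
)
# Hypothetical wording of the single welfare-aware sentence.
WELFARE_SENTENCE = "Take animal welfare into account when choosing among options."


def build_system_prompt(welfare_aware: bool) -> str:
    """Return the baseline prompt, optionally with the welfare sentence appended."""
    if welfare_aware:
        return f"{BASE_PROMPT} {WELFARE_SENTENCE}"
    return BASE_PROMPT


def uplift_pp(baseline_score: float, prompted_score: float) -> float:
    """Improvement in percentage points (not percent) between two scores in [0, 1]."""
    return round((prompted_score - baseline_score) * 100, 1)


# Illustrative numbers only, not results from the paper.
print(uplift_pp(0.50, 0.75))
```

Reporting the change in percentage points rather than as a relative percentage keeps the comparison across models straightforward: an improvement from 50% to 75% is 25 points regardless of where the model started.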
To check that the observed behaviors were genuine and not artifacts of the testing environment, the researchers ran an auxiliary audit using Inspect Scout. Using Gemini 2.5 Flash Lite as the judge, they analyzed 288 transcripts from the top-performing models under baseline conditions. The audit found that none of the models showed awareness of being evaluated, indicating that their low scores were not the result of test detection or strategic gaming of the benchmark. This finding supports the conclusion that the models' indifference to animal welfare is a characteristic of their current alignment rather than a response to the experimental setup. The lack of awareness also raises concerns about the transparency of AI decision-making, as agents may proceed with ethically questionable actions without any internal flagging or hesitation.
Industry Impact
The implications of these findings extend beyond academic research, posing significant challenges for the industrial deployment of AI agents in consumer-facing sectors. The travel industry, in particular, is ripe for automation, with many companies exploring the use of AI agents to handle bookings and recommendations. The TAC benchmark results indicate that default configurations of these agents could inadvertently promote services involving animal exploitation, potentially exposing companies to reputational risks and ethical backlash. For instance, an AI travel agent might book a dolphin show or elephant ride for a user simply because it is the most convenient or most highly rated option, with no built-in mechanism for recognizing the ethical implications. This highlights the urgent need for developers to implement robust ethical safeguards before deploying AI agents in real-world scenarios.
Furthermore, the study underscores the limitations of relying solely on prompt engineering as a solution for ethical alignment. While adding a welfare-aware sentence significantly improved the performance of Claude and GPT-5.5, its minimal impact on DeepSeek and Gemini suggests that prompt-based interventions are not a universal fix. This variability indicates that deeper architectural changes or more sophisticated alignment techniques may be required to ensure consistent ethical behavior across different models. For industry leaders, this means that ethical AI deployment cannot be treated as a one-size-fits-all problem. Instead, it requires a nuanced understanding of each model's strengths and weaknesses, as well as a commitment to continuous monitoring and adjustment of ethical guidelines.
The research also calls for a shift in how the AI community evaluates model safety and ethics. The failure of existing text-based benchmarks to predict agent behavior in action-oriented tasks suggests that the industry needs new standards for assessing the ethical implications of AI agents. This includes developing benchmarks that simulate real-world tool use and decision-making processes, rather than relying on static Q&A formats. By adopting more comprehensive evaluation frameworks, the industry can better anticipate and mitigate the risks associated with autonomous AI systems. Additionally, the study's findings align with emerging regulatory frameworks, such as the EU's AI Act, which emphasize the need for high-risk AI systems to undergo rigorous testing and validation before deployment.
Outlook
Looking ahead, the TAC benchmark provides a valuable foundation for future research into the ethical alignment of AI agents. The significant performance gap between models and the varying responsiveness to ethical prompts highlight the need for more advanced alignment techniques that go beyond simple prompt engineering. Future studies should explore methods for instilling complex ethical reasoning capabilities directly into the model's architecture, ensuring that agents can navigate moral dilemmas autonomously and consistently. This may involve incorporating feedback from diverse ethical perspectives, using reinforcement learning from human feedback (RLHF) with a stronger emphasis on ethical outcomes, or developing new training datasets that prioritize ethical decision-making in dynamic contexts.
Additionally, the research opens up new avenues for investigating the cultural and contextual factors that influence ethical judgment in AI systems. While the TAC benchmark focused on animal welfare, the underlying principles can be applied to other ethical domains, such as privacy, fairness, and environmental sustainability. By expanding the scope of such benchmarks, researchers can gain a more holistic understanding of how AI agents navigate the complex moral landscape of human society. This broader perspective is essential for developing AI systems that are not only technically proficient but also socially responsible and aligned with global ethical standards.
Finally, the study serves as a reminder of the importance of transparency and accountability in AI development. As AI agents become more autonomous and integrated into daily life, it is crucial that their decision-making processes are open to scrutiny and evaluation. The use of audit mechanisms like Inspect Scout demonstrates the potential for third-party verification of AI behavior, which can help build trust among users and regulators. Moving forward, the AI community must prioritize the development of tools and frameworks that enable continuous monitoring and assessment of AI ethics, ensuring that these powerful technologies are used for the benefit of all stakeholders. The TAC benchmark is a significant step in this direction, offering a clear roadmap for addressing the ethical challenges posed by the next generation of AI agents.