How did top AI models perform in the TxBench-PP evaluation?

No current system made preclinical pharmacology decisions reliably. The best configuration, Claude Opus 4.8 with Pi, achieved only a 59.3% pass rate, while GPT-5.5 with Pi scored 55.3%. These results highlight significant gaps in current AI capabilities for complex scientific reasoning and for interpreting real-world experimental data.

What are the implications for the pharmaceutical industry?

The findings suggest AI cannot yet replace human experts in early-stage drug discovery. The industry must shift towards 'verification-based' evaluation. Companies should treat AI as an auxiliary tool, maintaining human oversight and multiple validation mechanisms in critical decision-making processes to mitigate risks associated with AI reasoning errors.

TxBench-PP: Evaluating the Real Capabilities of AI Agents in Preclinical Pharmacology Decisions

What is the primary purpose of the TxBench-PP benchmark?

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, featuring 100 tasks. It evaluates AI agents' ability to derive conclusions from real experimental data rather than relying on memorization, covering mechanisms of action, pharmacodynamics, and safety to assess decision-making in realistic drug discovery scenarios.

This paper introduces TxBench-PP, a verifiable benchmark for small-molecule preclinical pharmacology designed to evaluate the decision-making capabilities of AI agents in realistic drug discovery scenarios. The benchmark comprises 100 evaluation tasks spanning core areas including mechanisms of action, pharmacodynamics, compound-target binding, and safety, requiring AI systems to derive conclusions from real experimental data rather than relying on memorization. Testing across 11 models with 4,800 reasoning trajectories reveals that no current system can reliably perform preclinical pharmacology decisions. The best configuration, Claude Opus 4.8 paired with Pi, achieved endpoint passage on only 59.3% of tasks, while GPT-5.5 / Pi scored 55.3%. These results demonstrate significant gaps in current AI capabilities for complex scientific reasoning and real-world data interpretation, highlighting the urgent need for more rigorous evaluation frameworks in drug discovery.

Background and Context

The integration of artificial intelligence into drug discovery has long been predicated on the promise of compressing the iterative cycles of hypothesis generation and experimental validation. As AI agents become increasingly embedded in research workflows, the industry has faced a critical bottleneck: the lack of rigorous, verifiable evaluation frameworks that reflect the complexities of real-world laboratory decision-making. To address this gap, researchers have introduced TxBench-PP (TherapeuticsBench Preclinical Pharmacology), the inaugural benchmark within the broader TherapeuticsBench initiative. This benchmark is specifically designed to evaluate the decision-making capabilities of AI agents in the context of small-molecule preclinical pharmacology, a phase where errors in reasoning can lead to costly failures in later development stages. Unlike previous benchmarks that rely on static knowledge retrieval or multiple-choice questions, TxBench-PP demands that models derive conclusions from raw, unstructured experimental data, thereby simulating the actual cognitive load placed on human pharmacologists.

The fundamental challenge posed by TxBench-PP is its rejection of memorization-based performance. In traditional scientific AI evaluations, models often succeed by recalling facts from their pretraining data rather than demonstrating genuine reasoning. TxBench-PP circumvents this by providing AI agents with "work snapshots" of real experimental records, statistical outputs, and graphical data. The agents are required to navigate these data sources using programming or logical reasoning tools to answer specific questions regarding mechanisms of action (MoA), pharmacodynamics (PD), compound-target binding affinity, and safety profiles. This setup forces the AI to engage in active data interpretation, exposing vulnerabilities in scientific reasoning that are typically hidden by models that simply retrieve known answers from their internal knowledge bases. By focusing on verifiable outcomes rather than plausible-sounding text, the benchmark establishes a new standard for assessing the reliability of AI in high-stakes scientific environments.
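As one illustration of the kind of analysis such a question might call for, the sketch below estimates an IC50 from a dose-response series by interpolating on log concentration. The function name, the input layout, and the use of interpolation rather than a fitted four-parameter logistic curve are all assumptions made for this example; the benchmark's actual tasks and data formats are not described here, and the numbers are invented.

```python
import math


def interpolate_ic50(concentrations_nm, percent_inhibition):
    """Estimate IC50 by linear interpolation on log10(concentration).

    Hypothetical helper for illustration only. Returns None when the
    measured inhibition never crosses 50% between two adjacent points.
    """
    points = sorted(zip(concentrations_nm, percent_inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        # Look for the first adjacent pair that brackets 50% inhibition.
        if y_lo <= 50.0 <= y_hi and y_hi > y_lo:
            frac = (50.0 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + frac * (
                math.log10(c_hi) - math.log10(c_lo)
            )
            return 10 ** log_ic50
    return None


# Made-up values, not benchmark data.
toy_concentrations_nm = [1.0, 10.0, 100.0, 1000.0]
toy_inhibition = [5.0, 30.0, 70.0, 95.0]
toy_ic50 = interpolate_ic50(toy_concentrations_nm, toy_inhibition)
```

An answer computed this way can only come from the supplied data, which is the property the benchmark is described as testing.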

The scope of TxBench-PP is comprehensive, encompassing 100 distinct evaluation tasks that are meticulously indexed by project stage, experiment type, and structural complexity. These tasks cover the core pillars of preclinical pharmacology, including causal target validation, drug development potential assessment, and translational efficacy analysis. The benchmark’s design ensures that the evaluation is deterministic, with scoring based on strict, objective rules that allow for reproducibility. This shift from a "black box" evaluation, where only the final answer is judged, to a "white box" analysis, where the reasoning trajectory is scrutinized, provides researchers with granular insights into where and how AI models fail. It highlights the necessity for AI systems to not only understand language but also to possess the domain-specific pharmacological knowledge and data processing skills required to navigate complex, multi-variable experimental datasets.
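To make the idea of deterministic, rule-based scoring concrete, here is a minimal sketch of a grader that accepts a numeric answer within a relative tolerance or a categorical answer by normalized exact match. The rule types, field names, and tolerance semantics are hypothetical; the source does not specify how TxBench-PP encodes its scoring rules.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class NumericRule:
    expected: float
    rel_tol: float  # allowed relative error, e.g. 0.10 for 10%


@dataclass(frozen=True)
class LabelRule:
    expected: str


Rule = Union[NumericRule, LabelRule]


def grade(rule: Rule, answer: str) -> bool:
    """Return True if the submitted answer satisfies the rule.

    The same (rule, answer) pair always yields the same verdict,
    which is what makes the score reproducible.
    """
    if isinstance(rule, NumericRule):
        try:
            value = float(answer)
        except ValueError:
            return False  # free text where a number was required
        return math.isclose(value, rule.expected, rel_tol=rule.rel_tol)
    # Categorical answers: compare case-insensitively, ignoring padding.
    return answer.strip().lower() == rule.expected.strip().lower()
```

Because no judgment model or human rater is involved, two runs over the same trajectories produce identical scores.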

Deep Analysis

The empirical results from testing 11 major large language models across 4,800 reasoning trajectories reveal a stark reality about the current state of AI in scientific reasoning. No single system demonstrated the capability to make preclinical pharmacology decisions reliably, indicating a significant gap between current AI capabilities and the rigorous demands of drug discovery. The highest-performing configuration, Claude Opus 4.8 paired with the Pi strategy, achieved an endpoint passage rate of only 59.3%. This score was derived from 178 successful outcomes out of 300 attempts, with a 95% confidence interval ranging from 51.1% to 67.6%. While this represents the best performance observed, it falls well below the threshold required for autonomous deployment in critical scientific workflows, where error rates must be minimal to ensure patient safety and research integrity.

The second-best configuration, GPT-5.5 combined with the Pi strategy, scored lower, with a passage rate of 55.3% (166 out of 300 attempts; 95% confidence interval 47.0% to 63.6%). The two intervals overlap substantially, so the ordering of the top two configurations should be read with some caution. These figures underscore that even the most advanced proprietary models struggle with the nuanced interpretation of real-world experimental data. The performance gap between these top-tier models and the rest of the field suggests that while architectural improvements and larger parameter counts offer marginal gains, they are insufficient to overcome the fundamental challenges of scientific reasoning. The data indicates that current models often hallucinate causal relationships or misinterpret statistical significance when faced with novel or complex data structures that are not present in their training corpora.

Ablation studies conducted as part of the TxBench-PP evaluation further illuminate the specific limitations of existing AI architectures. The results demonstrate that simply increasing model size or optimizing prompt engineering techniques does not yield significant improvements in performance. Instead, the critical differentiator is the model’s ability to construct accurate reasoning chains and deeply understand the context of experimental data. Many models failed not because they lacked the vocabulary to describe pharmacological concepts, but because they could not logically connect disparate pieces of evidence to form a coherent conclusion. This highlights a persistent weakness in current AI systems: their tendency to prioritize linguistic fluency over logical validity, a trait that is particularly dangerous in scientific applications where precision is paramount.
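The headline pass rates can be recomputed from the reported counts, and doing so shows something about the intervals. The sketch below computes the pass rate and a simple normal-approximation (Wald) interval that treats all 300 attempts as independent. For 178 of 300 this gives roughly 53.8% to 64.9%, narrower than the reported 51.1% to 67.6%. The source does not say how its intervals were calculated. One plausible reading, offered here only as an assumption, is that repeated attempts on the same task are correlated and the interval was computed at the task level. The `cluster_bootstrap_interval` function illustrates that idea; it needs per-task outcomes, which are not given here.

```python
import math
import random


def pass_rate(passes, attempts):
    """Fraction of attempts that passed."""
    return passes / attempts


def wald_interval(passes, attempts, z=1.96):
    """Normal-approximation interval assuming independent attempts."""
    p = passes / attempts
    half_width = z * math.sqrt(p * (1.0 - p) / attempts)
    return p - half_width, p + half_width


def cluster_bootstrap_interval(outcomes_by_task, n_resamples=2000, seed=0):
    """Percentile bootstrap that resamples whole tasks, not attempts.

    outcomes_by_task: one list of 0/1 outcomes per task (hypothetical
    layout). Resampling at the task level keeps correlated attempts
    together, which usually widens the interval.
    """
    rng = random.Random(seed)
    n_tasks = len(outcomes_by_task)
    rates = []
    for _ in range(n_resamples):
        sample = [outcomes_by_task[rng.randrange(n_tasks)] for _ in range(n_tasks)]
        total = sum(len(task) for task in sample)
        rates.append(sum(sum(task) for task in sample) / total)
    rates.sort()
    return rates[int(0.025 * n_resamples)], rates[int(0.975 * n_resamples) - 1]


# Counts as reported for the two top configurations.
opus_rate = pass_rate(178, 300)
gpt_rate = pass_rate(166, 300)
opus_wald = wald_interval(178, 300)
```

If per-task outcomes were available, comparing the two interval types would show how much of the reported uncertainty comes from task-to-task variation rather than from attempt-level noise.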

The analysis also reveals that the Pi strategy, which likely involves specific prompting or inference techniques designed to enhance reasoning, provided a measurable but limited boost in performance. However, even with these enhancements, the models remained prone to errors in causal inference and multi-modal data integration. The failure modes identified in the study suggest that AI agents frequently struggle with tasks requiring the synthesis of information from multiple data types, such as combining graphical data with statistical tables. This limitation points to the need for more sophisticated model architectures that can better handle the heterogeneity of scientific data, moving beyond simple text-based reasoning to a more integrated understanding of experimental evidence.

Industry Impact

The publication of TxBench-PP carries profound implications for both the open-source research community and the pharmaceutical industry at large. For the open-source community, the benchmark provides a standardized, reproducible framework for evaluating AI agents in a specialized scientific domain. This standardization is crucial for fostering transparent and fair competition among researchers, allowing for direct comparison of model performance on identical, challenging tasks. By establishing a common ground for evaluation, TxBench-PP encourages the development of algorithms that prioritize accuracy and reliability over superficial fluency. It also serves as a valuable resource for identifying specific failure modes, guiding future research efforts toward addressing the identified gaps in causal reasoning and data interpretation.

For pharmaceutical companies and biotechnology firms, the results of TxBench-PP serve as a critical warning against the premature adoption of AI agents as autonomous decision-makers in drug discovery. The data clearly indicates that current AI systems are not yet capable of reliably performing the complex, high-stakes decisions required in preclinical pharmacology. This finding underscores the necessity for human oversight and multi-layered validation mechanisms in any AI-assisted workflow. Rather than replacing human experts, AI agents should be viewed as supportive tools that can accelerate data processing and hypothesis generation, but whose outputs must be rigorously verified by domain specialists. The benchmark highlights the risks associated with over-reliance on AI, particularly in scenarios where errors can have significant financial and safety consequences.

Furthermore, TxBench-PP influences the strategic direction of AI development in life sciences by shifting the focus from generative capabilities to verifiable reasoning. The industry must move away from evaluating AI based on its ability to generate plausible text and toward assessing its capacity to produce accurate, actionable insights from complex data. This shift requires a rethinking of model training strategies, with a greater emphasis on incorporating real-world experimental data and enforcing strict logical constraints during inference. The benchmark also encourages the development of new evaluation metrics that go beyond simple accuracy scores, incorporating measures of reasoning transparency, error analysis, and robustness across diverse data types.

The broader impact of TxBench-PP extends to the regulatory landscape, where the validation of AI-driven drug discovery processes is becoming increasingly important. As regulatory bodies begin to consider AI-generated data for approval decisions, the need for standardized, transparent evaluation frameworks becomes critical. TxBench-PP provides a model for such frameworks, demonstrating how AI performance can be assessed in a way that is both scientifically rigorous and practically relevant. This could facilitate the integration of AI into regulated workflows by providing clear evidence of model capabilities and limitations, thereby building trust among stakeholders and accelerating the responsible adoption of AI technologies in drug development.

Outlook

Looking ahead, the introduction of TxBench-PP marks the beginning of a more rigorous era in AI-driven drug discovery. As the TherapeuticsBench initiative expands, it is expected to release additional benchmarks covering other stages of the drug discovery pipeline, including clinical trials and post-market surveillance. This comprehensive approach will enable the development of a holistic evaluation ecosystem that assesses AI performance across the entire drug development lifecycle. By addressing the specific challenges of each stage, these benchmarks will provide a more nuanced understanding of AI capabilities and limitations, guiding the development of specialized models tailored to distinct scientific tasks.

The insights gained from TxBench-PP will likely drive significant advancements in model architecture and training methodologies. Future models will need to incorporate more sophisticated reasoning engines capable of handling multi-modal data and constructing complex causal chains. This may involve the integration of symbolic reasoning with neural networks, allowing models to combine the pattern recognition strengths of deep learning with the logical rigor of symbolic AI. Additionally, the emphasis on verifiable reasoning will encourage the development of self-correction mechanisms and uncertainty quantification tools, enabling AI agents to recognize when they lack sufficient information to make a reliable decision.

The industry will also see a growing emphasis on human-AI collaboration frameworks that leverage the strengths of both parties. AI agents will be designed to assist human experts by handling data-intensive tasks and identifying potential hypotheses, while humans will retain ultimate responsibility for decision-making and validation. This collaborative model will not only improve the reliability of AI-driven discoveries but also enhance the efficiency of the drug development process by reducing the time spent on manual data analysis and hypothesis generation. The success of this approach will depend on the development of intuitive interfaces and workflows that facilitate seamless interaction between human researchers and AI systems.

Finally, the establishment of TxBench-PP sets a precedent for the evaluation of AI in other scientific domains, such as materials science, chemistry, and biology. The principles of verifiable reasoning, deterministic scoring, and real-world data integration can be adapted to address the unique challenges of these fields. As AI continues to permeate scientific research, the need for robust, transparent, and scientifically grounded evaluation frameworks will only increase. TxBench-PP provides a blueprint for such frameworks, ensuring that AI technologies are developed and deployed in a manner that is both innovative and responsible, ultimately accelerating the discovery of new therapies and improving human health outcomes.

Sources

arXiv