TxBench-PP is the first verifiable benchmark for small-molecule preclinical pharmacology. It tests whether AI agents can derive scientific conclusions from real experimental data rather than from memorized literature.

What key limitations does it reveal about current AI?

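The best configuration, Claude Opus 4.8 paired with the Pi tool, passed only 59.3% of endpoint attempts (178 of 300). This points to a substantial gap in scientific reasoning and indicates that current models are not yet reliable enough to make preclinical drug decisions independently.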

What should the industry focus on next?

The industry must build robust evaluation frameworks, improve AI noise tolerance and causal reasoning on real data, and develop specialized models for pharmaceutical decision-making.

TxBench-PP: Assessing Real Reasoning in AI Agents for Small-Molecule Preclinical Pharmacology

This paper introduces TxBench-PP, the first verifiable benchmark focused on small-molecule preclinical pharmacology, designed to evaluate AI agents' ability to handle real experimental data in the early stages of drug discovery. Unlike traditional tests that rely on memorized literature knowledge, this benchmark requires agents to recover accurate conclusions from real assay data. The benchmark covers five major task categories, including mechanisms of action, pharmacokinetics, and compound-target binding. The authors evaluated 16 model configurations, producing 4,800 reasoning trajectories. Results show that no existing system can reliably make preclinical pharmacology decisions. The best configuration, Claude Opus 4.8, passed only 59.3% of endpoint attempts, revealing a significant gap in AI's capacity for complex scientific reasoning and underscoring the need for more reliable evaluation frameworks before AI is adopted more widely in the pharmaceutical industry.

Background and Context

The pharmaceutical industry stands at a critical juncture where the integration of artificial intelligence into drug discovery pipelines promises to compress the traditional timelines for new molecular entity development. However, the transition from theoretical potential to practical deployment is hindered by a significant lack of rigorous, verifiable evaluation frameworks. Current benchmarking methodologies predominantly assess large language models on their ability to memorize and retrieve existing literature, a task that bears little resemblance to the daily realities of preclinical pharmacology. In real-world scenarios, scientists must navigate noisy, unstructured, and heterogeneous experimental data to derive actionable conclusions. To address this gap, researchers have introduced TxBench-PP (TherapeuticsBench Preclinical Pharmacology), the first benchmark specifically designed to evaluate AI agents' capacity for handling real experimental data in the early stages of small-molecule drug discovery. Unlike previous tests that reward rote memorization, TxBench-PP requires agents to perform genuine scientific reasoning by extracting accurate insights from raw assay data, thereby simulating the complex decision-making processes inherent in pharmaceutical research.

The design of TxBench-PP represents a paradigm shift in how AI capabilities in life sciences are measured. The benchmark focuses on five core task categories essential to preclinical pharmacology: mechanisms of action, pharmacokinetics, compound-target binding, causal target validation, and developability and safety. By constructing a testing environment that mirrors industrial workflows, the study aims to expose the true limitations of current AI systems. The benchmark comprises 100 independent evaluation cases, each meticulously indexed by project phase, assay type, and task structure. This granularity allows for a nuanced assessment of where models succeed or fail, moving beyond aggregate accuracy scores to identify specific cognitive bottlenecks in scientific reasoning. The ultimate goal is to provide a clear roadmap for model optimization, ensuring that future iterations of AI agents are equipped to handle the intricacies of drug discovery rather than merely regurgitating known facts.
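To make the idea of indexed cases concrete, the following is a minimal Python sketch of how a case index could support breakdowns beyond a single aggregate score. It is not the benchmark's actual schema: the `Case` class, its field names, the `pass_rate_by` helper, and every example value are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Case:
    """Hypothetical index record for one evaluation case."""
    case_id: str
    category: str        # one of the five task categories
    project_phase: str   # placeholder label, not from the paper
    assay_type: str      # placeholder label, not from the paper
    task_structure: str  # placeholder label, not from the paper


def pass_rate_by(cases: List[Case],
                 results: Dict[str, List[bool]],
                 field: str) -> Dict[str, float]:
    """Group attempt outcomes by one index field and return pass rates."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        key = getattr(case, field)
        outcomes = results.get(case.case_id, [])
        passed[key] += sum(outcomes)   # True counts as 1
        total[key] += len(outcomes)
    return {k: passed[k] / total[k] for k in total if total[k] > 0}


# Invented example data: two cases, three attempts each.
cases = [
    Case("c1", "pharmacokinetics", "phase_a", "assay_x", "single_step"),
    Case("c2", "causal target validation", "phase_b", "assay_y", "multi_step"),
]
results = {"c1": [True, True, False], "c2": [False, False, True]}
rates = pass_rate_by(cases, results, "category")
```

Passing `"assay_type"` or `"project_phase"` as the field would slice the same outcomes along a different axis, which is the kind of granular view the indexing is meant to enable.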

Deep Analysis

The technical architecture of TxBench-PP is engineered to enforce a high-fidelity simulation of a scientist's workflow. Agents are presented with a programming-like interface where they receive real-world workflow snapshots and must independently locate and inspect relevant files and data sets. This setup demands more than natural language proficiency; it requires the ability to process structured data, write or interpret code to extract information, and synthesize findings into structured outputs. These outputs are then scored using deterministic algorithms, ensuring that the evaluation is objective, reproducible, and free from the subjectivity often associated with human grading. This methodological rigor is crucial for establishing trust in AI-driven decision-making, as it eliminates ambiguity in performance measurement and provides a stable baseline for comparing different model configurations.
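The article does not describe the scoring rules in detail, so the sketch below only illustrates what deterministic scoring of a structured output can look like in general. The function name `score_endpoint`, the field names, and the 5% relative tolerance are assumptions made for this example, not the benchmark's actual grader.

```python
import math
from typing import Any, Mapping


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def score_endpoint(submitted: Mapping[str, Any],
                   reference: Mapping[str, Any],
                   rel_tol: float = 0.05) -> bool:
    """Return True only if every reference field is matched.

    Numeric fields must agree within a relative tolerance (assumed 5%);
    string fields are compared case-insensitively; anything else must be equal.
    """
    for key, expected in reference.items():
        if key not in submitted:
            return False
        actual = submitted[key]
        if _is_number(expected):
            if not _is_number(actual):
                return False
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        elif isinstance(expected, str):
            if not isinstance(actual, str):
                return False
            if actual.strip().lower() != expected.strip().lower():
                return False
        elif actual != expected:
            return False
    return True


# Hypothetical reference answer and two agent submissions.
reference = {"ic50_nM": 12.0, "mechanism": "competitive inhibitor"}
good = {"ic50_nM": 12.4, "mechanism": "Competitive inhibitor"}
bad = {"ic50_nM": 15.0, "mechanism": "competitive inhibitor"}
```

Because the same inputs always yield the same verdict, a scorer of this kind can be rerun by anyone, which is the property that makes the evaluation reproducible.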

The experimental evaluation tested 16 model-tool configurations built on 11 different base models, generating a total of 4,800 reasoning trajectories. The results revealed a stark reality: no existing system could make preclinical pharmacology decisions reliably enough for industrial application. The highest-performing configuration, Claude Opus 4.8 paired with the Pi tool, achieved an endpoint pass rate of only 59.3% (178 of 300 attempts; 95% confidence interval 51.1% to 67.6%). The second-best configuration, GPT-5.5 with Pi, followed closely with a pass rate of 55.3% (166 of 300 attempts; 95% confidence interval 47.0% to 63.6%). Because the two confidence intervals overlap substantially, the ranking between the top two configurations should be read with caution. The figures nonetheless indicate that even the most advanced commercially available models struggle to remain reliable when faced with the complexity of real experimental data, and they suggest that current systems are not yet robust enough to support autonomous decision-making in critical scientific domains.
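The reported figures can be checked with a few lines of Python. The sketch below assumes three attempts per case, a value inferred from 300 attempts over 100 cases rather than stated in the article. It also computes a simple normal-approximation (Wald) interval, which is not necessarily the method the authors used.

```python
import math
from typing import Tuple

CONFIGURATIONS = 16
CASES = 100
ATTEMPTS_PER_CASE = 3  # assumption: inferred from 300 attempts / 100 cases

total_trajectories = CONFIGURATIONS * CASES * ATTEMPTS_PER_CASE


def pass_rate(passed: int, attempts: int) -> float:
    """Fraction of attempts that passed."""
    return passed / attempts


def wald_interval(passed: int, attempts: int,
                  z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation 95% interval treating attempts as independent."""
    p = passed / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p - half_width, p + half_width


opus_rate = pass_rate(178, 300)
gpt_rate = pass_rate(166, 300)
opus_low, opus_high = wald_interval(178, 300)

print(f"total trajectories: {total_trajectories}")
print(f"Claude Opus 4.8 + Pi: {opus_rate:.1%}")
print(f"GPT-5.5 + Pi: {gpt_rate:.1%}")
print(f"Wald 95% interval: {opus_low:.1%} to {opus_high:.1%}")
```

The Wald interval comes out narrower than the reported 51.1% to 67.6%. One plausible explanation, offered here as an assumption, is that the authors account for repeated attempts on the same case being correlated, for example by resampling whole cases, which widens the interval.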

Further analysis through ablation studies highlighted significant variation in model performance across task types. Tasks such as causal target validation and translational efficacy assessment proved particularly challenging, with substantially higher error rates. These tasks require deep logical inference and the ability to connect disparate pieces of evidence, exposing the limitations of models that rely heavily on pattern matching rather than causal reasoning. The data delineate the current performance boundaries of AI agents in scientific reasoning and suggest that simply increasing model parameters or refining prompt engineering strategies is unlikely to overcome these deficits. The findings point to a need for architectural innovations that improve an agent's tolerance of noise in experimental data and its ability to integrate multimodal information.

Industry Impact

The introduction of TxBench-PP has profound implications for both the open-source research community and the pharmaceutical industry at large. For the open-source community, the benchmark provides a standardized, reproducible platform that shifts the focus from superficial accuracy metrics to the robustness of models in complex, long-chain reasoning tasks. This shift encourages researchers to develop more sophisticated evaluation metrics and to prioritize the reliability of AI agents in scientific contexts. By establishing a common ground for comparison, TxBench-PP facilitates more meaningful collaboration and accelerates the development of next-generation models that are better suited for real-world applications. It serves as a catalyst for innovation, pushing the boundaries of what is currently possible in AI-driven drug discovery.

For the pharmaceutical industry, the results of TxBench-PP serve as a critical reality check. The benchmark reveals the significant limitations of current AI technologies in assisting with drug discovery, particularly in making high-stakes decisions. This insight urges companies to exercise caution when relying on AI for critical phases of the drug development pipeline. Instead of treating AI as a replacement for human expertise, the industry must view it as a tool that requires extensive validation and oversight. The benchmark also highlights the urgent need for investment in specialized models that are optimized for scientific reasoning. Pharmaceutical companies may need to allocate more resources to developing proprietary AI systems that can handle the specific nuances of their data, rather than relying solely on general-purpose large language models.

Moreover, TxBench-PP marks the beginning of the TherapeuticsBench project, laying the groundwork for future expansions into other therapeutic modalities and stages of drug discovery. This expansion will further reinforce the importance of credible, verifiable evaluation frameworks in the AI drug discovery sector. The benchmark emphasizes that building trust in AI systems is as important as developing the models themselves. As the industry moves forward, the ability to validate AI decisions against real experimental data will become a key differentiator for companies seeking to use AI for competitive advantage. TxBench-PP thus serves as a reference point for trust, guiding the industry toward more responsible and effective integration of AI technologies.

Outlook

Looking ahead, the development of AI agents capable of reliable preclinical pharmacology decisions will require a multifaceted approach that addresses the current limitations identified by TxBench-PP. Future research must focus on enhancing the noise tolerance of models when processing real experimental data, which is often messy and incomplete. Improving the ability of agents to integrate multimodal information, such as combining textual data with chemical structures and assay results, will be essential for achieving a holistic understanding of biological systems. Additionally, advancing causal reasoning capabilities will be critical for tasks that require inferring cause-and-effect relationships from observational data, a common scenario in pharmacology.

The trajectory of AI in drug discovery will likely see a shift towards more specialized, domain-specific models that are fine-tuned on high-quality, curated datasets. These models will need to be embedded within robust validation frameworks that continuously test their performance against real-world benchmarks like TxBench-PP. Collaboration between AI researchers, pharmacologists, and data scientists will be vital to ensure that these models are not only technically sophisticated but also scientifically valid. The industry must also prioritize the development of tools that allow for greater transparency and interpretability, enabling scientists to understand and trust the reasoning processes of AI agents.

Ultimately, the goal is to realize the revolutionary potential of AI in accelerating the discovery and development of new medicines. However, this vision can only be achieved if the industry commits to rigorous evaluation and continuous improvement of AI systems. TxBench-PP provides a crucial starting point for this journey, highlighting the gaps that need to be bridged and the standards that must be met. As the technology evolves, the focus must remain on building AI agents that are not just intelligent, but also reliable, robust, and capable of contributing meaningfully to the advancement of human health. The path forward requires patience, investment, and a steadfast commitment to scientific integrity.

Sources

arXiv