TxBench-PP is the first verifiable benchmark for small-molecule preclinical pharmacology. It tests whether AI agents can draw conclusions from real experimental data rather than from memorized literature.

What key problem does it reveal?

Even the best-performing configuration passed only 59.3% of its attempts, which shows that AI agents are not yet reliable enough to make preclinical pharmacology decisions independently in an industrial setting.

What are the future development directions?

TxBench-PP is the first slice of the broader TherapeuticsBench project. Additional benchmarks covering other drug discovery stages and therapeutic modalities are expected to follow.

TxBench-PP: Assessing Real-World Decision-Making of AI Agents in Small-Molecule Preclinical Pharmacology

This paper introduces TxBench-PP, a verifiable benchmark for small-molecule preclinical pharmacology that evaluates how reliably AI agents make decisions in real-world drug discovery scenarios. Unlike traditional tests that reward literature memorization, this benchmark requires agents to recover accurate conclusions from genuine experimental data. The study evaluated 16 model-harness configurations built on 11 models across 100 evaluation tasks in five areas, including mechanism of action and pharmacodynamics inference, generating 4,800 trajectories in total. Results show that no system can reliably execute preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, succeeded in only 59.3% of endpoint attempts, indicating that current AI agents still struggle with complex, unstructured real-world experimental data and fall short of the reliability that industrial use requires.

Background and Context

The integration of artificial intelligence into drug discovery has long promised to compress the cycle of interpretation and decision-making, thereby accelerating the path from molecular identification to clinical candidate. However, the transition from theoretical potential to practical deployment in pharmaceutical workflows requires a rigorous, trustworthy evaluation of agent performance in real-world scenarios. Historically, benchmarks in this domain have disproportionately focused on an agent's ability to memorize and retrieve known literature facts, testing knowledge recall rather than scientific reasoning. This approach fails to capture the complexity of actual drug discovery, where data is often noisy, unstructured, and derived from novel experiments rather than curated textbooks. To address this critical gap, the research team introduced TxBench-PP (TherapeuticsBench Preclinical Pharmacology), the first verifiable benchmark specifically designed for small-molecule preclinical pharmacology. As the initial slice of the broader TherapeuticsBench project, it represents a paradigm shift from "knowledge retrieval" to "scientific reasoning," providing a new methodological foundation for assessing the reliability of automated decision-making in critical stages of drug development.

TxBench-PP is designed to reproduce the workflows of pharmaceutical research as closely as possible. The benchmark comprises 100 evaluation tasks, indexed by procedural stage, experimental type, and task structure. These tasks span five areas: mechanism of action (MoA) inference, pharmacodynamics (PD) inference, compound-target binding, causal target validation, and developability and safety assessment. Unlike traditional tests that present simplified questions, agents in TxBench-PP receive snapshots of real workflows. They are placed in a coding environment where they must independently inspect and analyze the data files provided. This design forces the agent to process unstructured data, pick out the key information amid noise, and reason to a conclusion. The final outputs are structured answers scored by deterministic rules, which makes scoring objective and reproducible, while the use of real workflow snapshots keeps the tasks close to industrial practice.
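
How deterministic scoring of a structured answer might work can be illustrated with a minimal sketch. The field names, the tolerance, and the rubric format below are hypothetical and are not taken from TxBench-PP; the sketch only shows the general idea of checking an answer against fixed rules so that the same answer always receives the same score.

```python
import math


def field_matches(expected, submitted, rel_tol=0.0):
    """Compare one submitted field against its expected value (hypothetical rules)."""
    if isinstance(expected, bool):
        return isinstance(submitted, bool) and submitted == expected
    if isinstance(expected, (int, float)):
        # Reject booleans and non-numeric values before the numeric comparison.
        if isinstance(submitted, bool) or not isinstance(submitted, (int, float)):
            return False
        return math.isclose(submitted, expected, rel_tol=rel_tol)
    if isinstance(expected, str):
        # Categorical fields: case- and whitespace-insensitive exact match.
        return (
            isinstance(submitted, str)
            and submitted.strip().lower() == expected.strip().lower()
        )
    return submitted == expected


def score_answer(rubric, answer):
    """Pass only if every rubric field is present in the answer and matches.

    rubric maps field name -> (expected value, relative tolerance).
    """
    return all(
        field in answer and field_matches(expected, answer[field], rel_tol)
        for field, (expected, rel_tol) in rubric.items()
    )


# Hypothetical rubric for an invented task: name the most potent compound
# and report its IC50 within 10% of the reference value.
rubric = {
    "most_potent_compound": ("CMPD-7", 0.0),
    "ic50_nM": (12.5, 0.10),
}

if __name__ == "__main__":
    print(score_answer(rubric, {"most_potent_compound": "cmpd-7", "ic50_nM": 13.0}))
    print(score_answer(rubric, {"most_potent_compound": "CMPD-7", "ic50_nM": 20.0}))
```

Because the rules involve no model or human judgment, rerunning the scorer on the same answer cannot change the verdict, which is the property that makes a benchmark of this kind verifiable.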

Deep Analysis

The experimental setup for TxBench-PP evaluated 16 model-harness configurations built on 11 distinct foundation models. Testing at this scale generated 4,800 trajectories in total, or 300 per configuration, which is enough to report pass rates with confidence intervals rather than single-run anecdotes. The findings are stark: no system tested could reliably execute preclinical pharmacology decisions. Because the shortfall holds across every configuration, it points to a general limitation of current state-of-the-art models in this kind of scientific reasoning rather than a weakness of any one system. The results suggest that model scale alone does not yet yield reliable scientific agency, and that progress will likely depend on architectures and training data that better support complex, multi-step inference in noisy environments.

The performance metrics quantify these limitations. The strongest configuration, Claude Opus 4.8 paired with the Pi harness, passed only 59.3% of endpoint attempts, succeeding in 178 of 300 (95% confidence interval: 51.1%-67.6%). This falls well below the level needed for industrial reliability, where near-perfect accuracy is often necessary to avoid costly errors in drug development. The second-best configuration, GPT-5.5 / Pi, scored slightly lower at 55.3% (166/300; confidence interval: 47.0%-63.6%), and the two intervals overlap substantially, so the ranking between them is not clear-cut. These numbers underscore that even the most advanced commercial models are not yet capable of autonomous, reliable decision-making in this scientific context. The variation across configurations also suggests that the choice of model and harness matters, so optimization is possible but currently insufficient for full automation.
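
The arithmetic behind these figures can be checked with a short sketch. It recomputes the attempts per configuration from the reported totals and the two pass rates from the reported counts. It also computes a naive normal-approximation (Wald) interval that treats all 300 attempts as independent; that method is an assumption made here for comparison only, since this summary does not say how the reported intervals were derived.

```python
import math

# Totals reported in the article.
CONFIGURATIONS = 16
TASKS = 100
TRAJECTORIES = 4800

attempts_per_config = TRAJECTORIES // CONFIGURATIONS  # 300 endpoint attempts
attempts_per_task = attempts_per_config // TASKS      # implied repeats per task


def pass_rate(passes, attempts):
    """Fraction of attempts that passed."""
    return passes / attempts


def wald_interval(passes, attempts, z=1.96):
    """Naive 95% interval assuming independent attempts (an assumption here)."""
    p = passes / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p - half_width, p + half_width


# Pass counts reported for the two strongest configurations.
reported_passes = {"Claude Opus 4.8 / Pi": 178, "GPT-5.5 / Pi": 166}

if __name__ == "__main__":
    print(f"attempts per configuration: {attempts_per_config}")
    print(f"implied attempts per task: {attempts_per_task}")
    for name, passes in reported_passes.items():
        low, high = wald_interval(passes, attempts_per_config)
        rate = pass_rate(passes, attempts_per_config)
        print(f"{name}: {rate:.1%} (naive interval {low:.1%}-{high:.1%})")
```

Under these assumptions the naive intervals come out at roughly 53.8%-64.9% and 49.7%-61.0%, narrower than the reported ones. One possible explanation is that the reported intervals account for correlation between repeated attempts on the same task, but that is a guess rather than something stated in the source.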

Industry Impact

The release of TxBench-PP has significant implications for both the open-source research community and the pharmaceutical industry. For researchers, it provides a standardized, realistic benchmark for measuring model progress accurately. By moving away from simplified datasets that can create an illusion of progress, TxBench-PP forces the community to confront the actual capabilities of AI agents. This shift is essential for directing future research toward genuine scientific problems rather than toward higher scores on trivial tasks. It also sets a new baseline for what counts as a "successful" agent in drug discovery: one that demonstrates robust reasoning over unstructured data rather than mere fact retrieval.

For pharmaceutical companies, the results serve as a risk warning. The finding that no system can reliably perform preclinical pharmacology decisions suggests that AI agents are not yet ready to drive this stage of drug discovery independently. Companies would therefore do well to proceed cautiously and invest in hybrid workflows that combine AI efficiency with oversight by human experts. The high error rates observed, even for the best-performing configurations, make rigorous manual verification necessary before any AI-generated decision is acted upon. Furthermore, as the first slice of the TherapeuticsBench project, TxBench-PP marks the start of a more granular approach to AI evaluation in drug discovery. Future benchmarks will likely cover other stages of the drug discovery pipeline and different therapeutic modalities, building an evaluation ecosystem that aligns more closely with industry needs.

Outlook

Looking ahead, the primary challenge for the field is to enhance the reasoning capabilities and decision-making reliability of AI agents when faced with complex, unstructured real-world data. TxBench-PP provides a clear metric and direction for this improvement, emphasizing the need for models that can handle noise and ambiguity inherent in experimental data. Future research will likely focus on developing specialized architectures and training methodologies that better support multi-step scientific inference.

The results of configurations like Claude Opus 4.8 / Pi show what is currently possible, but the gap to industrial reliability remains significant. Bridging this gap will require not only advances in large language models but also improvements in how agents interact with experimental data and laboratory workflows. As the TherapeuticsBench project expands, it is intended to provide a framework for tracking progress across the entire drug discovery lifecycle, guiding the development of AI systems that can genuinely augment human scientists in the search for new therapies. The journey from promising prototype to reliable industrial tool is ongoing, and TxBench-PP marks an important step in defining the path forward.

Sources

arXiv