LLMs Give Novices a 4x Accuracy Boost on Biosecurity Tasks, Outperforming Experts

Can LLMs enable untrained people to perform expert-level biology tasks? This multi-model study tested novices with LLM access vs. internet-only access across eight biosecurity-relevant task sets, with up to 13 hours per task.

The results are striking: LLM-assisted novices were 4.16x more accurate than internet-only controls (95% CI [2.63, 6.87]). On 3 of 4 benchmarks with expert baselines, LLM novices outperformed domain experts. Perhaps most alarming: standalone LLMs often exceeded LLM-assisted novices, suggesting users weren't fully leveraging model capabilities.
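
The headline result is a ratio of group accuracies with a 95% confidence interval. As a rough illustration of how such a ratio and interval can be estimated, here is a minimal nonparametric bootstrap sketch in Python; the per-participant scores, group sizes, and resulting numbers are made-up assumptions for demonstration, not the study's data or method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-participant accuracies (fraction of task items correct).
# These are illustrative values, NOT the study's data.
llm_group = np.array([0.62, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.51])
internet_group = np.array([0.14, 0.10, 0.18, 0.12, 0.16, 0.09, 0.20, 0.13])

def accuracy_ratio(a: np.ndarray, b: np.ndarray) -> float:
    """Ratio of group mean accuracies (the relative-accuracy point estimate)."""
    return a.mean() / b.mean()

# Nonparametric bootstrap: resample participants within each group with
# replacement, recompute the ratio, and take percentile bounds.
boot = np.array([
    accuracy_ratio(
        rng.choice(llm_group, size=llm_group.size, replace=True),
        rng.choice(internet_group, size=internet_group.size, replace=True),
    )
    for _ in range(10_000)
])

print(f"point estimate: {accuracy_ratio(llm_group, internet_group):.2f}x")
print(f"95% CI: [{np.percentile(boot, 2.5):.2f}, {np.percentile(boot, 97.5):.2f}]")
```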

89.6% of participants reported little difficulty obtaining dual-use-relevant information despite safety guardrails. This paper provides the strongest empirical evidence yet that LLMs substantially lower the expertise barrier for potentially dangerous biological tasks, a critical finding for AI safety policy and biosecurity governance.

LLMs perform increasingly well on biology benchmarks, but a critical question remains: can they actually help **non-experts** complete dangerous biological tasks? This paper tests that question directly.

Experimental Design

The team designed dual-use biology tasks: tasks with both legitimate research value and potential for misuse. Crucially, all of these tasks can be completed in silico (computationally), requiring no laboratory access.

Participants were split into novice and expert groups; novices completed tasks either with LLM access or with internet-only access, while experts provided unassisted baselines.

Key Findings

  • **Novices + LLM** achieved accuracy **4x that of unassisted novices**
  • More concerning: on certain tasks, LLM-assisted novices **outperformed unassisted experts**
  • LLMs provided not just knowledge, but structured problem decomposition capabilities

Security Implications

This is not a theoretical risk. Results demonstrate that LLMs can significantly lower the "expertise barrier" in biosecurity-relevant tasks. The paper calls for:

1. Stricter safety filtering from model providers

2. Evaluation frameworks that include "uplift" metrics, not just capability tests (a minimal sketch of such a metric follows this list)

3. Re-examination of open-source vs. closed-source tradeoffs
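
To make point 2 concrete, here is a minimal sketch of what an "uplift" metric could look like: the ratio of assisted to baseline performance on the same task set. The `Condition` class, function names, and accuracy values are illustrative assumptions, not the paper's evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Condition:
    """Mean task-set accuracy observed under one evaluation condition."""
    name: str
    accuracy: float  # fraction correct, in [0, 1]

def uplift(assisted: Condition, baseline: Condition) -> float:
    """Relative uplift: how many times more accurate the assisted condition
    is than the baseline on the same tasks. Values > 1 indicate uplift."""
    if baseline.accuracy <= 0:
        raise ValueError("baseline accuracy must be positive to form a ratio")
    return assisted.accuracy / baseline.accuracy

# Illustrative numbers only, not the paper's results:
internet_only = Condition("novice + internet", 0.15)
llm_assisted = Condition("novice + LLM", 0.62)

print(f"uplift: {uplift(llm_assisted, internet_only):.2f}x")  # -> 4.13x
```

A framework built on this idea would report uplift alongside raw capability scores, distinguishing a model that merely answers well from one that transfers that ability to novices.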

This is one of the most compelling empirical studies on LLM biosecurity risk to date.

AI Governance Perspective

This research connects directly to today's most active AI governance debates. As LLM capabilities improve rapidly, safety evaluation cannot stop at asking what a model can answer; it must also ask what the model enables different users to do. The uplift metric proposed here may become standard in future AI safety evaluation frameworks, and both the EU AI Act and US AI regulatory proposals are grappling with similar "capability uplift" concerns.
