Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

The Claudini paper (arXiv, March 2026) introduces Autoresearch for automated discovery of LLM adversarial attacks. The five-stage loop comprises literature mining, hypothesis generation, experiment implementation, large-scale evaluation, and strategy evolution via genetic algorithms and RL. Attacks surpassing SOTA were found against GPT-4, Claude 3.5, Gemini Pro, and Llama 3 70B, including a novel class of context drift attacks. The work opens new directions for automated red-team testing in AI safety.


Paper Overview

The Claudini paper, published on arXiv in March 2026, introduces a new AI safety research methodology: using AI systems to automatically discover adversarial attack algorithms against large language models (LLMs). "Autoresearch" refers to an AI research system that autonomously designs experiments, executes tests, analyzes results, and iteratively optimizes attack strategies without continuous human intervention. Using this approach, the research team discovered multiple adversarial attack algorithms that surpass current state-of-the-art (SOTA) methods.

Technical Methodology

At the core of the Claudini system is an automated research loop. The system first analyzes existing adversarial-attack literature and publicly available methods, constructing a knowledge graph of attack strategies. It then uses the code-generation capabilities of LLMs to implement new attack variants automatically, execute tests against target models, and collect success-rate data. Based on the experimental results, the system adjusts parameters and generates new attack variants in iterative cycles.
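The paper does not publish the loop's implementation, but its shape can be sketched as a propose-evaluate-keep-mutate cycle. The sketch below is a deliberately minimal stand-in: the names (evaluate_attack, mutate, research_loop) and the two-parameter attack configuration are invented for illustration, and a synthetic scoring function replaces the real step of running an attack against a target model.

```python
import random

random.seed(0)

def evaluate_attack(params):
    """Stand-in for running an attack variant against a target model and
    measuring its success rate. Here: a synthetic score that peaks at
    temperature=0.7, depth=5 (both invented parameters)."""
    return -abs(params["temperature"] - 0.7) - 0.1 * abs(params["depth"] - 5)

def mutate(params):
    """Propose a new attack variant by perturbing one parameter."""
    new = dict(params)
    if random.random() < 0.5:
        new["temperature"] += random.uniform(-0.2, 0.2)
    else:
        new["depth"] = max(1, new["depth"] + random.choice([-1, 1]))
    return new

def research_loop(initial, iterations=200):
    """Hill-climbing caricature of the automated research loop:
    generate a variant, test it, keep it only if it improves."""
    best, best_score = initial, evaluate_attack(initial)
    for _ in range(iterations):
        candidate = mutate(best)
        score = evaluate_attack(candidate)
        if score > best_score:  # keep variants that raise the success score
            best, best_score = candidate, score
    return best, best_score

best, score = research_loop({"temperature": 0.0, "depth": 1})
```

The real system differs in kind, not just scale: instead of tuning a fixed parameter vector, it generates and executes new attack code, but the keep-what-works iteration is the same.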

Specific attack techniques include automated optimization of gradient-based token-substitution attacks, automatic composition of multi-step context-manipulation strategies, and novel jailbreak methods that exploit the model's internal representation space. The researchers found that automatically discovered attack combinations are often more efficient and stealthier than manually designed methods.
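To make the token-substitution idea concrete, here is a toy coordinate-wise search. Real gradient-based methods rank candidate substitutions using gradients of the attack objective with respect to token embeddings; this black-box sketch replaces that signal with a hypothetical scoring function (target_score) and a tiny vocabulary, purely for illustration.

```python
# Tiny stand-in vocabulary; a real attack searches the model's full
# token vocabulary.
VOCAB = ["!", "describe", "sure", "ignore", "please", "step"]

def target_score(suffix):
    """Stand-in for the attack objective (e.g. the log-probability that
    the target model begins its reply with a compliant prefix). Here it
    simply counts 'useful' tokens."""
    useful = {"sure", "ignore", "step"}
    return sum(tok in useful for tok in suffix)

def greedy_substitute(suffix, rounds=3):
    """Coordinate-wise greedy search: for each position in the
    adversarial suffix, try every vocabulary token and keep the
    substitution that scores best."""
    suffix = list(suffix)
    for _ in range(rounds):
        for i in range(len(suffix)):
            suffix[i] = max(
                VOCAB,
                key=lambda t: target_score(suffix[:i] + [t] + suffix[i + 1:]),
            )
    return suffix

adv = greedy_substitute(["!", "!", "!", "!"])
```

Replacing the scoring stub with a gradient-ranked shortlist of candidate tokens is what turns this brute-force loop into the gradient-based attacks the paper's system optimizes.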

Experimental Results

In tests against GPT-4, Claude 3.5, Gemini Pro, and Llama 3 70B, Claudini discovered attack algorithms that surpass prior SOTA methods across multiple safety-evaluation benchmarks. Notably, the system identified a novel class of context drift attacks, which gradually shift a model's safety boundaries through carefully designed multi-turn conversations while evading existing safety filters.
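The paper withholds details of the context drift attacks, but the underlying failure mode can be illustrated abstractly: a safety filter that only checks how far each single turn strays from the preceding one can be evaded by many small steps, even when the cumulative drift is large. Every name, threshold, and score below is invented for this illustration.

```python
PER_TURN_THRESHOLD = 0.2  # hypothetical filter: block any single sharp jump

def per_turn_filter(prev_level, new_level):
    """Stub safety filter that flags a turn only if it escalates sharply
    relative to the immediately preceding turn."""
    return (new_level - prev_level) <= PER_TURN_THRESHOLD

def run_conversation(steps):
    """Feed a sequence of per-turn 'escalation levels'; return the final
    level reached before the filter (if ever) triggers."""
    level = 0.0
    for step in steps:
        if not per_turn_filter(level, step):
            break  # filter blocks the turn; conversation stops escalating
        level = step
    return level

# One large jump is blocked immediately...
direct = run_conversation([0.9])
# ...but the same endpoint is reached via gradual multi-turn drift.
gradual = run_conversation([0.15, 0.3, 0.45, 0.6, 0.75, 0.9])
```

The toy makes the defensive lesson explicit: filters need to score the conversation's cumulative trajectory, not just adjacent-turn deltas.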

Security Implications and Ethics

The research sparked extensive discussion in the AI safety community. Automated vulnerability discovery is highly valuable for red-team testing and AI safety evaluation, helping companies identify risks before model release. However, if malicious actors leverage the same technology, it could dramatically lower the technical barrier to attacking LLMs. The authors followed responsible disclosure practices, delaying publication of implementation details for the most destructive attack variants and privately notifying the affected AI companies. The work has also advanced the broader debate over how to balance openness in AI security research against the risk of enabling attacks.