What is the Adversarial Pragmatics evaluation framework?

Adversarial Pragmatics is an AI safety evaluation framework for model behavior that is hard to judge because natural language is ambiguous. Instead of a single pass/fail label, it uses a linguistically controlled taxonomy with 18 seed benchmarks, so that model behavior can be analyzed at a finer grain.

Why does this framework matter?

It aims to separate the root causes of model failures: capability gaps, policy ambiguity, instruction conflicts, or evaluator bias. Naming the cause explicitly makes safety evaluations more transparent and easier to compare across studies and teams.

What are the practical applications and future directions?

The framework can be used to validate LLM judges and safety evaluation pipelines, guide gold-set construction, and support more detailed prompt injection testing. Over the longer term, it is intended to support more robust, interpretable AI systems and more rigorous safety research.

Adversarial Pragmatics: An AI Safety Evaluation Benchmark Based on Instruction Conflicts and Implicit Commands

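This paper proposes the Adversarial Pragmatics evaluation framework, which targets misjudgments in existing large-model safety evaluations that stem from the ambiguity of natural language. Traditional benchmarks often compress complex behavior into a simple pass/fail label, hiding root causes such as capability limits, policy ambiguity, and instruction conflicts. The study builds a linguistically controlled taxonomy comprising 18 seed benchmarks and 54 rows of local seed pilot data, and designs an expert evaluation protocol that distinguishes task success, policy compliance, safety risk, and refusal outcomes. By introducing metrics such as evaluator confidence, diagnostic ambiguity, and taxonomy drift, the framework improves evaluation transparency and provides practical tools for validating safety evaluation pipelines, LLM judges, prompt injection tests, and documentation construction, thereby strengthening the rigor of AI safety research.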
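To make the idea of replacing a single pass/fail label more concrete, here is a minimal Python sketch of what one expert annotation might look like when task success, policy compliance, safety risk, and refusal are recorded separately. Everything in it is an assumption made for illustration: the `Annotation` schema, the `RootCause` labels (including the added `NONE` placeholder), and the `disagreement` function are hypothetical and are not taken from the paper. In particular, `disagreement` is only one possible proxy for ambiguity among evaluators; the summary above does not say how the paper defines diagnostic ambiguity or taxonomy drift.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RootCause(Enum):
    # Causes named in the summary; NONE is a placeholder added for this sketch.
    NONE = "none"
    CAPABILITY_LIMIT = "capability_limit"
    POLICY_AMBIGUITY = "policy_ambiguity"
    INSTRUCTION_CONFLICT = "instruction_conflict"


@dataclass(frozen=True)
class Annotation:
    """One expert's judgment of one model response (hypothetical schema)."""
    task_success: bool
    policy_compliant: bool
    safety_risk: bool
    refused: bool
    root_cause: RootCause
    confidence: float  # evaluator confidence, assumed to lie in [0, 1]


def majority_root_cause(annotations):
    """Return the most common root-cause label and its share of the votes."""
    counts = Counter(a.root_cause for a in annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


def disagreement(annotations):
    """Illustrative ambiguity proxy: share of experts outside the majority."""
    _, share = majority_root_cause(annotations)
    return 1.0 - share


def mean_confidence(annotations):
    """Average evaluator confidence across all annotations of one response."""
    return sum(a.confidence for a in annotations) / len(annotations)


# Four hypothetical experts label the same refusal; they agree on the
# outcome dimensions but not on why the model refused.
annotations = [
    Annotation(False, True, False, True, RootCause.INSTRUCTION_CONFLICT, 1.0),
    Annotation(False, True, False, True, RootCause.INSTRUCTION_CONFLICT, 0.75),
    Annotation(False, True, False, True, RootCause.POLICY_AMBIGUITY, 0.75),
    Annotation(False, True, False, True, RootCause.CAPABILITY_LIMIT, 0.5),
]
label, share = majority_root_cause(annotations)
```

In this made-up example a pass/fail benchmark would record four identical failures, while the separated fields show a compliant refusal whose cause the experts split on: half chose instruction conflict, so the disagreement proxy is 0.5.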

Sources

arXiv