Breaking the Speech-Agent Evaluation Bottleneck: A Reproducible Benchmark Conversion Framework

Speech-based agents have long lacked reliable benchmarks for evaluating tool-calling capabilities. A new framework converts existing text benchmarks into controlled audio evaluation environments without requiring re-annotation of tool schemas or gold labels. Evaluations of seven fully multimodal models on the Confetti and When2Call benchmarks show that performance degradation stems mainly from misinterpreted parameter values embedded in speech. Notably, open-source Qwen3 models with at least 8 billion parameters achieved more than 80% consistency with proprietary evaluations, opening the door to privacy-preserving assessment pipelines.

Background and Context

The rapid deployment of speech-based agents in real-world applications has exposed a critical gap in evaluation methodologies. While text-based benchmarks for tool-use capabilities are mature, there is a distinct lack of reliable, standardized benchmarks for assessing how these agents perform when interacting via audio inputs. Existing tool-use benchmarks rely primarily on text data, which fails to capture the complexities that characterize real-world voice interactions, such as background noise, speaker variability, and prosodic nuance. This disconnect means that high performance on text-based metrics does not necessarily translate to robust performance in speech-based scenarios, where the model must simultaneously handle speech recognition, semantic understanding, and tool execution.

To address this deficiency, recent research introduces a dataset-agnostic, general-purpose framework designed to convert existing text-based benchmarks into controlled audio evaluation environments. The core innovation of this framework lies in its ability to generate high-quality audio evaluation data without requiring the costly and time-consuming re-annotation of tool schemas or gold labels. By leveraging text-to-speech (TTS) synthesis, speaker variation techniques, and environmental noise generation, the framework creates paired text-audio instances that preserve the original dataset's annotation integrity. This approach allows researchers to evaluate multimodal models on their ability to interpret spoken commands and execute tools, providing a more realistic assessment of their operational readiness.
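To make the conversion concrete, the sketch below shows what a converted instance might look like: the original tool schema and gold tool call are carried over untouched, and only synthesized audio plus its synthesis metadata are added. The field names and the `tts.synthesize` interface are hypothetical illustrations, not details taken from the framework itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioEvalInstance:
    """One converted benchmark instance: original annotations plus synthesized audio."""
    instance_id: str
    text_instruction: str              # original text prompt, kept verbatim
    audio_path: str                    # synthesized speech for the same prompt
    tool_schema: dict                  # unchanged tool/function definitions
    gold_tool_call: dict               # unchanged gold label (tool name + parameters)
    speaker_id: str = "default"        # TTS voice used for synthesis
    noise_profile: Optional[str] = None  # e.g. "cafe_10dB"; None for clean audio

def convert_instance(text_item: dict, tts, speaker_id: str,
                     noise_profile: Optional[str]) -> AudioEvalInstance:
    """Synthesize audio for a text benchmark item without touching its annotations."""
    audio_path = tts.synthesize(text_item["instruction"], voice=speaker_id, noise=noise_profile)
    return AudioEvalInstance(
        instance_id=text_item["id"],
        text_instruction=text_item["instruction"],
        audio_path=audio_path,
        tool_schema=text_item["tools"],
        gold_tool_call=text_item["gold_call"],
        speaker_id=speaker_id,
        noise_profile=noise_profile,
    )
```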

The technical implementation of this framework involves a meticulous data transformation strategy. Advanced TTS engines are used to convert text instructions into audio inputs, incorporating diverse speaker timbres, speech rates, and background noises to simulate complex acoustic conditions. This process forces models to demonstrate robustness against potential speech recognition errors. Crucially, the framework strictly retains the original tool invocation structures and parameter values, ensuring that the evaluation focuses on the model's comprehension of speech content and its tool execution logic, rather than merely testing transcription accuracy. This method significantly reduces the cost of building audio benchmarks and offers a verifiable pathway for standardized multimodal assessment.
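One way acoustic perturbation of this kind could be implemented is to overlay background noise on the synthesized waveform at a controlled signal-to-noise ratio. The helper below is a minimal sketch of that idea, not the framework's exact procedure.

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on synthesized speech at a target signal-to-noise ratio."""
    # Tile or trim the noise clip so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```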

Deep Analysis

The study conducted extensive evaluations using seven prominent fully multimodal large language models, including both proprietary and open-source options, across two representative benchmarks: Confetti and When2Call. These benchmarks were selected for their differing levels of task complexity and interaction scenarios. The Confetti benchmark focuses on specific tool-use patterns, while When2Call emphasizes temporal and contextual reasoning in tool invocation. The experimental results revealed that model performance is highly dependent on both the specific architecture and the nature of the task. For instance, Gemini-3.1-Flash-Live achieved the highest score of 70.4 on the Confetti dataset, demonstrating strong capabilities in handling structured tool calls. In contrast, GPT-Realtime-1.5 led the When2Call benchmark with a score of 71.9, indicating superior performance in more complex, context-dependent scenarios.

A key finding of the analysis is the existence of a significant "Text-to-Voice Gap," which measures the performance degradation when transitioning from text to audio inputs. This gap varied considerably across models, ranging from a minimal 1.8-point drop for Qwen3-Omni to a more substantial 4.8-point drop for GPT-Realtime-1.5. This variance highlights that even top-tier models struggle to maintain parity between modalities. Further investigation into failure cases revealed that the primary cause of performance degradation was not speech recognition errors, but rather misunderstandings of parameter values within the speech input. Models frequently confused temporal, spatial, or object attributes when these were conveyed through audio, suggesting that current architectures may not fully integrate prosodic cues with semantic parameter extraction.
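The gap itself is simply the difference in score on the same benchmark under the two input modalities. The snippet below illustrates the computation with hypothetical score pairs chosen only to reproduce the 1.8- and 4.8-point gaps mentioned above; the absolute values are not from the study.

```python
def text_to_voice_gap(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Per-model score drop when moving from text to audio input on the same benchmark."""
    return {model: s["text"] - s["audio"] for model, s in scores.items()}

# Hypothetical score pairs chosen only to illustrate the metric, not values from the study.
example = {
    "model_a": {"text": 74.0, "audio": 72.2},  # 1.8-point gap
    "model_b": {"text": 76.7, "audio": 71.9},  # 4.8-point gap
}
print(text_to_voice_gap(example))
```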

To simulate more complex real-world deployment scenarios, the study introduced ambiguity-based reconstruction stress tests and a no-reference evaluation protocol using large language models as judges. These additional tests aimed to assess how models handle ambiguous or noisy inputs and whether automated evaluation methods could reliably replace human judgment. The results indicated that while models are generally robust to minor acoustic variations, they remain sensitive to semantic ambiguities in parameter values. This insight is crucial for developers, as it points to specific areas in model training and architecture design that require improvement to enhance reliability in noisy, real-world environments.
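A no-reference judging step could look roughly like the following, where an LLM grades a predicted tool call against the spoken instruction alone. The rubric wording, the `llm.generate` interface, and the JSON output contract are all assumptions for illustration; the study's actual judging prompt and scale are not reproduced here.

```python
import json

JUDGE_PROMPT = """You are evaluating a speech agent's tool call.
Spoken instruction (transcript): {instruction}
Available tools: {tools}
Predicted tool call: {prediction}
Score the call from 1 (wrong tool or parameters) to 5 (correct tool, all parameters
faithful to the instruction). Answer with JSON: {{"score": <int>, "reason": "<short reason>"}}."""

def judge_tool_call(llm, instruction: str, tools: list, prediction: dict) -> dict:
    """Ask a (possibly local, open-source) LLM to grade a predicted tool call without a gold reference."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction,
        tools=json.dumps(tools, ensure_ascii=False),
        prediction=json.dumps(prediction, ensure_ascii=False),
    )
    # Assumes the judge model returns a valid JSON object as instructed.
    return json.loads(llm.generate(prompt))
```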

Industry Impact

The implications of this research extend across the open-source community, industrial applications, and future research directions. For the open-source community, the framework provides a reproducible and verifiable diagnostic tool that addresses the high cost and long development cycles associated with building large-scale audio corpora. Researchers can now rapidly assess the baseline tool-use capabilities of new multimodal models without the need for extensive manual data annotation. This democratization of evaluation tools accelerates the iteration cycle for model development and fosters a more competitive and transparent research environment.

From an industrial perspective, the study validates the use of open-source large language models as evaluators, offering a viable path for privacy-preserving assessment. The research found that open-source Qwen3 models with at least 8 billion parameters achieved over 80% consistency with proprietary model evaluations. This high level of agreement suggests that enterprises can utilize open-source models for internal evaluation of their speech agents, thereby avoiding the need to send sensitive data to external proprietary APIs. This capability significantly reduces the risk of data leakage and lowers operational costs, making it easier for organizations to deploy speech agents in sensitive domains such as healthcare and finance.
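Whether the study measures judge agreement as raw verdict overlap or with a more elaborate statistic is not specified here; as a simple illustration, percent agreement between an open-source judge and a proprietary judge could be computed as follows.

```python
def judge_consistency(open_source_verdicts: list, proprietary_verdicts: list) -> float:
    """Fraction of instances where the two judges reach the same pass/fail verdict."""
    assert len(open_source_verdicts) == len(proprietary_verdicts)
    matches = sum(a == b for a, b in zip(open_source_verdicts, proprietary_verdicts))
    return matches / len(open_source_verdicts)

# Example: 0.8 or higher would correspond to the >80% consistency reported for Qwen3 judges.
print(judge_consistency([True, True, False, True, False],
                        [True, True, False, False, False]))  # -> 0.8
```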

Furthermore, the framework's generality allows it to be easily extended to other multimodal tasks, promoting the development of more reliable and transparent speech agents. By providing a standardized method for evaluating tool-use capabilities in audio contexts, the research lays the technical groundwork for building truly practical voice assistants. This standardization is essential for the industry to move beyond experimental prototypes and achieve widespread adoption of speech-based AI in everyday applications, ensuring that these systems can handle the complexities of real-world interactions with confidence and accuracy.

Outlook

Looking ahead, the validation of this evaluation framework marks a significant step toward more rigorous testing of multimodal agents. The identification of parameter value misunderstanding as a primary bottleneck suggests that future research should focus on enhancing the integration of acoustic features with semantic parsing. Improving models' ability to disambiguate temporal and spatial references in speech could substantially reduce the Text-to-Voice Gap. Additionally, the success of using open-source models as judges indicates a trend toward decentralized and privacy-conscious evaluation ecosystems, which will likely become standard practice in industries handling sensitive information.

As the framework is adapted for broader use, it is expected to drive the creation of more diverse and challenging audio benchmarks. These benchmarks will likely incorporate more complex noise profiles, multilingual inputs, and dynamic interaction scenarios to better reflect real-world conditions. The insights gained from these expanded evaluations will inform the next generation of model architectures, leading to speech agents that are not only more accurate but also more robust and adaptable. Ultimately, this research paves the way for a new era of voice AI, where agents can seamlessly and reliably perform complex tasks in any acoustic environment, fulfilling the promise of truly intelligent and accessible voice interfaces.