From Text to Speech: A Reproducible Evaluation Framework for Tool-Calling LLM Agents
A new dataset-agnostic framework transforms text benchmarks into controlled audio tool-call evaluations without requiring re-annotation of tool schemas or gold labels. By leveraging text-to-speech synthesis, speaker variation, and environmental noise, it generates paired text-audio instances while preserving the original annotations. In evaluations of seven multimodal models across the Confetti and When2Call benchmarks, Gemini-3.1-Flash-Live led on Confetti (70.4) while GPT-Realtime-1.5 excelled on When2Call (71.9). Performance degradation stemmed primarily from misinterpretation of parameter values embedded in speech. Notably, the open-source Qwen3 judge model achieved over 80% agreement with proprietary alternatives, enabling privacy-preserving evaluation pipelines.
Background and Context
The rapid integration of large language models into voice-based interfaces has exposed a critical evaluation gap. While text-based benchmarks for tool-calling capabilities are mature, they fail to capture the complexities of real-world acoustic environments. Existing evaluations often assume perfect transcription, ignoring the noise, speaker variation, and prosodic nuances inherent in spoken interaction. This disconnect limits the ability of developers to assess how robust their multimodal agents are when deployed in uncontrolled settings. The research introduces a dataset-agnostic framework designed to bridge this divide by transforming established text benchmarks into controlled audio tool-call evaluations. This approach eliminates the need for costly and time-consuming re-annotation of tool schemas and gold labels, allowing researchers to leverage existing high-quality text datasets such as Confetti and When2Call.
The core innovation lies in the systematic conversion of text instructions into audio inputs via text-to-speech synthesis. By preserving the original annotation information, including tool names, parameters, and their specific values, the framework ensures semantic consistency between the text and audio modalities. This significantly lowers the barrier to building audio benchmarks and provides a standardized test bed for evaluating fully multimodal large language models. The study aims to quantify the performance degradation that occurs when moving from text to speech, thereby attributing failures to weaknesses in speech understanding rather than to logical reasoning errors. This shift from text simulation to real audio scenarios marks a pivotal step in the maturation of speech agent technology.
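As an illustration, the conversion can be pictured as re-rendering only the instruction while copying the gold annotation unchanged. The sketch below shows one plausible shape for such a step in Python; the `ToolCallInstance` fields and the injected `synthesize` callable are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class ToolCallInstance:
    """One benchmark example; the gold annotation is shared by both modalities."""
    instruction_text: str          # original text instruction
    gold_tool: str                 # expected tool name
    gold_params: dict              # expected parameter names and values
    audio_path: str | None = None  # filled in after synthesis

def to_audio_instance(inst: ToolCallInstance, synthesize, voice: str,
                      out_path: str) -> ToolCallInstance:
    """Render the instruction as speech; gold labels are copied, never re-annotated."""
    synthesize(text=inst.instruction_text, voice=voice, output_path=out_path)
    return ToolCallInstance(
        instruction_text=inst.instruction_text,
        gold_tool=inst.gold_tool,
        gold_params=dict(inst.gold_params),  # annotation preserved verbatim
        audio_path=out_path,
    )
```

Because the gold labels never change, any score difference between the text instance and its audio twin can be attributed to the modality shift rather than to annotation drift.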
Deep Analysis
The evaluation methodology employs a rigorous technical pipeline to simulate realistic voice interactions. The framework uses text-to-speech engines to generate audio inputs, introducing deliberate variation in speaker identity and environmental noise to test model robustness. This design ensures that the generated audio instances are not merely clean synthetic readings but challenging test cases that reflect the variability of human speech. The study evaluated seven prominent multimodal models, including Gemini-3.1-Flash-Live, GPT-Realtime-1.5, and Qwen3-Omni, across the Confetti and When2Call benchmarks. Results depended strongly on both model architecture and task: Gemini-3.1-Flash-Live achieved the highest Confetti score at 70.4 points, while GPT-Realtime-1.5 led When2Call at 71.9 points.
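To make the robustness controls concrete, here is a minimal sketch of how background noise might be overlaid on synthesized speech at a target signal-to-noise ratio. It assumes waveforms as NumPy arrays and is not drawn from the paper's code.

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on synthesized speech at a target SNR (in dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12           # avoid division by zero
    # Solve SNR = 10*log10(p_speech / (scale^2 * p_noise)) for the noise scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    return mixed / max(1.0, np.max(np.abs(mixed)))  # normalize to avoid clipping

# One text instruction can then yield many audio variants, e.g.:
#   for voice in voices:
#       for snr in (20, 10, 5):
#           audio = synthesize(text, voice); audio = mix_noise(audio, babble, snr)
```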
A detailed analysis of the gap between text and voice modalities revealed that the primary cause of degradation is not a failure of tool-calling logic, but a misunderstanding of parameter values embedded in the speech. Models often fail to accurately extract and interpret numerical or categorical parameters when they arrive as audio, leading to incorrect tool executions. The study also ran ambiguity-based rephrasing stress tests to evaluate how models handle vague or complex instructions. These tests further highlighted the sensitivity of current models to acoustic distortion and speaker variation. Ablation experiments confirmed that introducing noise and speaker diversity significantly degrades performance, validating the framework's ability to expose vulnerabilities that text-only benchmarks miss.
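A simple way to operationalize this error attribution in an evaluation harness is to check the tool name and the parameter values separately, so that "wrong tool" (a logic failure) and "misheard parameter" (a speech-understanding failure) are counted apart. The helper below is a hypothetical sketch, not the paper's scorer.

```python
def classify_failure(pred_tool: str, pred_params: dict,
                     gold_tool: str, gold_params: dict) -> str:
    """Attribute an error to tool selection vs. parameter values."""
    if pred_tool != gold_tool:
        return "wrong_tool"                        # tool-calling logic failure
    wrong = sorted(k for k, v in gold_params.items()
                   if pred_params.get(k) != v)
    if wrong:
        return f"wrong_params:{wrong}"             # e.g. a misheard number or name
    return "correct"
```

Under this breakdown, the study's headline finding is that the text-to-audio gap is dominated by the `wrong_params` bucket rather than the `wrong_tool` one.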
To streamline the evaluation process, the research implemented a reference-free LLM-as-judge protocol. The automated judge was validated against human preference judgments to establish its reliability. A key finding of this validation was the high consistency between the open-source Qwen3 judge model and proprietary judge models, with an agreement rate above 80%. This result is particularly significant because it suggests that open-source models can serve as effective proxies for proprietary ones in automated evaluation pipelines. LLM-as-judge scoring reduces reliance on manual annotation, enabling scalable and reproducible assessments. The text-only baseline results provided a clear reference point, allowing the team to isolate the specific impact of the audio modality on model performance.
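For intuition, the agreement figure reduces to a per-instance comparison of verdicts from two judges. A minimal sketch, assuming each judge emits a PASS/FAIL verdict per instance; the prompt wording is invented for illustration and is not the paper's actual judge prompt.

```python
# Illustrative reference-free judge prompt: the judge sees the request and the
# model's tool call, but no gold answer.
JUDGE_PROMPT = """You are grading a voice assistant's tool call.
User request (transcript): {request}
Available tools: {tools}
Model's tool call: {call}
Is the tool choice appropriate and are all parameter values correct?
Answer PASS or FAIL."""

def agreement_rate(verdicts_a: list[str], verdicts_b: list[str]) -> float:
    """Share of instances on which two judges (e.g. Qwen3 vs. a proprietary
    model) return the same verdict; the study reports >80% here."""
    assert len(verdicts_a) == len(verdicts_b)
    return sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / len(verdicts_a)
```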
Industry Impact
The introduction of this reproducible evaluation framework has profound implications for both the open-source community and industrial developers. By providing a standardized method for assessing tool-calling capabilities in voice environments, it facilitates fair comparisons between different multimodal models. This standardization is crucial for driving competition and innovation in the field. For industrial applications, the framework helps developers accurately gauge the readiness of their models for real-world deployment. It highlights specific areas of weakness, such as parameter extraction in noisy environments, allowing for targeted improvements. The ability to evaluate models without re-annotating datasets accelerates the development cycle, enabling faster iteration and optimization.
Furthermore, the framework supports privacy-preserving evaluation practices. The high agreement between the open-source Qwen3 judge and proprietary models means that companies can use open-source judges to evaluate their models without exposing sensitive data to proprietary APIs. This reduces the risk of data leakage and lowers the cost of evaluation. The findings also inform the design of future speech agents, emphasizing the need for improved speech-to-text accuracy and robust parameter extraction mechanisms. By shifting the focus from text-based logic to audio-based understanding, the research encourages the development of models that are truly capable of handling the complexities of spoken language. This shift is essential for creating voice agents that can operate reliably in diverse and challenging acoustic environments.
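In practice, keeping evaluation in-house can be as simple as pointing the judge client at a locally served open-source model. A hedged sketch, assuming Qwen3 is hosted behind an OpenAI-compatible endpoint (for example via vLLM); the URL and model name are deployment-specific placeholders.

```python
from openai import OpenAI

# Point the same judging code at a locally hosted open-source model instead of
# a proprietary API, so transcripts and tool calls never leave the machine.
local_judge = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def judge(prompt: str) -> str:
    """Return a PASS/FAIL verdict from the locally served judge model."""
    resp = local_judge.chat.completions.create(
        model="Qwen3-32B",  # placeholder: whichever judge checkpoint is served
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,    # deterministic verdicts for reproducibility
    )
    return resp.choices[0].message.content.strip()
```

Because the endpoint is OpenAI-compatible, the same harness can be A/B tested against a proprietary judge by swapping only the `base_url` and model name.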
The industry impact extends to the broader ecosystem of AI research. The framework provides a reusable infrastructure that can be adapted to new benchmarks and tasks. This flexibility ensures that the evaluation methods remain relevant as new models and challenges emerge. The emphasis on reproducibility and verification sets a new standard for benchmarking in the multimodal AI space. It encourages researchers to move beyond simple accuracy metrics and consider the robustness and reliability of their models in real-world scenarios. This holistic approach to evaluation is critical for building trust in AI systems and ensuring their safe and effective deployment.
Outlook
Looking ahead, the framework establishes a new paradigm for evaluating speech agents, moving beyond the limitations of text-based benchmarks. The identification of parameter value misunderstanding as a primary failure mode points to a clear direction for future research and development. Improving the robustness of speech recognition and parameter extraction in noisy environments will be a key priority for model developers. The high consistency of the open-source Qwen3 judge suggests that automated, privacy-preserving evaluation will become more prevalent, reducing the dependency on proprietary tools. This trend could democratize access to high-quality evaluation metrics, fostering greater innovation in the open-source community.
The success of this framework in revealing the text-to-voice performance gap underscores the need for more sophisticated multimodal models. Future iterations of this research may explore more complex acoustic scenarios, such as overlapping speech or heavy background noise, to further stress-test model capabilities. The integration of additional stress tests, such as those based on ambiguity, will likely become standard practice in the evaluation of voice agents. As the field evolves, the ability to seamlessly convert text benchmarks into audio evaluations will be invaluable for keeping pace with the rapid development of new models.
Ultimately, this research contributes to the broader goal of creating reliable and trustworthy AI agents. By providing a rigorous and reproducible method for assessing tool-calling capabilities in voice, it helps bridge the gap between theoretical performance and practical utility. The framework serves as a foundational tool for the next generation of voice AI, enabling developers to build systems that are not only intelligent but also robust and reliable in real-world conditions. As voice interfaces become increasingly ubiquitous, the importance of such evaluation frameworks will only grow, ensuring that AI systems can meet the demands of users in diverse and dynamic environments.