What is TRADE and what problem does it solve?

TRADE is a Speech LLM architecture that adds a Transducer branch sharing the audio encoder and uses LLM hidden states as the Transducer's prediction network. It targets the lack of acoustic frame alignment when Speech LLMs run in streaming mode, coupling frame-synchronous acoustic alignment with language reasoning during streaming inference.
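To make "frame-synchronous" concrete, here is a minimal sketch of greedy decoding for a generic Transducer, not TRADE's actual implementation. The names `greedy_transducer_decode`, `toy_predict`, and `toy_joint`, the blank index, and the per-frame symbol cap are all assumptions for illustration. The `predict` callback stands in for the role that LLM hidden states would play as the prediction network.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed index of the Transducer blank symbol


def greedy_transducer_decode(
    encoder_frames: Sequence[Sequence[float]],
    predict: Callable[[List[int]], Sequence[float]],
    joint: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
    max_symbols_per_frame: int = 3,
) -> List[int]:
    """Frame-synchronous greedy decoding for a generic Transducer."""
    tokens: List[int] = []
    for frame in encoder_frames:  # advance one acoustic frame at a time
        for _ in range(max_symbols_per_frame):
            state = predict(tokens)  # stand-in for an LLM hidden state
            scores = joint(frame, state)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == BLANK:  # blank: move on to the next frame
                break
            tokens.append(best)  # non-blank: emit and stay on this frame
    return tokens


def toy_predict(tokens: List[int]) -> Sequence[float]:
    # Hypothetical prediction network: encodes only the last emitted token.
    return [float(tokens[-1])] if tokens else [0.0]


def toy_joint(frame: Sequence[float], state: Sequence[float]) -> Sequence[float]:
    # Hypothetical joint scorer over 3 symbols (index 0 is blank).
    # frame[0] names a target token; it is emitted once, then blank wins.
    target = int(frame[0])
    scores = [0.5, 0.0, 0.0]
    if target != BLANK and int(state[0]) != target:
        scores[target] = 1.0
    return scores


frames = [[1.0], [0.0], [2.0], [2.0]]
print(greedy_transducer_decode(frames, toy_predict, toy_joint))  # [1, 2]
```

Because every emitted token is tied to the frame on which it was produced, the output stays aligned with the audio timeline, which is the property a streaming system needs.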

Why does TRADE matter for real-time voice applications?

It achieves a 6.71% average word error rate (WER) on the Open ASR leaderboard and 8.40% in strict streaming mode, handles long-form audio without external segmentation, and improves end-of-utterance detection F1 by 0.03. Together these results address a key deployment bottleneck for Speech LLMs in latency-sensitive scenarios.
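For readers unfamiliar with the metric, WER is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` is an assumption, and the text normalization applied before scoring in the reported results is not described here, so this version only splits on whitespace.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A score of 6.71% therefore means roughly 6 to 7 word errors per 100 reference words, averaged over the evaluation sets.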

What should we watch for in TRADE's future development?

The architecture is expected to extend to multilingual and multimodal interaction, enable long-form audio processing on resource-constrained edge devices, and further address computational bottlenecks in long-context speech understanding.

TRADE: A Transducer-Enhanced Streaming Inference Scheme for Speech LLMs

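To address the lack of acoustic frame alignment in streaming inference for Speech LLMs, this paper proposes the TRADE architecture. By introducing a Transducer branch that shares the audio encoder and by using LLM hidden states, it tightly couples frame-synchronous acoustic alignment with language reasoning. The scheme combines dual-vocabulary fusion, chunk-synchronous streaming training, and a local decoder audio attention mechanism, which together reduce memory usage and eliminate the mismatch between training and inference. Experiments show that TRADE reaches an average word error rate of 6.71% on the Open ASR leaderboard, performs strongly on long-form audio tasks, and significantly improves the accuracy of end-of-utterance detection.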
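The summary does not say how TRADE builds its chunks or restricts attention, so the following is only a sketch of a pattern common in streaming ASR: a chunk-wise attention mask in which each audio frame sees its own chunk plus a limited number of earlier chunks. The function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are assumptions for illustration.

```python
from typing import List


def chunk_attention_mask(
    num_frames: int, chunk_size: int, left_chunks: int = 1
) -> List[List[bool]]:
    """Return mask[q][k], True when frame q may attend to frame k."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask: List[List[bool]] = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        first_chunk = max(0, q_chunk - left_chunks)  # bounded history
        # A frame never sees chunks that arrive after its own chunk.
        row = [first_chunk <= k // chunk_size <= q_chunk for k in range(num_frames)]
        mask.append(row)
    return mask


for row in chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1):
    print("".join("x" if allowed else "." for allowed in row))
```

Bounding the visible history keeps attention cost per chunk constant, and applying the same mask during training and inference is one general way to keep the two consistent. Whether TRADE does exactly this is not stated here.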