SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
Authors: Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin
Published: 2025-10-16 13:19:07+00:00
AI Summary
This paper introduces SpeechLLM-as-Judges, a novel paradigm that leverages large language models (LLMs) for structured and explanation-based speech quality evaluation. The authors develop SpeechEval, a large-scale multilingual dataset covering four speech evaluation tasks, and train SQ-LLM, a speech-quality-aware LLM, using chain-of-thought reasoning and reward optimization. SQ-LLM demonstrates strong, interpretable performance across diverse tasks and languages, highlighting the potential of this LLM-as-judge approach.
Abstract
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.
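The abstract does not specify the model's input or output interface. As a rough illustration of what a structured, explanation-based judgment over the four SpeechEval tasks might look like, the Python sketch below defines a hypothetical result schema and prompt builder. All names here (SQJudgment, build_judge_prompt, the 1-5 score scale) are assumptions for illustration, not the authors' actual interface.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical structured verdict an LLM-as-judge could emit instead of a bare
# scalar MOS: a score, a chain-of-thought rationale, and (for the
# improvement-suggestion task) a concrete revision hint. Field names are
# illustrative, not the paper's schema.
@dataclass
class SQJudgment:
    task: str                         # one of the four SpeechEval tasks
    score: Optional[float]            # e.g. 1-5 quality rating; None for non-scoring tasks
    verdict: Optional[str]            # e.g. "A"/"B" for pairwise, "real"/"fake" for deepfakes
    rationale: str                    # explanation grounding the decision
    suggestion: Optional[str] = None  # improvement advice, when requested

TASKS = ("quality_assessment", "pairwise_comparison",
         "improvement_suggestion", "deepfake_detection")

def build_judge_prompt(task: str, transcript: str) -> str:
    """Assemble a text prompt asking the model to reason step by step and
    return a structured judgment. A real system would also pass the audio
    (e.g. audio-encoder tokens); here the transcript alone stands in."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return (
        f"Task: {task}\n"
        f"Transcript: {transcript}\n"
        "Think step by step about intelligibility, naturalness, and artifacts, "
        "then output: score (1-5 or n/a), verdict, rationale, suggestion."
    )

if __name__ == "__main__":
    print(build_judge_prompt("quality_assessment", "The quick brown fox."))
```

The design point this sketch tries to capture is the paper's shift from a single scalar (or binary) output to a typed, multi-field judgment in which the rationale is a first-class part of the evaluation.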