SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

Authors: Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

Published: 2025-10-16 13:19:07+00:00

AI Summary

This paper presents SpeechLLM-as-Judges, a novel paradigm that leverages large language models (LLMs) for general, structured, and explanation-based speech quality evaluation across diverse tasks. The authors introduce SpeechEval, a large-scale multilingual dataset spanning four evaluation tasks, and develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought (CoT) reasoning and reward optimization. SQ-LLM demonstrates strong performance, interpretability, and generalization across multiple evaluation scenarios, including deepfake detection.

Abstract

Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.


Key findings
SQ-LLM achieved the best overall performance against strong multimodal-LLM and expert-system baselines across all four tasks, showing the highest agreement with human ratings (PCC/ACC) and the strongest text-generation metrics. On deepfake speech detection (DSD), SQ-LLM set state-of-the-art results with an EER of 6.249% and a minDCF of 0.142. The combination of CoT reasoning and GRPO optimization was critical for robustness and for improving subjective-judgment capability.
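For readers unfamiliar with the DSD metrics, below is a minimal sketch of how EER and a simple minDCF could be computed from detection scores. The scores and labels are synthetic placeholders, and the minDCF cost parameters (p_target, c_miss, c_fa) are illustrative assumptions; the paper's exact evaluation protocol (e.g., ASVspoof-style costs) may differ.

```python
# Sketch: EER and a simple normalized minDCF from detection scores.
# labels: 1 = bona fide, 0 = spoof; higher score = more likely bona fide.
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: error rate at the threshold where false acceptance == false rejection."""
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        decisions = scores >= t                 # accept as bona fide
        fars.append(np.mean(decisions[labels == 0]))   # spoofs accepted
        frrs.append(np.mean(~decisions[labels == 1]))  # bona fide rejected
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))        # closest crossing point
    return float((fars[idx] + frrs[idx]) / 2)

def compute_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over thresholds (assumed cost parameters)."""
    costs = []
    for t in np.sort(np.unique(scores)):
        decisions = scores >= t
        p_miss = np.mean(~decisions[labels == 1])
        p_fa = np.mean(decisions[labels == 0])
        costs.append(c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa)
    # Normalize by the cost of the best trivial accept-all/reject-all system.
    c_default = min(c_miss * p_target, c_fa * (1 - p_target))
    return float(min(costs) / c_default)

# Toy usage with synthetic, well-separated score distributions.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
print(f"EER: {compute_eer(scores, labels):.3%}, minDCF: {compute_min_dcf(scores, labels):.3f}")
```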
Approach
The authors develop SQ-LLM, a unified LLM built on Qwen2.5-Omni and tailored for instruction-based speech quality tasks. Training proceeds in two stages: (1) instruction tuning with CoT reasoning, using predefined quality dimensions as intermediate supervision signals, and (2) reward optimization via Group Relative Policy Optimization (GRPO), driven by a multi-aspect reward evaluator that scores Helpfulness, Relevance, Accuracy, and Detail.
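The summary does not give the exact GRPO formulation or reward weighting used, but the core of GRPO is a group-relative advantage: each sampled response's reward is normalized against the mean and standard deviation of its own group of candidates, removing the need for a learned value model. Below is a minimal sketch under assumed equal weights for the four reward aspects; the function names, aggregation scheme, and toy scores are hypothetical.

```python
# Sketch: GRPO-style group-relative advantages with a multi-aspect reward.
# Equal aspect weights and simple averaging are assumptions; the paper's
# reward evaluator and exact GRPO objective may differ.
import numpy as np

ASPECTS = ("helpfulness", "relevance", "accuracy", "detail")

def aggregate_reward(aspect_scores: dict) -> float:
    """Collapse the four aspect scores into one scalar reward (equal weights)."""
    return float(np.mean([aspect_scores[a] for a in ASPECTS]))

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each candidate's reward against its own group's statistics."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Toy usage: four candidate evaluations sampled for one speech clip, each
# scored by a (hypothetical) reward evaluator on the four aspects.
candidates = [
    {"helpfulness": 0.9, "relevance": 0.8, "accuracy": 0.7, "detail": 0.6},
    {"helpfulness": 0.4, "relevance": 0.5, "accuracy": 0.6, "detail": 0.5},
    {"helpfulness": 0.7, "relevance": 0.7, "accuracy": 0.9, "detail": 0.8},
    {"helpfulness": 0.3, "relevance": 0.4, "accuracy": 0.4, "detail": 0.3},
]
rewards = np.array([aggregate_reward(c) for c in candidates])
print("advantages:", grpo_advantages(rewards))  # positive = better than group mean
```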
Datasets
SpeechEval (32,207 speech clips, 128,754 annotations), augmented with samples from ASVspoof2019-LA, VCC2018, BC2019, and BVCC.
Model(s)
SQ-LLM (built on Qwen2.5-Omni-7B), utilizing instruction tuning and GRPO.
Author countries
China, US