SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

Authors: Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

Published: 2025-10-16 13:19:07+00:00

AI Summary

This paper introduces SpeechLLM-as-Judges, a novel paradigm that leverages large language models (LLMs) for structured and explanation-based speech quality evaluation. The authors develop SpeechEval, a large-scale multilingual dataset for four speech evaluation tasks, and train SQ-LLM, a speech-quality-aware LLM, using chain-of-thought reasoning and reward optimization. SQ-LLM demonstrates strong, interpretable performance across diverse tasks and languages, highlighting the potential of this LLM-as-judge approach.

Abstract

Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.


Key findings
SQ-LLM consistently achieved state-of-the-art performance across all four evaluation tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. It demonstrated strong alignment with human judgments, robust multilingual capabilities, and high interpretability. The ablation studies confirmed that both Chain-of-Thought reasoning and GRPO-based reward optimization were crucial for the model's superior accuracy, interpretability, and generalization across tasks and languages.
Approach
The authors developed SQ-LLM, a speech-quality-aware large language model built upon Qwen2.5-Omni, which integrates a speech encoder and an LLM decoder. It is trained on the new SpeechEval dataset, which features over 32,000 multilingual speech clips and over 128,000 annotations for tasks including quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. The training regimen involves instruction tuning with Chain-of-Thought reasoning, followed by refinement through reward optimization via Group Relative Policy Optimization (GRPO).
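The core of GRPO-style reward optimization is scoring a group of sampled responses per prompt and normalizing each reward against the group's statistics. The sketch below illustrates that group-relative advantage computation; the reward values, group size, and function name are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the group-relative advantage step used in GRPO
# (Group Relative Policy Optimization). Rewards here are hypothetical
# scores, e.g. agreement between a sampled quality judgment and a
# reference human rating.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one evaluation prompt, each scored by a
# reward function; responses above the group mean get positive advantage.
rewards = [0.9, 0.4, 0.7, 0.2]
advs = grpo_advantages(rewards)
```

Responses with positive advantage are reinforced and those below the group mean are suppressed, without needing a separate learned value model.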
Datasets
SpeechEval (created by the authors, integrating samples from various public corpora and synthetic speech systems), ASVspoof2019-LA, VCC2018, BC2019, BVCC, NISQA, QualiSpeech, ALLD-dataset.
Model(s)
SQ-LLM (based on Qwen2.5-Omni, with a speech encoder and LLM decoder). Baselines include Qwen2-Audio-7B-Instruct, MiDashengLM-7B, Qwen3-8B + Whisper, Qwen2.5 + Audiobox, Qwen3-4B + WavLM, RawNet2, AASIST, AASIST2.
Author countries
China, USA