QAMO: Quality-aware Multi-centroid One-class Learning For Speech Deepfake Detection

Authors: Duc-Tuan Truong, Tianchi Liu, Ruijie Tao, Junjie Li, Kong Aik Lee, Eng Siong Chng

Published: 2025-09-25 02:27:49+00:00

Comment: 5 pages, 4 figures

AI Summary

This paper proposes QAMO (Quality-Aware Multi-Centroid One-Class Learning) for speech deepfake detection, which addresses the limitations of single-centroid one-class models by introducing multiple centroids, each representing a distinct speech quality subspace. This approach better models intra-class variability in bona fide speech and supports a multi-centroid ensemble scoring strategy for improved decision thresholding. QAMO achieves a 5.09% EER on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.

Abstract

Recent work shows that one-class learning can detect unseen deepfake attacks by modeling a compact distribution of bona fide speech around a single centroid. However, the single-centroid assumption can oversimplify the bona fide speech representation and overlook useful cues, such as speech quality, which reflects the naturalness of the speech. Speech quality can be easily obtained using existing speech quality assessment models that estimate it through Mean Opinion Score. In this paper, we propose QAMO: Quality-Aware Multi-Centroid One-Class Learning for speech deepfake detection. QAMO extends conventional one-class learning by introducing multiple quality-aware centroids. In QAMO, each centroid is optimized to represent a distinct speech quality subspace, enabling better modeling of intra-class variability in bona fide speech. In addition, QAMO supports a multi-centroid ensemble scoring strategy, which improves decision thresholding and reduces the need for quality labels during inference. With two centroids representing high- and low-quality speech, our proposed QAMO achieves an equal error rate of 5.09% on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.


Key findings
QAMO significantly improves performance over conventional one-class learning and prior quality-aware systems, achieving an EER of 5.09% on the In-the-Wild dataset and best results on ASVspoof2021 DF and FoR when integrated with XLSR-Conformer-TCM. The multi-centroid modeling, combined with a quality classification loss and an ensemble-score inference strategy, proves crucial for enhancing robustness and achieving more balanced detection across diverse deepfake attacks and acoustic conditions.
Approach
QAMO extends conventional one-class learning by employing multiple quality-aware centroids, each optimized to represent a distinct speech quality subspace (e.g., high/low quality) through a quality-level classification objective using AM-Softmax loss. The system combines this with a modified OC-Softmax loss. For inference, QAMO utilizes a multi-centroid ensemble scoring strategy, averaging distances across all centroids to produce a stable countermeasure score without requiring explicit quality labels.
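The ensemble scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the countermeasure score is the average cosine similarity between a test embedding and the learned quality-aware centroids (two in the paper's best configuration), so no quality label is needed at inference time. The function name and array shapes are hypothetical.

```python
import numpy as np

def ensemble_score(embedding: np.ndarray, centroids: np.ndarray) -> float:
    """Average cosine similarity between one embedding and all centroids.

    embedding: (d,) speech embedding from the detector backbone.
    centroids: (k, d) quality-aware centroids (e.g., k=2 for high/low quality).
    Higher scores indicate the embedding lies closer to the bona fide
    subspaces; thresholding this score yields the bona fide/spoof decision.
    """
    emb = embedding / np.linalg.norm(embedding)
    cents = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarities = cents @ emb          # (k,) cosine similarity per centroid
    return float(similarities.mean())   # ensemble: average over centroids
```

Averaging over centroids, rather than picking the nearest one, gives a single stable score whose threshold does not depend on which quality subspace a test utterance falls into.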
Datasets
ASVspoof2019 Logical Access (LA), ASVspoof2021 Logical Access (LA), ASVspoof2021 DeepFake (DF), In-the-Wild (ITW), Fake-or-Real (FoR) norm-test subset
Model(s)
XLSR-Conformer-TCM, XLSR-Nes2NetX
Author countries
Singapore, Hong Kong