XMUspeech Systems for the ASVspoof 5 Challenge

Authors: Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong

Published: 2025-09-05 15:16:48+00:00

AI Summary

The XMUspeech systems for the ASVspoof 5 Challenge focus on speech deepfake detection, noting that increased audio duration significantly improves performance. The approach integrates advanced models like AASIST, HM-Conformer, Hubert, and Wav2vec2 with an adaptive multi-scale feature fusion method and optimized one-class loss functions. Their final fusion system achieved competitive results in both closed (minDCF 0.4783, EER 20.45%) and open conditions (minDCF 0.2245, EER 9.36%).

Abstract

In this paper, we present our XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has increased significantly, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, to obtain artifact-related information, we trained self-supervised models as feature extractors on a dataset containing spoofed utterances, and we applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with a hand-crafted feature to enhance detection capability. In addition, we conducted extensive experiments on one-class loss functions and provide optimized configurations that better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
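The adaptive multi-scale feature fusion described in the abstract can be sketched as a learned weighted sum over Transformer layer outputs, concatenated with a hand-crafted feature such as LFCC. This is an illustrative reconstruction, not the paper's implementation; the class name, dimensions, and concatenation-plus-projection fusion are assumptions.

```python
import torch
import torch.nn as nn


class AdaptiveMultiScaleFusion(nn.Module):
    """Hypothetical sketch of AMFF: softmax-normalized learnable weights
    select across self-supervised Transformer layers, and the weighted
    feature is fused with a hand-crafted feature (e.g. LFCC)."""

    def __init__(self, num_layers: int, ssl_dim: int, lfcc_dim: int, out_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.proj = nn.Linear(ssl_dim + lfcc_dim, out_dim)          # fuse by concat + projection

    def forward(self, layer_feats: torch.Tensor, lfcc: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, ssl_dim); lfcc: (batch, time, lfcc_dim)
        w = torch.softmax(self.layer_weights, dim=0)             # normalize layer weights
        ssl = (w[:, None, None, None] * layer_feats).sum(dim=0)  # weighted sum over layers
        return self.proj(torch.cat([ssl, lfcc], dim=-1))         # fused representation
```

In practice the fused representation would feed a backbone classifier such as AASIST; here the module only shows the fusion step itself.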


Key findings

A significant finding was that increasing the input audio duration to 10 seconds substantially improved system performance, with the baseline AASIST showing a 24% minDCF improvement. The integration of self-supervised models (Hubert, Wav2vec2) with an adaptive multi-scale feature fusion module and hand-crafted features led to robust performance and mitigated overfitting. The final fusion system achieved a minDCF of 0.4783 and EER of 20.45% in the closed condition, and a minDCF of 0.2245 and EER of 9.36% in the open condition, demonstrating significant improvements over baselines.
Approach

The approach uses backbone models such as AASIST, HM-Conformer, Hubert, and Wav2vec2, employing an Adaptive Multi-scale Feature Fusion (AMFF) method to integrate features from multiple Transformer layers with hand-crafted features such as LFCC. Systems are trained with optimized one-class loss functions (OC-Softmax, SAMO), with SAMO modified to compute similarity to the nearest speaker attractor during training, better aligning it with the anti-spoofing task. A logistic regression-based ensemble combines multiple subsystems for the final deepfake detection decision.
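The one-class losses mentioned above can be illustrated with OC-Softmax, which scores embeddings against a single bona fide center and applies asymmetric margins to the two classes. This is a generic sketch of the standard OC-Softmax formulation; the margin values, scale factor, and class names are typical defaults and assumptions, not the paper's tuned configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OCSoftmax(nn.Module):
    """Sketch of the one-class softmax (OC-Softmax) loss for anti-spoofing:
    bona fide embeddings are pushed above margin m0 in cosine similarity to
    a learned center, while spoofed embeddings are pushed below margin m1."""

    def __init__(self, feat_dim: int, m0: float = 0.9, m1: float = 0.2, alpha: float = 20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, feat_dim))  # single bona fide attractor
        self.m0, self.m1, self.alpha = m0, m1, alpha

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim); labels: (batch,) with 0 = bona fide, 1 = spoof
        score = (F.normalize(x, dim=1) @ F.normalize(self.center, dim=1).t()).squeeze(1)
        # bona fide penalized when score < m0; spoof penalized when score > m1
        margin = torch.where(labels == 0, self.m0 - score, score - self.m1)
        return F.softplus(self.alpha * margin).mean()  # log(1 + exp(.)) averaged over batch
```

SAMO generalizes this idea by maintaining one attractor per speaker; the paper's modification, computing similarity to the nearest speaker attractor during training, would replace the single `center` with a set of per-speaker centers and a max over their scores.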
Datasets

ASVspoof 5 Challenge database (training, development, progress datasets), LibriSpeech, LibriTTS (used to generate additional spoofing data with HiFi-GAN, BigVGAN, and ReflowTTS).
Model(s)

AASIST, HM-Conformer, Hubert, Wav2vec2
Author countries

China