XMUspeech Systems for the ASVspoof 5 Challenge

Authors: Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong

Published: 2025-09-05 15:16:48+00:00

AI Summary

The XMUspeech systems for the ASVspoof 5 Challenge focus on speech deepfake detection, noting that increased audio duration significantly improves performance. The approach integrates advanced models like AASIST, HM-Conformer, Hubert, and Wav2vec2 with an adaptive multi-scale feature fusion method and optimized one-class loss functions. Their final fusion system achieved competitive results in both closed (minDCF 0.4783, EER 20.45%) and open conditions (minDCF 0.2245, EER 9.36%).

Abstract

In this paper, we present our XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has increased significantly, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, to obtain artifact-related information, we trained self-supervised models as feature extractors on a dataset containing spoofed utterances, and we applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with a hand-crafted feature to enhance detection capability. In addition, we conducted extensive experiments on one-class loss functions and provide optimized configurations that better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
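The adaptive multi-scale feature fusion described in the abstract can be sketched as a learned weighted sum over Transformer layer outputs, concatenated with a hand-crafted feature such as LFCC. This is an illustrative reconstruction, not the paper's implementation; the class name, dimensions, and concatenation-plus-projection fusion are assumptions.

```python
import torch
import torch.nn as nn


class AdaptiveMultiScaleFusion(nn.Module):
    """Hypothetical sketch of AMFF: softmax-normalized learnable weights
    select across self-supervised Transformer layers, and the weighted
    feature is fused with a hand-crafted feature (e.g. LFCC)."""

    def __init__(self, num_layers: int, ssl_dim: int, lfcc_dim: int, out_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.proj = nn.Linear(ssl_dim + lfcc_dim, out_dim)          # fuse by concat + projection

    def forward(self, layer_feats: torch.Tensor, lfcc: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, ssl_dim); lfcc: (batch, time, lfcc_dim)
        w = torch.softmax(self.layer_weights, dim=0)             # normalize layer weights
        ssl = (w[:, None, None, None] * layer_feats).sum(dim=0)  # weighted sum over layers
        return self.proj(torch.cat([ssl, lfcc], dim=-1))         # fused representation
```

In practice the fused representation would feed a backbone classifier such as AASIST; here the module only shows the fusion step itself.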


Key findings

A significant finding was that increasing the input audio duration to 10 seconds substantially improved system performance, with the baseline AASIST showing a 24% minDCF improvement. The integration of self-supervised models (Hubert, Wav2vec2) with an adaptive multi-scale feature fusion module and hand-crafted features led to robust performance and mitigated overfitting. The final fusion system achieved a minDCF of 0.4783 and EER of 20.45% in the closed condition, and a minDCF of 0.2245 and EER of 9.36% in the open condition, demonstrating significant improvements over baselines.
Approach

The approach uses backbone models such as AASIST, HM-Conformer, Hubert, and Wav2vec2, employing an Adaptive Multi-scale Feature Fusion (AMFF) method to integrate features from multiple Transformer layers with hand-crafted features such as LFCC. Systems are trained with optimized one-class loss functions (OC-Softmax, SAMO), with SAMO modified to compute similarity to the nearest speaker attractor during training, better aligning it with the anti-spoofing task. A logistic regression-based ensemble combines multiple subsystems for the final deepfake detection decision.
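The one-class losses mentioned above can be illustrated with OC-Softmax, which scores embeddings against a single bona fide center and applies asymmetric margins to the two classes. This is a generic sketch of the standard OC-Softmax formulation; the margin values, scale factor, and class names are typical defaults and assumptions, not the paper's tuned configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OCSoftmax(nn.Module):
    """Sketch of the one-class softmax (OC-Softmax) loss for anti-spoofing:
    bona fide embeddings are pushed above margin m0 in cosine similarity to
    a learned center, while spoofed embeddings are pushed below margin m1."""

    def __init__(self, feat_dim: int, m0: float = 0.9, m1: float = 0.2, alpha: float = 20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, feat_dim))  # single bona fide attractor
        self.m0, self.m1, self.alpha = m0, m1, alpha

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim); labels: (batch,) with 0 = bona fide, 1 = spoof
        score = (F.normalize(x, dim=1) @ F.normalize(self.center, dim=1).t()).squeeze(1)
        # bona fide penalized when score < m0; spoof penalized when score > m1
        margin = torch.where(labels == 0, self.m0 - score, score - self.m1)
        return F.softplus(self.alpha * margin).mean()  # log(1 + exp(.)) averaged over batch
```

SAMO generalizes this idea by maintaining one attractor per speaker; the paper's modification, computing similarity to the nearest speaker attractor during training, would replace the single `center` with a set of per-speaker centers and a max over their scores.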
Datasets

ASVspoof 5 Challenge database (training, development, progress datasets), LibriSpeech, LibriTTS (used to generate additional spoofing data with HiFi-GAN, BigVGAN, and ReflowTTS).
Model(s)

AASIST, HM-Conformer, Hubert, Wav2vec2
Author countries

China