ASVspoof 5: Evaluation of Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

Authors: Xin Wang, Héctor Delgado, Nicholas Evans, Xuechen Liu, Tomi Kinnunen, Hemlata Tak, Kong Aik Lee, Ivan Kukanov, Md Sahidullah, Massimiliano Todisco, Junichi Yamagishi

Published: 2026-01-07 14:01:10+00:00

AI Summary

This paper presents an overview and analysis of the results of the ASVspoof 5 challenge, which focused on detecting speech spoofing, deepfakes, and adversarial attacks using a new large-scale crowdsourced database. The challenge featured two tracks: stand-alone detection and spoofing-robust automatic speaker verification (ASV). Analysis of 53 team submissions shows that detection performance is strong against many spoofing and deepfake attacks but degrades significantly under adversarial attacks and neural codec compression schemes.

Abstract

ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake detection solutions. A significant change from previous challenge editions is a new crowdsourced database collected from a substantially greater number of speakers under diverse recording conditions, with spoofed speech generated using a mix of cutting-edge and legacy generative speech technology. With the new database described elsewhere, this paper provides an overview of the ASVspoof 5 challenge results for the submissions of 53 participating teams. While many solutions perform well, performance degrades under adversarial attacks and the application of neural encoding/compression schemes. Together with a review of post-challenge results, we report a study of score calibration, discuss other principal challenges, and outline a road-map for the future of ASVspoof.


Key findings
The adoption of pre-trained SSL foundation models in the open condition led to substantial performance improvements across all attack types. However, performance degraded significantly when systems encountered adversarial attacks (e.g., Malafide and Malacopula), certain neural encoding/compression schemes (e.g., EnCodec), or legacy attacks (e.g., concatenative MaryTTS synthesis). Furthermore, extensive cross-dataset testing revealed a pervasive lack of generalization, confirming that detection systems tend to overfit to the acoustic characteristics of their training dataset.
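Degradation of this kind is typically quantified per attack condition with metrics such as the equal error rate (EER). The sketch below shows a minimal per-attack EER computation; it is illustrative only, not the official ASVspoof 5 scoring toolkit (which also reports minDCF-style metrics), and it assumes that higher scores indicate bona fide speech.

```python
# Minimal sketch of per-attack EER computation, assuming higher scores mean
# "more likely bona fide". Illustrative only; not the official ASVspoof 5
# scoring toolkit.
import numpy as np

def compute_eer(bona_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: threshold where false-acceptance and miss rates coincide."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona_scores < t).mean() for t in thresholds])    # bona fide rejected
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)

def per_attack_eer(scores: np.ndarray, labels: np.ndarray, attacks: np.ndarray) -> dict:
    """EER for each spoofing attack, pooled against all bona fide trials."""
    bona = scores[labels == "bonafide"]
    spoof_mask = labels == "spoof"
    return {
        atk: compute_eer(bona, scores[spoof_mask & (attacks == atk)])
        for atk in np.unique(attacks[spoof_mask])
    }
```

Comparing per-attack EERs against the pooled EER is what makes the degradation under adversarial, codec-processed, or legacy conditions visible.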
Approach
The paper analyzes solutions submitted to the ASVspoof 5 challenge, which required participants to design countermeasures (CMs) for stand-alone detection (Track 1) or spoofing-robust ASV (Track 2). Top-performing systems typically fused the scores of multiple subsystems, each built from a pre-trained self-supervised learning (SSL) model (e.g., WavLM, wav2vec 2.0) used as an acoustic front-end and a deep classifier back-end, trained with extensive data augmentation (e.g., RawBoost, codec simulation).
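As an illustration of this recipe (and not a reproduction of any team's submission), the sketch below wires a pre-trained SSL encoder loaded through the Hugging Face transformers library to a small placeholder classifier head, then fuses the scores of several such subsystems by simple averaging. The checkpoint name, mean pooling, head dimensions, and equal-weight fusion are all assumptions; augmentations such as RawBoost would be applied to the raw waveforms during training.

```python
# Minimal sketch of the common ASVspoof 5 recipe: pre-trained SSL front-end,
# shallow classifier back-end, and score-level fusion of subsystems.
# Illustrative only; checkpoint choice, head, and fusion weights are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel

class SSLCountermeasure(nn.Module):
    def __init__(self, ssl_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.frontend = AutoModel.from_pretrained(ssl_name)  # SSL encoder (frozen or fine-tuned)
        dim = self.frontend.config.hidden_size
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono
        feats = self.frontend(waveform).last_hidden_state    # (batch, frames, dim)
        pooled = feats.mean(dim=1)                            # simple temporal mean pooling
        return self.head(pooled).squeeze(-1)                  # higher score => more bona fide

def fuse_scores(subsystem_scores: list[torch.Tensor]) -> torch.Tensor:
    # Equal-weight score fusion; submissions often learn fusion weights
    # (e.g., via logistic regression) on a development set instead.
    return torch.stack(subsystem_scores, dim=0).mean(dim=0)
```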
Datasets
ASVspoof 5 (derived from Multilingual LibriSpeech, MLS), VoxCeleb2, ASVspoof 2015, ASVspoof 2019 (LA), ASVspoof 2021 (LA and DF), In-the-Wild (ITW), LibriSpeech, VCTK, Libri-Light.
Model(s)
AASIST, RawNet2, ResNet, ConvViT-Base, WavLM, wav2vec 2.0, GAT, MFA-Res2Net, LSTM, LCNN, GNN, Conformer.
Author countries
Japan, Spain, France, Finland, USA, Hong Kong, Singapore, India