How Well Do Current Speech Deepfake Detection Methods Generalize to the Real World?

Authors: Daixian Li, Jun Xue, Yanzhen Ren, Zhuolin Yi, Yihuan Huang, Guanxiang Feng, Yi Chai

Published: 2026-03-06 03:18:16+00:00

Comment: Submitted to Interspeech 2026

AI Summary

This paper introduces ML-ITW, a multilingual in-the-wild dataset spanning 14 languages, seven social media platforms, and 180 public figures, to evaluate the generalization ability of speech deepfake detection methods in realistic scenarios. By evaluating three detection paradigms (end-to-end neural models, self-supervised feature-based methods, and audio large language models), the study reveals significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited generalization of existing detectors.

Abstract

Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real-world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML-ITW (Multilingual In-The-Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited generalization ability of existing detectors in practical scenarios. The ML-ITW dataset is publicly available.


Key findings

Existing deepfake detection models, despite near-saturated performance on controlled benchmarks such as ASVspoof2019-LA, degrade sharply on the multilingual, multi-platform ML-ITW dataset, with EERs rising to 40-50% (approaching chance level). This degradation indicates that real-world transmission effects and diverse linguistic and acoustic conditions substantially undermine learned decision boundaries, and that neither architectural sophistication nor large-scale pretraining alone guarantees robust generalization.
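The degradation above is reported as equal error rate (EER), the operating point where the false-accept and false-reject rates coincide; an EER of 50% corresponds to random guessing. A minimal sketch of computing EER from raw detector scores, assuming the (hypothetical) convention that higher scores indicate bona fide speech, might look like:

```python
def eer(bonafide_scores, spoof_scores):
    """Equal error rate via a threshold sweep.

    Assumes higher score = more likely bona fide. Sweeps every observed
    score as a candidate threshold, computes the false-accept rate (FAR,
    spoof accepted as real) and false-reject rate (FRR, real rejected),
    and returns the average of the two at the point where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

A well-separated detector (all bona fide scores above all spoof scores) yields an EER of 0.0, while the heavily overlapped score distributions reported on ML-ITW push the EER toward 0.5.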
Approach

The authors introduce the ML-ITW dataset, comprising 28.39 hours of audio collected from multiple social media platforms, languages, and public figures. They then evaluate three categories of existing speech deepfake detection models (end-to-end, self-supervised representation-based, and Audio LLMs) on this new dataset, alongside ASVspoof2019-LA and ITW, to assess their generalization under real-world conditions.

Datasets

ML-ITW, ASVspoof2019-LA, ITW, ASVspoof5, CD-ADD, Codecfake, DFADD, FSW, SpeechFake, SpoofCeleb
Model(s)

LCNN, RawNet2, RawGAT-ST, LibriSeVoc, AASIST, XLSR+AASIST (with XLSR-300M frontend), ML SSLFG, XLSR+SLS, ALLM4ADD, HoliAntiSpoof, FT-GRPO

Author countries

China