Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

Authors: Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni, Hoan My Tran, Joonas Kalda, Artem Fedorchenko, Benoit Fauve, Damien Lolive, Tanel Alumäe, Matthew Magimai Doss

Published: 2025-09-02 22:11:29+00:00

AI Summary

Speech DF Arena is the first comprehensive benchmark for audio deepfake detection, providing a toolkit for uniform evaluation across 14 datasets with standardized metrics and protocols. It includes a leaderboard to compare and rank detection systems, and highlights the need for extensive cross-domain evaluation, given the high error rates many systems show in out-of-domain scenarios.

Abstract

Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, along with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems, helping researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source detection systems, and 3 proprietary ones. Our study shows that many systems exhibit high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face, and a toolkit for reproducing results across the listed datasets is available on GitHub.


Key findings
No single model consistently outperforms the others across all datasets. Many systems show high error rates in out-of-domain scenarios, underscoring the need for robust cross-domain generalization. Mid-sized transformer-based models offer a good balance of accuracy, generalization, and efficiency.
Approach
The researchers created Speech DF Arena, a benchmark platform with a standardized toolkit for evaluating audio deepfake detection systems. The platform evaluates each system on a variety of datasets using standardized metrics (EER, pooled EER, accuracy, F1 score) to provide a comprehensive and reproducible assessment. Pooled EER is computed over the concatenated scores of all evaluation sets rather than by averaging per-dataset EERs, so it reflects how well a single decision threshold transfers across domains (see the sketch below).
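For concreteness, here is a minimal sketch of how EER and pooled EER can be computed from detection scores. This illustrates the metrics only; it is not the Arena toolkit's actual code, and the function names are our own.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where the false-acceptance rate (spoof
    accepted as bona fide) equals the false-rejection rate.
    labels: 1 = bona fide, 0 = spoof; scores: higher = more bona fide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # closest crossing of the two rates
    return (fpr[idx] + fnr[idx]) / 2.0

def pooled_eer(per_dataset):
    """Pooled EER: concatenate scores from every evaluation set and compute
    one EER, so a single threshold must work across all domains.
    per_dataset: iterable of (labels, scores) array pairs, one per set."""
    labels = np.concatenate([l for l, _ in per_dataset])
    scores = np.concatenate([s for _, s in per_dataset])
    return compute_eer(labels, scores)
```

Averaging per-dataset EERs would let each dataset pick its own best threshold; the pooled variant penalizes systems whose score ranges shift between domains, which is exactly the out-of-domain weakness the leaderboard exposes.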
Datasets
ASVspoof 2019, 2021, and 2024; ADD challenges 2022 and 2023 (Tracks 1 and 3, Rounds 1 and 2); CodecFake; LibriSeVoc; SONAR; Fake or Real (FoR); DFADD; In-the-Wild.
Model(s)
XLSR+SLS, TCM, Nes2NetX, Wav2Vec2 AASIST, XLSR-Mamba, Whisper MesoNet, Wav2Vec2 ECAPA, AASIST, WavLM ECAPA, RawGAT-ST, RawNet2, and HuBERT ECAPA; three proprietary systems: Whispeak, Syntra.io, and Resemble AI.
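Most of the open-source systems listed above share a common architecture: a self-supervised speech front-end (wav2vec 2.0/XLSR, WavLM, HuBERT, or Whisper features) feeding a classification back-end (AASIST, ECAPA-TDNN, SLS, Mamba, MesoNet). Below is a minimal, hypothetical sketch of that pattern using the Hugging Face transformers API; the mean-pooled linear head is a placeholder for the much stronger back-ends these systems actually use.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLSpoofDetector(nn.Module):
    """Illustrative SSL front-end + classifier back-end for spoof detection.
    Not any listed system's implementation; the head is a stand-in."""

    def __init__(self, ssl_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)  # XLSR front-end
        hidden = self.ssl.config.hidden_size
        self.head = nn.Sequential(         # placeholder back-end classifier
            nn.Linear(hidden, 128),
            nn.ReLU(),
            nn.Linear(128, 2),             # bona fide vs. spoof logits
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono
        frames = self.ssl(waveform).last_hidden_state  # (batch, T, hidden)
        pooled = frames.mean(dim=1)                    # temporal average pooling
        return self.head(pooled)

# Usage sketch: score 4 seconds of (here, random) audio
model = SSLSpoofDetector().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 64000))
```

The design choice the leaderboard probes is precisely this split: the front-end determines how much acoustic detail survives pooling, while the back-end determines how well spoofing artifacts are separated, and the results suggest mid-sized front-ends with well-designed back-ends generalize best.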
Author countries
Estonia, UAE, Switzerland, France, UK