AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Authors: Yuankun Xie, Haonan Cheng, Jiayi Zhou, Xiaoxuan Guo, Tao Wang, Jian Liu, Weiqiang Wang, Ruibo Fu, Xiaopeng Wang, Hengyan Huang, Xiaoying Huang, Long Ye, Guangtao Zhai

Published: 2026-04-09 12:38:19+00:00

Comment: Accepted to the ACM Multimedia 2026 Grand Challenge

AI Summary

This paper proposes the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge, designed to advance robust and generalizable audio deepfake detection (ADD) technologies. The challenge comprises two tracks: Robust Speech Deepfake Detection, focusing on real-world scenarios and unseen speech generation methods, and All-Type Audio Deepfake Detection, extending beyond speech to diverse audio types like sound, singing, and music. AT-ADD aims to bridge the gap between academic evaluation and practical multimedia forensics by providing standardized datasets, rigorous evaluation protocols, and reproducible baselines.

Abstract

The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.


Key findings
Baseline evaluations demonstrate that SSL-based CMs, particularly FT-XLSR-AASIST, achieve the strongest performance on both challenge tracks, significantly outperforming conventional and ALLM-based baselines. ALLM-based models nonetheless show competitive and well-balanced results across audio types, suggesting potential for unified modeling. Track 2 results, however, highlight remaining challenges in uniform cross-type generalization: detection performance on sound and music lags behind that on speech and singing.
Approach
The paper's main contribution is the proposal of, and detailed evaluation plan for, the AT-ADD Grand Challenge. The challenge is structured into two tracks: Track 1 focuses on robust speech deepfake detection under real-world conditions, covering diverse recording devices, acoustic environments, and signal perturbations, while Track 2 targets all-type audio deepfake detection across speech, sound, singing, and music, emphasizing generalization to unknown audio types and unseen generation methods. The plan defines the tasks, constructs comprehensive datasets, establishes evaluation metrics, and provides baseline models.
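The summary does not restate the challenge's scoring metric, but equal error rate (EER) is the de facto standard in ADD evaluations such as ASVspoof; the following is a minimal sketch of computing EER from detector scores, assuming higher scores indicate bona fide audio:

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the operating point where the false-rejection
    and false-acceptance rates cross. Higher scores = more bona fide."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)            # sweep thresholds low -> high
    labels = labels[order]
    # FRR: fraction of bona fide clips scored at or below each threshold
    frr = np.cumsum(labels) / labels.sum()
    # FAR: fraction of spoofed clips scored strictly above each threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))    # closest crossing point
    return float((frr[idx] + far[idx]) / 2)

# Example with synthetic, well-separated score distributions
rng = np.random.default_rng(0)
print(compute_eer(rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)))
```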
Datasets
The challenge uses two custom-built datasets, AT-ADD Track 1 and AT-ADD Track 2. Real audio is drawn from public sources including AISHELL-3, LibriTTS-R, LJSpeech, Common Voice, AudioCaps, OpenCpop, M4Singer, KiSing, and MusicCaps, supplemented by internally recorded data. Synthetic samples are produced by more than 40 speech deepfake generators and more than 70 general audio deepfake generators, spanning vocoder-based, neural codec-based, diffusion-based, text-to-audio, audio-to-audio, and singing voice conversion paradigms.
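The official data layout is not specified in this summary; as a purely hypothetical illustration, a per-utterance protocol file pairing each clip with its audio type and a binary bonafide/spoof label could be loaded as below (the field names and TSV format are assumptions, not the challenge's actual release format):

```python
import csv
from pathlib import Path

def load_protocol(tsv_path: str) -> list[dict]:
    """Read a hypothetical tab-separated protocol file with columns
    'path', 'type' (speech|sound|singing|music), and 'label'
    (bonafide|spoof); returns one dict per utterance."""
    entries = []
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            entries.append({
                "wav": Path(row["path"]),
                "type": row["type"],
                "label": 1 if row["label"] == "bonafide" else 0,  # 1 = real
            })
    return entries
```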
Model(s)
The paper provides baseline models in three categories: conventional CMs (Spec-ResNet, AASIST), SSL-based CMs (FT-XLSR-AASIST, built on Wav2Vec2-XLSR, and WPT-XLSR-AASIST), and ALLM-based CMs (Qwen2.5-Omni-3B, Qwen2.5-Omni-7B).
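To illustrate the SSL-based design, the sketch below pairs a pretrained XLS-R encoder with a mean-pooled linear head. This is a simplified stand-in: the actual FT-XLSR-AASIST baseline feeds XLSR features into a graph-attention back-end (AASIST) rather than a plain linear layer, and the public facebook/wav2vec2-xls-r-300m checkpoint is assumed here for concreteness:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class XLSRDetector(nn.Module):
    """Simplified SSL-based countermeasure: a fine-tunable XLS-R
    front-end with a mean-pooled linear head producing
    bonafide/spoof logits."""
    def __init__(self, ckpt: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ckpt)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples), raw 16 kHz waveform
        feats = self.encoder(wav).last_hidden_state   # (B, T, H)
        return self.head(feats.mean(dim=1))           # (B, 2) logits

model = XLSRDetector()
logits = model(torch.randn(1, 16000))  # one second of dummy audio
```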
Author countries
China