Environmental Sound Deepfake Detection Challenge: An Overview

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Published: 2025-12-30 11:03:36+00:00

AI Summary

This paper introduces EnvSDD, the first large-scale curated dataset for Environmental Sound Deepfake Detection (ESDD), addressing the limited scale and category diversity of prior resources. It then gives an overview of the ICASSP 2026 ESDD Challenge, which used EnvSDD across two tracks targeting detection robustness against unseen and black-box audio generators, and analyzes the strategies and results of the top-performing systems.

Abstract

Recent progress in audio generation models has made it possible to create highly realistic and immersive soundscapes, which are now widely used in film and virtual-reality-related applications. However, these audio generators also raise concerns about potential misuse, such as producing deceptive audio for fabricated videos or spreading misleading information. Therefore, it is essential to develop effective methods for detecting fake environmental sounds. Existing datasets for environmental sound deepfake detection (ESDD) remain limited in both scale and the diversity of sound categories they cover. To address this gap, we introduced EnvSDD, the first large-scale curated dataset designed for ESDD. Based on EnvSDD, we launched the ESDD Challenge, recognized as one of the ICASSP 2026 Grand Challenges. This paper presents an overview of the ESDD Challenge, including a detailed analysis of the challenge results.


Key findings
Top teams achieved very low Equal Error Rates (EERs), reaching 0.30% in Track 1 (unseen generators) and 0.25% in Track 2 (black-box, low-resource setting), far surpassing the baselines. The results indicate that current deepfake detection methods, especially those built on SSL features and ensembles, generalize effectively to diverse and unseen generation frameworks, such as video-to-audio models.
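Since EER is the challenge's headline metric, a minimal sketch of how it is typically computed from per-clip detection scores may help; the function name, label convention (1 = fake), and use of scikit-learn's roc_curve are illustrative assumptions, not the challenge's official scoring code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: the operating point where the false-acceptance
    rate equals the false-rejection (miss) rate.

    labels: 1 for fake, 0 for real; scores: higher means more likely fake.
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr                        # miss rate at each threshold
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the rates cross
    return (fpr[idx] + fnr[idx]) / 2.0     # average out any residual gap

# Example: perfectly separated scores give an EER of 0.0
print(compute_eer([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 0.0
```

An EER of 0.30% thus means the detector can be thresholded so that only about 3 in 1000 real clips are falsely flagged while missing fakes at the same rate.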
Approach
The most successful approaches combined self-supervised learning (SSL) models such as BEATs or EAT as robust front-ends for feature extraction with specialized back-end classifiers, often derived from AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks); a sketch of this front-end/back-end pattern is given below. Generalization was further improved through targeted data augmentation, domain adversarial training, and model ensembling.
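As a rough illustration of that recipe (not the authors' code), the PyTorch sketch below wires a pretrained SSL encoder to a small classifier; the MLP back-end, the assumed encoder output shape, and the 768-dimensional feature size are simplifying assumptions standing in for the AASIST-style back-ends used by the top systems.

```python
import torch
import torch.nn as nn

class SSLDetector(nn.Module):
    """Hypothetical detector following the dominant challenge recipe:
    an SSL encoder (e.g., BEATs or EAT) as the front-end and a small
    classifier as the back-end. Top systems used AASIST-style
    graph-attention back-ends; the MLP here is a simplified stand-in."""

    def __init__(self, ssl_encoder: nn.Module, feat_dim: int = 768):
        super().__init__()
        self.encoder = ssl_encoder            # pretrained SSL front-end
        self.backend = nn.Sequential(         # stand-in back-end classifier
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),                # real-vs-fake logits
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Assumed encoder contract: (batch, samples) -> (batch, frames, feat_dim)
        feats = self.encoder(waveform)
        pooled = feats.mean(dim=1)            # temporal mean pooling
        return self.backend(pooled)
```

Score-level ensembling then amounts to averaging the per-clip scores of several such detectors trained with different front-ends or augmentation strategies.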
Datasets
EnvSDD, AudioCaps
Model(s)
AASIST, BEATs, EAT, SSLAM, BiCrossMamba-ST
Author countries
Republic of Korea, Australia, Singapore, China