BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge
Authors: Junyi Peng, Lin Zhang, Jin Li, Oldrich Plchot, Jan Cernocky
Published: 2025-12-09 07:32:55+00:00
AI Summary
This paper details the BUT submission to the ESDD 2026 Challenge, focusing on detecting environmental sound deepfakes generated by unseen algorithms. The authors propose a robust ensemble system utilizing diverse pre-trained Self-Supervised Learning (SSL) models (like EAT and BEATs) as front-ends. These features are processed by a Multi-Head Factorized Attention (MHFA) back-end, enhanced by a feature domain augmentation strategy (DSU) to boost generalization performance.
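The feature domain augmentation mentioned above can be illustrated with a minimal sketch. It assumes the commonly described DSU recipe: per-utterance channel statistics are perturbed with Gaussian noise whose scale is the spread of those statistics across the batch. The function names, the `p` probability, and the list-based feature layout (batch of T x D frame matrices) are illustrative assumptions, not the authors' configuration.

```python
import math
import random

def channel_stats(utt, eps=1e-6):
    """Per-channel mean and std over the time axis of one utterance (T x D)."""
    T, D = len(utt), len(utt[0])
    mean = [sum(frame[d] for frame in utt) / T for d in range(D)]
    std = [math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in utt) / T + eps)
           for d in range(D)]
    return mean, std

def batch_spread(stats, eps=1e-6):
    """Per-channel std of a statistic across the batch (the 'uncertainty')."""
    B, D = len(stats), len(stats[0])
    avg = [sum(s[d] for s in stats) / B for d in range(D)]
    return [math.sqrt(sum((s[d] - avg[d]) ** 2 for s in stats) / B + eps)
            for d in range(D)]

def dsu_augment(batch, p=0.5, rng=random):
    """Hypothetical helper: with probability p, resample each utterance's
    feature statistics around their batch-level uncertainty."""
    if rng.random() >= p:
        return batch  # leave the batch untouched
    stats = [channel_stats(u) for u in batch]
    means = [m for m, _ in stats]
    stds = [s for _, s in stats]
    sigma_mu = batch_spread(means)   # uncertainty of the mean
    sigma_std = batch_spread(stds)   # uncertainty of the std
    out = []
    for utt, mean, std in zip(batch, means, stds):
        D = len(mean)
        # Sample new shift (beta) and scale (gamma) per channel.
        beta = [mean[d] + rng.gauss(0.0, 1.0) * sigma_mu[d] for d in range(D)]
        gamma = [std[d] + rng.gauss(0.0, 1.0) * sigma_std[d] for d in range(D)]
        # Normalize with the original statistics, then re-style.
        out.append([[gamma[d] * (frame[d] - mean[d]) / std[d] + beta[d]
                     for d in range(D)] for frame in utt])
    return out
```

In a training pipeline, a step like this would typically be applied to intermediate features during training only and disabled at inference time.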
Abstract
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00%, 4.60%, and 4.80% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00%, 3.52%, and 4.38% across the same partitions.
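For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate (fake clips accepted as real) equals the false rejection rate (real clips rejected). The sketch below is a simple threshold sweep, assuming higher scores mean "more likely real"; the function name and score convention are illustrative and this is not the challenge's official scoring script.

```python
def compute_eer(real_scores, fake_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A clip is accepted as real when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    thresholds.append(float("inf"))  # threshold that rejects everything
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: real clips scoring below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance: fake clips scoring at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one real clip scores low and one fake clip scores high.
real = [0.9, 0.8, 0.2, 0.7]
fake = [0.1, 0.3, 0.75, 0.4]
print(f"EER = {100 * compute_eer(real, fake):.2f}%")
```

An EER of 0.00%, as reported on the Development set, corresponds to real and fake scores that a single threshold separates perfectly.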