BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge

Authors: Junyi Peng, Lin Zhang, Jin Li, Oldřich Plchot, Jan Černocký

Published: 2025-12-09 07:32:55+00:00

AI Summary

This paper details the BUT submission to the ESDD 2026 Challenge, focusing on detecting environmental sound deepfakes generated by unseen algorithms. The authors propose a robust ensemble system utilizing diverse pre-trained Self-Supervised Learning (SSL) models (such as EAT and BEATs) as front-ends. These features are processed by a Multi-Head Factorized Attention (MHFA) back-end and combined with a distribution-uncertainty (DSU) based feature-domain augmentation strategy to boost generalization performance.

Abstract

This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EERs) of 0.00%, 4.60%, and 4.80% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00%, 3.52%, and 4.38% across the same partitions.
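The feature-domain augmentation mentioned in the abstract follows the general idea of distribution uncertainty modeling: perturbing the channel-wise statistics of intermediate features with noise whose scale reflects how much those statistics vary across the batch. The sketch below illustrates this idea under the common DSU formulation; the module name, tensor layout, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class DSU(nn.Module):
    """Feature-domain augmentation via distribution uncertainty modeling.

    During training, the per-instance channel statistics (mean, std) of a
    feature sequence are perturbed with Gaussian noise whose scale is
    estimated from the batch-level variance of those statistics, simulating
    unseen feature-style (e.g. spectral) shifts.
    """

    def __init__(self, p: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.p = p      # probability of applying the perturbation
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) SSL features
        if not self.training or torch.rand(1).item() > self.p:
            return x

        mu = x.mean(dim=1, keepdim=True)                      # (B, 1, C)
        std = (x.var(dim=1, keepdim=True) + self.eps).sqrt()  # (B, 1, C)

        # Uncertainty of the statistics, estimated across the batch.
        sigma_mu = (mu.var(dim=0, keepdim=True) + self.eps).sqrt()
        sigma_std = (std.var(dim=0, keepdim=True) + self.eps).sqrt()

        # Resample new statistics from Gaussians centred at the originals.
        beta = mu + torch.randn_like(mu) * sigma_mu
        gamma = std + torch.randn_like(std) * sigma_std

        # Re-normalize with the original stats, then re-style with the new ones.
        return gamma * (x - mu) / std + beta
```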


Key findings
The ensemble system achieved the best results, with EERs of 3.52% on the Progress set and 4.38% on the Final Evaluation set, a relative improvement of over 63% compared to the baselines. The study also confirmed that general audio SSL models (BEATs, EAT) possess stronger representation capabilities for unstructured environmental sound deepfake detection than speech-specific models (WavLM).
Approach
The approach uses diverse pre-trained Self-Supervised Learning (SSL) models (EAT, BEATs, Dasheng) for hierarchical feature extraction across all transformer layers. These features are then aggregated and classified using a Multi-Head Factorized Attention (MHFA) back-end. Robustness against unseen generators is achieved by integrating Distribution Uncertainty (DSU)-based feature domain augmentation into the MHFA value stream.
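The sketch below shows how such an MHFA back-end with a DSU hook on the value stream could look, following the commonly published MHFA formulation (two sets of learnable layer weights building separate key and value streams, per-head attention over time). The layer dimensions, head count, and where exactly DSU is inserted are assumptions for illustration; the `DSU` module is the one sketched above.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MHFA(nn.Module):
    """Multi-Head Factorized Attentive pooling over layer-wise SSL features.

    Two learnable sets of layer weights build separate key and value streams
    from the stacked transformer-layer outputs; per-head attention weights
    derived from the keys pool the values over time. A feature-domain
    augmentation module (e.g. DSU) can be slotted into the value stream.
    """

    def __init__(self, num_layers: int, feat_dim: int, compress_dim: int = 128,
                 num_heads: int = 32, emb_dim: int = 256, num_classes: int = 2,
                 value_augment: Optional[nn.Module] = None):
        super().__init__()
        self.w_k = nn.Parameter(torch.zeros(num_layers))  # layer weights (keys)
        self.w_v = nn.Parameter(torch.zeros(num_layers))  # layer weights (values)
        self.proj_k = nn.Linear(feat_dim, compress_dim)
        self.proj_v = nn.Linear(feat_dim, compress_dim)
        self.heads = nn.Linear(compress_dim, num_heads)    # attention logits per head
        self.out = nn.Linear(compress_dim * num_heads, emb_dim)
        self.cls = nn.Linear(emb_dim, num_classes)
        self.value_augment = value_augment                 # e.g. DSU()

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, num_layers, time, feat_dim) hidden states
        # from every transformer layer of the SSL front-end.
        k = (layer_feats * F.softmax(self.w_k, dim=0)[None, :, None, None]).sum(dim=1)
        v = (layer_feats * F.softmax(self.w_v, dim=0)[None, :, None, None]).sum(dim=1)

        if self.value_augment is not None:
            v = self.value_augment(v)                      # DSU on the value stream

        k = self.proj_k(k)                                 # (B, T, D)
        v = self.proj_v(v)                                 # (B, T, D)

        att = F.softmax(self.heads(k), dim=1)              # (B, T, H), softmax over time
        pooled = torch.einsum('bth,btd->bhd', att, v)      # per-head pooled values
        emb = self.out(pooled.flatten(1))                  # utterance-level embedding
        return self.cls(emb)                               # bona fide vs. fake logits
```

A typical usage would stack the hidden states returned by the SSL model (e.g. with `output_hidden_states=True` in common transformer APIs) into the `(batch, num_layers, time, feat_dim)` tensor expected above and pass `value_augment=DSU()` only during training.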
Datasets
EnvSDD (official challenge data), AudioSet-2M (AS2M) (used for pre-training/continued pre-training of the SSL front-ends, not for challenge-system training).
Model(s)
BEATs, EAT (Efficient Audio Transformer), Dasheng, WavLM, Multi-Head Factorized Attention (MHFA), MHFA-DSU.
Author countries
Czechia, USA, Hong Kong SAR