Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026

Authors: Candy Olivia Mawalim, Haotian Zhang, Shogo Okada

Published: 2025-12-05 03:37:18+00:00

AI Summary

This paper presents the Nomi team's work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge, focusing on robustness against unseen generators and low-resource scenarios. The team proposes an Audio-Text Cross-Attention (ATCA) model that integrates semantic text captions with acoustic features. Their final ensemble system reduced the equal error rate (EER) relative to the BEATs+AASIST challenge baseline on both tracks.

Abstract

This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset, which consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive EER reductions relative to the challenge baseline (the BEATs+AASIST model).


Key findings
The individual ATCA model achieved competitive performance, notably reducing the EER on Track 1 (unseen generators) from the 13.20% baseline to 11.28%. The ensemble model (ATCA-ens) improved results further, achieving the best EERs of 11.22% on Track 1 and 11.98% on Track 2 (low-resource detection). This indicates that integrating semantic text is beneficial, especially for detecting deepfakes from unseen generative models.
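
The EER figures above refer to the equal error rate, the operating point at which the false-acceptance rate (fakes accepted as real) equals the false-rejection rate (real sounds rejected as fake). Below is a minimal NumPy sketch of the standard threshold-sweep computation from per-utterance scores; this is a generic illustration, not the challenge's official scoring code:

```python
import numpy as np

def compute_eer(bonafide_scores, fake_scores):
    """Equal error rate: point where false-acceptance and false-rejection
    rates cross. Convention: higher score = more likely bona fide."""
    scores = np.concatenate([bonafide_scores, fake_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(fake_scores))])
    order = np.argsort(scores)            # sweep thresholds in score order
    labels = labels[order]
    # FRR: bona fide with score below threshold; FAR: fakes at/above it.
    frr = np.cumsum(labels) / len(bonafide_scores)
    far = 1.0 - np.cumsum(1.0 - labels) / len(fake_scores)
    i = np.argmin(np.abs(far - frr))      # closest crossing point
    return (far[i] + frr[i]) / 2.0
```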
Approach
The core approach is the Audio-Text Cross-Attention (ATCA) model: encoded text captions (generated via audio captioning) serve as keys and values that guide and filter the acoustic features (derived from BEATs and AASIST) serving as queries. The fused features are then processed by a GRU network for detection. The top-performing entry used a stacked regression ensemble combining multiple ATCA variants and the baseline model, with RoBERTa features in the meta-learner stage.
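
The paper's exact layer configuration is not reproduced in this summary; the PyTorch sketch below only illustrates the fusion pattern described above, with audio features as queries against caption-text keys and values and a GRU head on the fused sequence. The dimensions, head count, and upstream extractors (BEATs/AASIST for audio, a caption encoder for text) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ATCAFusion(nn.Module):
    """Audio-Text Cross-Attention sketch: audio frames query text-caption
    embeddings (keys/values); a GRU head scores the fused sequence.
    Feature extractors are assumed to run upstream of this module."""

    def __init__(self, d_audio=768, d_text=768, d_model=256, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(d_audio, d_model)
        self.kv_proj = nn.Linear(d_text, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)   # real-vs-fake logit

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T_audio, d_audio); text_feats: (B, T_text, d_text)
        q = self.q_proj(audio_feats)
        kv = self.kv_proj(text_feats)
        fused, _ = self.attn(q, kv, kv)     # text guides/filters audio frames
        _, h_n = self.gru(fused)            # final hidden state summarizes
        return self.head(h_n[-1]).squeeze(-1)
```

Cross-attention in this direction lets the caption act as a semantic filter: each audio frame attends to the caption tokens most relevant to it before the GRU aggregates the sequence temporally.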
Datasets
EnvSDD, which includes real sounds compiled from UrbanSound8K, TAU UAS 2019 Open Dev, TUT SED 2016, TUT SED 2017, DCASE 2023 Task 7 Dev, and Clotho.
Model(s)
Audio-Text Cross-Attention (ATCA), AASIST, BEATs, Gated Recurrent Unit (GRU), RoBERTa (base), Gradient Boosting, Random Forest, Linear Regressors.
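
The Gradient Boosting, Random Forest, and Linear Regressor entries above are the stacked-ensemble components mentioned in the Approach. A minimal scikit-learn sketch of score-level stacking follows, assuming each base detector has already produced per-utterance scores; the RoBERTa meta-features used in the paper's meta-learner stage are omitted here:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression

# X: one column of scores per base detector (e.g., ATCA variants and the
# BEATs+AASIST baseline); y: 0/1 fake-vs-real labels as regression targets.
rng = np.random.default_rng(0)
X = rng.random((200, 4))                   # placeholder detector scores
y = (rng.random(200) > 0.5).astype(float)  # placeholder labels

stack = StackingRegressor(
    estimators=[("gb", GradientBoostingRegressor()),
                ("rf", RandomForestRegressor())],
    final_estimator=LinearRegression(),    # meta-learner over base outputs
)
stack.fit(X, y)
fused_scores = stack.predict(X)            # fused deepfake scores
```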
Author countries
Japan