EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge

Authors: Xiaoxuan Guo, Hengyan Huang, Jiayi Zhou, Renhe Sun, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang

Published: 2025-12-23 13:54:02+00:00

AI Summary

The paper proposes EnvSSLAM-FFN, a lightweight system for the ESDD 2026 Challenge, which targets environmental sound deepfake detection under unseen-generator and low-resource conditions. The system combines a frozen SSLAM self-supervised encoder with a lightweight FFN back-end, using layer fusion and a class-weighted objective. EnvSSLAM-FFN outperforms the official baselines on both challenge tracks, with Test Equal Error Rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2.

Abstract

Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under two conditions: unseen generators (Track 1) and black-box low-resource detection (Track 2). We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end. To effectively capture spoofing artifacts under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system consistently outperforms the official baselines on both tracks, achieving Test Equal Error Rates (EERs) of 1.20% and 1.05%, respectively.


Key findings
EnvSSLAM-FFN achieved lower EERs than the baseline systems (AASIST and BEATs+AASIST) on both challenge tracks. It reached a Test EER of 1.20% on Track 1 (unseen generators) and 1.05% on Track 2 (black-box low-resource detection), using intermediate layer fusion on top of a frozen encoder.
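For readers unfamiliar with the metric, the EER is the operating point at which the false-positive rate and the false-negative rate are equal. The following is a minimal sketch of one common way to estimate it from detection scores, not the challenge's official scoring script. The function name `compute_eer`, the convention that label 1 means "fake" and that higher scores mean "more likely fake", and the nearest-crossing approximation are all assumptions made for illustration.

```python
def compute_eer(scores, labels):
    """Estimate the Equal Error Rate from scores and binary labels.

    Assumed convention (hypothetical, not from the source):
    label 1 = fake (positive class), label 0 = real,
    and a higher score means "more likely fake".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")

    # Sweep the decision threshold from high to low.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])

    # Threshold above every score: nothing is flagged as fake.
    tp = fp = 0
    best_gap, eer = 1.0, 0.5

    i = 0
    while i < len(pairs):
        # Process tied scores together so they share one threshold.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr = fp / n_neg          # real clips flagged as fake
        fnr = 1.0 - tp / n_pos    # fake clips that were missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Approximate the crossing point by the closest threshold.
            best_gap, eer = gap, (fpr + fnr) / 2.0
        i = j
    return eer
```

With perfectly separated scores this returns 0.0; a reported Test EER of 1.20% corresponds to a return value of 0.012 under this convention.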
Approach
The system utilizes a frozen SSLAM encoder, fusing intermediate representations from layers 4–9 to capture specific spoofing artifacts, followed by a lightweight FFN back-end. Temporal information is aggregated using attentive statistics pooling, and training employs a class-weighted binary cross-entropy loss to mitigate severe label imbalance.
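The three components named above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not say how the layers are fused, so a softmax-weighted sum is assumed here; the attention scorer is reduced to a single hypothetical vector `attn_vector`; and the class weights `pos_weight` and `neg_weight` are placeholder parameters. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, layer_logits):
    """Softmax-weighted sum over encoder layers (assumed fusion operator).

    layer_feats: list of L layer outputs, each a [T][D] list of frames.
    layer_logits: L scalars that would be learned during training.
    """
    weights = softmax(layer_logits)
    n_frames, dim = len(layer_feats[0]), len(layer_feats[0][0])
    return [
        [sum(w * feats[t][d] for w, feats in zip(weights, layer_feats))
         for d in range(dim)]
        for t in range(n_frames)
    ]


def attentive_stats_pool(frames, attn_vector):
    """Attention-weighted mean and standard deviation over time.

    frames: [T][D] list; attn_vector: D scalars scoring each frame
    (a simplified stand-in for a learned attention network).
    Returns a 2*D vector: weighted mean followed by weighted std.
    """
    scores = [sum(a * x for a, x in zip(attn_vector, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    var = [sum(a * (f[d] - mean[d]) ** 2 for a, f in zip(alphas, frames))
           for d in range(dim)]
    std = [math.sqrt(v + 1e-8) for v in var]  # epsilon avoids sqrt(0) issues
    return mean + std


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def class_weighted_bce(logits, labels, pos_weight, neg_weight):
    """Mean binary cross-entropy with a separate weight per class."""
    total = 0.0
    for z, y in zip(logits, labels):
        # -log(sigmoid(z)) == softplus(-z); -log(1 - sigmoid(z)) == softplus(z)
        total += pos_weight * y * softplus(-z) + neg_weight * (1 - y) * softplus(z)
    return total / len(logits)
```

As a usage example, six constant-valued dummy layers standing in for layers 4 to 9, fused with equal logits, yield their plain average; the pooled vector is then twice the feature dimension and would be fed to the FFN back-end. Weighting the rarer class more heavily in `class_weighted_bce` is one standard way to counter label imbalance.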
Datasets
EnvSDD dataset
Model(s)
SSLAM encoder, Feed-Forward Network (FFN), Attentive Statistics Pooling
Author countries
China