Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures
Authors: Pierre Serrano, Raphaël Duroselle, Florian Angulo, Jean-François Bonastre, Olivier Boeffard
Published: 2025-09-15 14:50:21+00:00
AI Summary
This paper addresses the challenge of generalizing audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders to out-of-domain (OOD) conditions. The authors conduct a layer-by-layer analysis of six different SSL models, compare single-layer pooling with multi-head factorized attentive pooling (MHFA), and demonstrate that score-level fusion of several encoders significantly enhances OOD generalization. This approach achieves state-of-the-art performance in OOD conditions with limited training data and no data augmentation.
Abstract
Audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders show a high level of performance when combined with layer-weighted pooling methods, such as multi-head factorized attentive pooling (MHFA). However, they still struggle to generalize to out-of-domain (OOD) conditions. We tackle this problem by studying the behavior of six pre-trained SSL models on four test corpora. We perform a layer-by-layer analysis to determine which layers contribute most. Next, we study the pooling head, comparing a strategy based on a single layer with automatic layer selection via MHFA. We observe that selecting the best layer gives very good results while reducing system parameters by up to 80%. We also observe wide variation in performance across test corpora and SSL models, showing that the pre-training strategy of the encoder plays a role. Finally, score-level fusion of several encoders improves generalization to OOD attacks.
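As an illustration of the score-level fusion mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not state the fusion rule, so the sketch assumes per-system z-normalization followed by a weighted average. The names `z_normalize` and `fuse_scores` are hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Rescale one system's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0.0:
        # A constant-score system carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(system_scores, weights=None):
    """Fuse per-utterance scores from several detection systems.

    system_scores: one list of scores per system, all over the same
    utterances in the same order.
    weights: optional per-system weights; defaults to a plain average.
    The normalization and averaging rule are assumptions for illustration.
    """
    if not system_scores:
        raise ValueError("at least one system is required")
    n_utts = len(system_scores[0])
    if any(len(s) != n_utts for s in system_scores):
        raise ValueError("all systems must score the same utterances")
    if weights is None:
        weights = [1.0 / len(system_scores)] * len(system_scores)
    if len(weights) != len(system_scores):
        raise ValueError("one weight per system is required")

    # Normalize first so that systems with different score ranges
    # contribute on a comparable scale.
    normalized = [z_normalize(s) for s in system_scores]
    return [
        sum(w * sys_scores[i] for w, sys_scores in zip(weights, normalized))
        for i in range(n_utts)
    ]
```

In this sketch, each inner list would hold the scores produced by one frozen SSL encoder with its pooling head. In practice the normalization statistics and any weights would be estimated on held-out development data rather than on the test scores themselves.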