Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures

Authors: Pierre Serrano, Raphaël Duroselle, Florian Angulo, Jean-François Bonastre, Olivier Boeffard

Published: 2025-09-15 14:50:21+00:00

AI Summary

This paper addresses the challenge of generalizing audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders to out-of-domain (OOD) conditions. The authors conduct a layer-by-layer analysis of six different SSL models, compare single-layer pooling with multi-head factorized attentive pooling (MHFA), and demonstrate that score-level fusion of several encoders significantly enhances OOD generalization. This approach achieves state-of-the-art performance in OOD conditions with limited training data and no data augmentation.

Abstract

Audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders show a high level of performance when combined with layer-weighted pooling methods, such as multi-head factorized attentive pooling (MHFA). However, they still struggle to generalize to out-of-domain (OOD) conditions. We tackle this problem by studying the behavior of six different pre-trained SSL encoders on four different test corpora. We perform a layer-by-layer analysis to determine which layers contribute most. Next, we study the pooling head, comparing a strategy based on a single layer with automatic selection via MHFA. We observed that selecting the best layer gave very good results, while reducing system parameters by up to 80%. A wide variation in performance as a function of test corpus and SSL model is also observed, showing that the pre-training strategy of the encoder plays a role. Finally, score-level fusion of several encoders improved generalization to OOD attacks.


Key findings
Intermediate layers consistently provide the most relevant features for audio deepfake detection, outperforming features from the output layer. Selecting an optimal single layer can achieve performance comparable to more complex pooling strategies like MHFA, with significantly reduced parameters. Score-level fusion of several complementary SSL encoders substantially improves generalization to OOD attacks, achieving competitive performance across diverse unseen conditions.
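The single-layer strategy behind these findings can be sketched briefly: pick one encoder layer and mean-pool its frame-level features into a fixed-size utterance embedding. The NumPy sketch below is only an illustration under assumed shapes; the layer count, frame count, feature dimension, and chosen layer index are hypothetical, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen-encoder output: hidden states from every transformer
# layer, shape (num_layers, num_frames, feature_dim). A real system would
# obtain these from a frozen SSL encoder such as WavLM or wav2vec 2.0.
num_layers, num_frames, feature_dim = 12, 200, 768
hidden_states = rng.standard_normal((num_layers, num_frames, feature_dim))

def single_layer_mean_pooling(hidden_states: np.ndarray, layer: int) -> np.ndarray:
    """Select one encoder layer and mean-pool its frames into one embedding."""
    return hidden_states[layer].mean(axis=0)  # shape: (feature_dim,)

# An intermediate layer (index 6 here is an arbitrary example); the paper
# reports that intermediate layers carry the most relevant features.
embedding = single_layer_mean_pooling(hidden_states, layer=6)
print(embedding.shape)  # (768,)
```

The resulting embedding would then feed a lightweight classifier, which is what allows the reported parameter reduction relative to a layer-weighted pooling head.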
Approach
The authors analyze the behavior of six frozen pre-trained SSL encoders on four OOD test corpora, performing a layer-by-layer analysis to identify which layers contribute most. They compare a single-layer mean-pooling strategy with automatic layer selection via multi-head factorized attentive pooling (MHFA) as the classification head. Finally, they employ score-level fusion of multiple SSL-based classifiers to improve generalization to unseen attacks.
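Score-level fusion of several systems can be as simple as a (weighted) average of each classifier's per-trial detection scores. A minimal sketch, assuming scores have already been produced by each SSL-based classifier and are on comparable scales (the helper name, score values, and equal weighting are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Score-level fusion: weighted average of per-system detection scores.

    score_lists: one array per SSL-based classifier, each holding one
    score per trial. With no weights given, this reduces to a simple mean.
    """
    scores = np.stack(score_lists)  # shape: (num_systems, num_trials)
    if weights is None:
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.average(scores, axis=0, weights=weights)

# Hypothetical scores from three encoder-based systems on four trials
s1 = np.array([0.9, 0.2, 0.7, 0.1])
s2 = np.array([0.8, 0.3, 0.6, 0.2])
s3 = np.array([0.7, 0.1, 0.8, 0.3])
fused = fuse_scores([s1, s2, s3])
```

The intuition matching the paper's finding is that encoders with different pre-training strategies make partly uncorrelated errors, so averaging their scores smooths out per-corpus weaknesses and improves OOD generalization.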
Datasets
ASVspoof5-train-train, ASVspoof5-train-dev, ASVspoof5-dev, ASVspoof5-eval, InTheWild, MLAAD v5 + M-AILABS, LlamaPartialSpoof
Model(s)
Wav2vec 2.0 Base, WavLM Base, BEATs, Wav2vec 2.0 XLS-R, WavLM Large, MMS (as SSL backbones); Mean Pooling (MP) and Multi-head factorized attentive pooling (MHFA) (as classification heads).
Author countries
France