Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures

Authors: Pierre Serrano, Raphaël Duroselle, Florian Angulo, Jean-François Bonastre, Olivier Boeffard

Published: 2025-09-15 14:50:21+00:00

AI Summary

This paper addresses the challenge of generalizing audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders to out-of-domain (OOD) conditions. The authors conduct a layer-by-layer analysis of six different SSL models, compare single-layer pooling with multi-head factorized attentive pooling (MHFA), and demonstrate that score-level fusion of several encoders significantly enhances OOD generalization. This approach achieves state-of-the-art performance in OOD conditions with limited training data and no data augmentation.

Abstract

Audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders show a high level of performance when combined with layer-weighted pooling methods, such as multi-head factorized attentive pooling (MHFA). However, they still struggle to generalize to out-of-domain (OOD) conditions. We tackle this problem by studying the behavior of six different pre-trained SSL encoders on four different test corpora. We perform a layer-by-layer analysis to determine which layers contribute most. Next, we study the pooling head, comparing a strategy based on a single layer with automatic selection via MHFA. We observed that selecting the best layer gave very good results, while reducing system parameters by up to 80%. A wide variation in performance as a function of test corpus and SSL model is also observed, showing that the pre-training strategy of the encoder plays a role. Finally, score-level fusion of several encoders improved generalization to OOD attacks.


Key findings
Intermediate layers consistently provide the most relevant features for audio deepfake detection, outperforming features from the output layer. Selecting an optimal single layer can achieve performance comparable to more complex pooling strategies like MHFA, with significantly reduced parameters. Score-level fusion of several complementary SSL encoders substantially improves generalization to OOD attacks, achieving competitive performance across diverse unseen conditions.
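The single-layer strategy behind these findings can be sketched briefly: pick one encoder layer and mean-pool its frame-level features into a fixed-size utterance embedding. The NumPy sketch below is only an illustration under assumed shapes; the layer count, frame count, feature dimension, and chosen layer index are hypothetical, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen-encoder output: hidden states from every transformer
# layer, shape (num_layers, num_frames, feature_dim). A real system would
# obtain these from a frozen SSL encoder such as WavLM or wav2vec 2.0.
num_layers, num_frames, feature_dim = 12, 200, 768
hidden_states = rng.standard_normal((num_layers, num_frames, feature_dim))

def single_layer_mean_pooling(hidden_states: np.ndarray, layer: int) -> np.ndarray:
    """Select one encoder layer and mean-pool its frames into one embedding."""
    return hidden_states[layer].mean(axis=0)  # shape: (feature_dim,)

# An intermediate layer (index 6 here is an arbitrary example); the paper
# reports that intermediate layers carry the most relevant features.
embedding = single_layer_mean_pooling(hidden_states, layer=6)
print(embedding.shape)  # (768,)
```

The resulting embedding would then feed a lightweight classifier, which is what allows the reported parameter reduction relative to a layer-weighted pooling head.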
Approach
The authors analyze the behavior of six frozen pre-trained SSL encoders on four OOD test corpora, performing a layer-by-layer analysis to identify which layers contribute most. They compare a single-layer mean-pooling strategy with automatic layer selection via multi-head factorized attentive pooling (MHFA) as the classification head. Finally, they employ score-level fusion of multiple SSL-based classifiers to improve generalization to unseen attacks.
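Score-level fusion of several systems can be as simple as a (weighted) average of each classifier's per-trial detection scores. A minimal sketch, assuming scores have already been produced by each SSL-based classifier and are on comparable scales (the helper name, score values, and equal weighting are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Score-level fusion: weighted average of per-system detection scores.

    score_lists: one array per SSL-based classifier, each holding one
    score per trial. With no weights given, this reduces to a simple mean.
    """
    scores = np.stack(score_lists)  # shape: (num_systems, num_trials)
    if weights is None:
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.average(scores, axis=0, weights=weights)

# Hypothetical scores from three encoder-based systems on four trials
s1 = np.array([0.9, 0.2, 0.7, 0.1])
s2 = np.array([0.8, 0.3, 0.6, 0.2])
s3 = np.array([0.7, 0.1, 0.8, 0.3])
fused = fuse_scores([s1, s2, s3])
```

The intuition matching the paper's finding is that encoders with different pre-training strategies make partly uncorrelated errors, so averaging their scores smooths out per-corpus weaknesses and improves OOD generalization.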
Datasets
ASVspoof5-train-train, ASVspoof5-train-dev, ASVspoof5-dev, ASVspoof5-eval, InTheWild, MLAAD v5 + M-AILABS, LlamaPartialSpoof
Model(s)
Wav2vec 2.0 Base, WavLM Base, BEATs, Wav2vec 2.0 XLS-R, WavLM Large, MMS (as SSL backbones); Mean Pooling (MP) and Multi-head factorized attentive pooling (MHFA) (as classification heads).
Author countries
France