DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Authors: Yassine El Kheir, Arnab Das, Yixuan Xiao, Xin Wang, Feidi Kallel, Enes Erdem Erdogan, Ngoc Thang Vu, Tim Polzehl, Sebastian Moeller

Published: 2026-04-09 16:47:18+00:00

Comment: DeepFense Toolkit

AI Summary

This paper introduces DeepFense, a comprehensive, open-source PyTorch toolkit designed for robust speech deepfake detection, integrating state-of-the-art architectures, loss functions, and augmentation pipelines. Through a large-scale evaluation of over 400 models, the authors demonstrate that the choice of pre-trained front-end feature extractor significantly impacts performance, and that high-performing models often exhibit severe biases regarding audio quality, speaker gender, and language. DeepFense aims to facilitate reproducible research and address challenges in real-world deployment by providing tools for equitable training data selection and front-end fine-tuning.

Abstract

Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment by providing the tools needed for equitable training data selection and front-end fine-tuning.


Key findings

The choice of pre-trained front-end feature extractor is the most critical factor influencing detection performance: Wav2Vec2 generally leads for speech, while EAT excels in environmental sound deepfake detection thanks to its specialized pre-training. By contrast, the back-end classifier has a negligible impact. The training dataset also strongly affects generalization: CodecFake proves highly transferable, whereas ADD23 leads to catastrophic generalization failure. Finally, high-performing models often show severe biases across audio quality, speaker gender, and language, underscoring an urgent need to prioritize fairness in training data selection.
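As an illustration of the kind of per-group breakdown such a bias analysis implies, the sketch below splits detection accuracy by a metadata attribute such as speaker gender or language. The function name, label convention, and threshold are hypothetical and do not reflect DeepFense's actual API.

```python
from collections import defaultdict

def per_group_accuracy(scores, labels, groups, threshold=0.5):
    """Break detection accuracy down by a metadata attribute
    (e.g. speaker gender or language) to surface group-level biases.
    Inputs are parallel lists: spoof scores in [0, 1], binary labels
    (1 = spoof), and a group tag per utterance. Illustrative only."""
    hits, totals = defaultdict(int), defaultdict(int)
    for score, label, group in zip(scores, labels, groups):
        totals[group] += 1
        # Count a hit when the thresholded prediction matches the label.
        hits[group] += int((score >= threshold) == bool(label))
    return {group: hits[group] / totals[group] for group in totals}

# e.g. per_group_accuracy(scores, labels, ["female", "male", "female"])
```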

Approach

The authors developed DeepFense, a unified, modular, and extensible PyTorch toolkit for deepfake audio detection. This framework allows for systematic, large-scale comparisons of different front-ends, back-ends, and training data by providing standardized implementations, over 100 recipes, and a unified training-evaluation pipeline. The toolkit's design enables researchers to easily experiment with various components and conduct comprehensive fairness analyses.
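To make the front-end/back-end split concrete, here is a minimal PyTorch sketch of that modular composition, assuming a Hugging Face Wav2Vec 2.0 encoder and a simple MLP back-end; the class names are illustrative and do not reflect DeepFense's actual API.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class FrontEnd(nn.Module):
    """Wraps a pre-trained SSL feature extractor (here Wav2Vec 2.0)."""
    def __init__(self, name="facebook/wav2vec2-base", freeze=True):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(name)
        if freeze:
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, wav):  # wav: (batch, samples) at 16 kHz
        return self.encoder(wav).last_hidden_state  # (batch, frames, dim)

class MLPBackEnd(nn.Module):
    """Mean-pools frame features and classifies bonafide vs. spoof."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, feats):
        return self.net(feats.mean(dim=1))  # (batch, 2) logits

class Detector(nn.Module):
    """Composes any front-end with any back-end."""
    def __init__(self, front_end, back_end):
        super().__init__()
        self.front_end, self.back_end = front_end, back_end

    def forward(self, wav):
        return self.back_end(self.front_end(wav))

detector = Detector(FrontEnd(), MLPBackEnd())
logits = detector(torch.randn(2, 16000))  # two 1-second dummy waveforms
```

Swapping the front-end for WavLM or HuBERT, or the back-end for AASIST, then amounts to replacing one module while the rest of the pipeline stays fixed, which is what enables the paper's controlled comparisons.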

Datasets

ASVspoof 2019 (ASV19), ASVspoof 2021 LA (LA21), ASVspoof 2021 DF (DF21), ASVspoof 5 (ASV5), Audio Deepfake Detection (ADD) 2022-1, ADD 2022-2, ADD 2023-1, ADD 2023-2, In-the-Wild (ITW), MLAAD, CodecFake, HABLA, PartialSpoof, ODSS, ReplayDF, EnvSDD, CompSpoof, FakeMusicCaps, CtrSVDD.

Model(s)

Front-ends: Wav2Vec 2.0, WavLM, HuBERT, EAT, MERT, Whisper, BEATs, Wav2Vec2-BERT. Back-ends: AASIST, ECAPA-TDNN, RawNet2, Nes2Net, TCM, Multi-layer perceptron (MLP), Pool, BiCrossMamba-ST.
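Conceptually, the "over 100 recipes" pair these components with a training set and augmentation choices. A hypothetical recipe schema, sketched below, shows how a systematic grid of such combinations could be expressed; the toolkit's real configuration format may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Recipe:
    """Hypothetical recipe schema; DeepFense's real configs may differ."""
    front_end: str                      # e.g. "wav2vec2", "wavlm", "eat"
    back_end: str                       # e.g. "aasist", "ecapa_tdnn", "mlp"
    train_data: str                     # e.g. "ASV19", "CodecFake"
    augmentations: List[str] = field(default_factory=list)

# A systematic front-end/back-end comparison is then a grid of recipes:
grid = [Recipe(fe, be, "ASV19")
        for fe in ("wav2vec2", "wavlm", "hubert")
        for be in ("aasist", "mlp")]
```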

Author countries

Germany, Japan