Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative

Authors: Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen

Published: 2025-08-12 19:15:13+00:00

AI Summary

Fake-Mamba is a real-time speech deepfake detection framework that replaces self-attention with bidirectional Mamba, a state-space model. It introduces three efficient encoders (TransBiMamba, ConBiMamba, and PN-BiMamba), achieving state-of-the-art performance on ASVspoof and In-The-Wild benchmarks while maintaining real-time inference.

Abstract

Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR's rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.


Key findings
PN-BiMamba consistently outperforms the other two proposed encoders. Fake-Mamba achieves state-of-the-art EERs of 0.97% on ASVspoof 2021 LA, 1.74% on ASVspoof 2021 DF, and 5.85% on In-The-Wild, improving on XLSR-Conformer and XLSR-Mamba. The model maintains real-time inference across varying utterance lengths.
Approach
Fake-Mamba uses an XLSR front-end for feature extraction, followed by one of three proposed bidirectional Mamba encoders. These encoders capture both local and global artifacts, leveraging the efficiency and global receptive field of Mamba. Utterance-level pooling and a multi-layer perceptron classifier produce the final deepfake prediction.
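The pipeline above can be illustrated with a toy sketch. The scalar linear recurrence, the parameter values, and the threshold classifier are all illustrative assumptions standing in for the actual XLSR features, selective-state-space (Mamba) blocks, and MLP head; only the overall flow (bidirectional scan → utterance-level pooling → classification) mirrors the description.

```python
# Toy sketch of the Fake-Mamba pipeline flow. The scalar recurrence
# h_t = a*h_{t-1} + b*x_t is an illustrative stand-in for a Mamba
# block, NOT the paper's actual selective-state-space computation.

def ssm_scan(xs, a=0.9, b=0.5):
    """Causal linear recurrence over a sequence: each output sees
    only past context."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out

def bidirectional_scan(xs, a=0.9, b=0.5):
    """Run the scan forward and backward, then sum the two streams,
    so every position sees both past and future context (the 'Bi'
    in BiMamba)."""
    fwd = ssm_scan(xs, a, b)
    bwd = ssm_scan(xs[::-1], a, b)[::-1]
    return [f + w for f, w in zip(fwd, bwd)]

def classify(features, threshold=1.0):
    """Utterance-level mean pooling followed by a stand-in for the
    MLP classifier (hypothetical threshold)."""
    pooled = sum(features) / len(features)
    return "spoof" if pooled > threshold else "bonafide"

frames = [0.2, 1.5, 0.1, 2.0, 0.3]  # stand-in for XLSR frame features
feats = bidirectional_scan(frames)
label = classify(feats)
```

Because each direction of the scan is a single linear-time pass, the cost grows linearly with utterance length, which is the efficiency argument for replacing quadratic self-attention in a real-time detector.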
Datasets
ASVspoof 2019 LA (training), ASVspoof 2021 LA, ASVspoof 2021 DF, In-The-Wild
Model(s)
XLSR (front-end), TransBiMamba, ConBiMamba, PN-BiMamba (bidirectional Mamba encoders)
Author countries
Finland, USA, China, Canada, Taiwan