XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

Authors: Kwok-Ho Ng, Tingting Song, Yongdong Wu, Zhihua Xia

Published: 2026-01-06 11:41:05+00:00

AI Summary

XLSR-MamBo is a novel modular framework for Audio Deepfake Detection (ADD) that integrates a pre-trained XLSR front-end with hybrid State Space Model (SSM) and Attention backbones. The framework systematically explores four topological designs and scaling depths using advanced SSM variants like Mamba, Mamba2, Hydra, and Gated DeltaNet. The best configuration, MamBo-3-Hydra-N3, achieves competitive performance on ASVspoof 2021 LA, DF, and cross-dataset benchmarks, demonstrating robust generalization.

Abstract

Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior work. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion- and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results demonstrate the hybrid framework's ability to capture artifacts in spoofed speech signals, providing an effective method for ADD.
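The abstract contrasts Hydra's native bidirectionality with the heuristic dual-branch strategy of earlier SSM-based detectors, where a causal model is run over both the input and its time-reversed copy and the two outputs are fused. A minimal PyTorch sketch of that heuristic follows; the module names and the additive fusion are illustrative choices of ours, not the paper's implementation:

```python
import torch
import torch.nn as nn

class DualBranchBiSSM(nn.Module):
    """Heuristic bidirectional wrapper: run one causal sequence module
    forward and a second on the time-reversed input, then fuse.
    Class name and additive fusion are illustrative assumptions."""
    def __init__(self, fwd_ssm: nn.Module, bwd_ssm: nn.Module):
        super().__init__()
        self.fwd_ssm = fwd_ssm  # any causal module mapping (B, T, D) -> (B, T, D)
        self.bwd_ssm = bwd_ssm

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        y_fwd = self.fwd_ssm(x)
        y_bwd = self.bwd_ssm(torch.flip(x, dims=[1]))  # scan the reversed sequence
        return y_fwd + torch.flip(y_bwd, dims=[1])     # re-align in time and fuse
```

Hydra instead builds bidirectionality into a single layer, avoiding the second branch, the reversal, and the post-hoc fusion, which is the efficiency advantage the abstract highlights.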


Key findings
The MamBo-3-Hydra-N3 configuration achieved competitive performance, including EERs of 0.81% on ASV21LA and 4.97% on ITW, outperforming several contemporary state-of-the-art single systems. The Hydra variant, with its native bidirectional modeling, generalized strongly to unseen diffusion- and flow-matching-based synthesis methods on DFADD. Crucially, increasing backbone depth effectively mitigated the high performance variance and inference instability observed in shallower models.
Approach
The model uses the XLSR architecture as a front-end feature extractor to generate high-level speech representations. These features are processed by the MamBo back-end, which alternates or combines SSM blocks (e.g., Hydra) and Attention layers to leverage their complementary strengths: SSMs model local temporal structure efficiently, while attention captures global spectral artifacts. The framework explores backbone depth scaling (N) and four distinct hybrid layer topologies to optimize the detection of forgery traces.
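To make the layer interleaving concrete, here is a minimal PyTorch sketch of one plausible hybrid topology. Everything below is an illustrative assumption rather than the paper's code: SimpleSSMBlock is a gated-convolution stand-in for a real Mamba/Hydra block, and the interleaved [SSM, Attention] x N pattern is just one of the four topologies the paper explores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSSMBlock(nn.Module):
    """Stand-in for a Mamba/Hydra block: a gated causal depthwise-conv
    mixer with a residual connection. A real system would use an actual
    SSM implementation (e.g., the mamba_ssm package)."""
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, d_conv,
                              padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        T = h.shape[1]
        h = self.conv(h.transpose(1, 2))[..., :T].transpose(1, 2)  # causal conv
        return x + self.out_proj(F.silu(h) * torch.sigmoid(gate))

class AttentionBlock(nn.Module):
    """Pre-norm self-attention layer for global (content-based) mixing."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridBackbone(nn.Module):
    """One plausible interleaved topology: N repetitions of
    [SSM block -> Attention block], followed by mean pooling and a
    binary bonafide/spoof classification head."""
    def __init__(self, d_model: int = 256, depth_n: int = 3):
        super().__init__()
        blocks = []
        for _ in range(depth_n):
            blocks += [SimpleSSMBlock(d_model), AttentionBlock(d_model)]
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Linear(d_model, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: frame-level features from a front-end like XLSR, (B, T, D)
        h = self.blocks(feats)
        return self.head(h.mean(dim=1))

# Usage with dummy XLSR-like features:
logits = HybridBackbone()(torch.randn(2, 200, 256))  # -> (2, 2)
```

With depth_n = 3 this mirrors the N3 depth setting named in MamBo-3-Hydra-N3, though the actual block composition of that configuration is defined in the paper.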
Datasets
ASVspoof 2019 LA (ASV19LA), ASVspoof 2021 LA (ASV21LA), ASVspoof 2021 DF (ASV21DF), In-the-Wild (ITW), DFADD.
Model(s)
XLSR, Mamba, Mamba2, Hydra, Gated DeltaNet (GDN), hybrid SSM-Attention architectures (MamBo variants).
Author countries
China