WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Authors: Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen

Published: 2025-10-06 19:17:18+00:00

AI Summary

WaveSP-Net is a novel, parameter-efficient architecture for speech deepfake detection that combines a Partial-WSPT-XLSR front-end with a bidirectional Mamba back-end. The design uses learnable wavelet filters to produce sparse, multi-resolution prompt embeddings, improving the localization of synthetic artifacts without fine-tuning the frozen XLSR backbone. The approach achieves state-of-the-art performance on challenging benchmarks while keeping the trainable parameter count low.

Abstract

Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with low trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.


Key findings
WaveSP-Net achieved state-of-the-art performance, recording EERs of 10.58% on Deepfake-Eval-2024 and 0.13% on SpoofCeleb, significantly outperforming competing systems while training only 1.298% of the total parameters. The combination of learnable wavelet filters with Wavelet Domain Sparsification (WDS) was crucial: ablating WDS caused the largest performance degradation.
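The headline numbers above are Equal Error Rates. As a refresher, the EER is the operating point where the false-accept rate (spoof scored as bona fide) equals the false-reject rate (bona fide scored as spoof). The sketch below shows one common way to compute it from detection scores; it is an illustrative implementation, not the paper's evaluation code.

```python
import numpy as np

def compute_eer(scores_bonafide, scores_spoof):
    """Equal Error Rate: find the threshold where the false-accept rate
    (spoof accepted as bona fide) matches the false-reject rate
    (bona fide rejected), and return their average at that point."""
    thresholds = np.sort(np.concatenate([scores_bonafide, scores_spoof]))
    far = np.array([(scores_spoof >= t).mean() for t in thresholds])
    frr = np.array([(scores_bonafide < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest crossing of the two curves
    return (far[idx] + frr[idx]) / 2

# Perfectly separated scores give an EER of 0.0:
bona = np.array([0.9, 0.8, 0.7])
spoof = np.array([0.1, 0.2, 0.3])
print(compute_eer(bona, spoof))  # -> 0.0
```

In practice, production toolkits interpolate between thresholds on the ROC curve for a smoother estimate, but the crossing-point logic is the same.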
Approach
The method applies Parameter-Efficient Fine-Tuning (PEFT): the XLSR feature extractor is frozen, and a sparse subset of prompt tokens is optimized through learnable wavelet transforms (Partial-WSPT). This injects multi-resolution, sparse representations of synthetic artifacts into the prompt embeddings. The extracted features are then passed to a bidirectional Mamba-based classifier for final detection.
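To make the wavelet-domain sparsification idea concrete, here is a minimal sketch of the transform-sparsify-invert pattern on a single prompt embedding. It uses a fixed one-level Haar wavelet and top-k magnitude thresholding; the paper's Partial-WSPT instead learns the wavelet filters and selects which prompt tokens to optimize, so the function names and the `keep_ratio` parameter below are illustrative assumptions, not the authors' API.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar discrete wavelet transform (fixed, non-learnable filters).
    Assumes len(x) is even."""
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)  # low-frequency coefficients
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high-frequency coefficients
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse one-level Haar transform (exact, since Haar is orthonormal)."""
    x = np.empty(approx.size * 2)
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def wavelet_domain_sparsify(prompt, keep_ratio=0.25):
    """Sketch of wavelet-domain sparsification: transform the prompt embedding,
    zero all but the largest-magnitude coefficients, then invert. The surviving
    coefficients concentrate the multi-resolution structure of the signal."""
    approx, detail = haar_dwt(prompt)
    coeffs = np.concatenate([approx, detail])
    k = max(1, int(keep_ratio * coeffs.size))
    mask = np.zeros(coeffs.size, dtype=bool)
    mask[np.argsort(np.abs(coeffs))[-k:]] = True  # keep top-k by magnitude
    coeffs = np.where(mask, coeffs, 0.0)
    half = coeffs.size // 2
    return haar_idwt(coeffs[:half], coeffs[half:])
```

With `keep_ratio=1.0` the round trip reconstructs the input exactly (the Haar transform is orthonormal); smaller ratios yield a sparse wavelet-domain code, which is the property WDS exploits to isolate localized artifacts. In the actual model the filters are trained jointly with the prompt tokens rather than fixed.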
Datasets
Deepfake-Eval-2024, SpoofCeleb
Model(s)
XLSR (frozen backbone), Partial-WSPT-XLSR (prompt tuning front-end), Bidirectional Mamba (classifier)
Author countries
Finland, Japan, China, Taiwan, Canada