WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Authors: Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen

Published: 2025-10-06 19:17:18+00:00

Comment: Submitted to ICASSP 2026

AI Summary

This paper introduces WaveSP-Net, a novel parameter-efficient front-end for speech deepfake detection that combines prompt-tuning with classical signal processing transforms. Specifically, it proposes a Partial-WSPT-XLSR front-end that uses learnable wavelet filters to inject multi-resolution features into the prompt embeddings of a frozen XLSR model, paired with a Mamba-based back-end. WaveSP-Net achieves state-of-the-art performance on challenging benchmarks while keeping the trainable parameter count low.

Abstract

Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with low trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.


Key findings
WaveSP-Net significantly outperforms state-of-the-art models on Deepfake-Eval-2024, achieving a 10.58% EER (a 10.72% relative improvement over XLS-R-1B), and on SpoofCeleb, achieving a 0.13% EER (a 13.33% relative improvement over WPT-XLSR). The approach is highly parameter-efficient, training only 1.298% of the total parameters, and demonstrates superior generalization and discriminative feature learning through wavelet-domain sparsification and learnable filters.
Approach
WaveSP-Net consists of a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. The front-end enhances prompt embeddings by applying learnable wavelet decomposition, wavelet-domain sparsification, and learnable wavelet reconstruction to a sparse subset of prompt tokens, injecting multi-resolution features without altering the frozen XLSR parameters. The enhanced features are then fed into the Mamba classifier for detection.
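The paper's front-end uses learnable wavelet filters inside the model; as a simplified, self-contained illustration of the decompose → sparsify → reconstruct pipeline it describes, the sketch below applies a fixed single-level Haar wavelet with soft-threshold sparsification to one prompt-token vector. All function names and the threshold value are illustrative assumptions, not taken from the paper's code:

```python
def haar_decompose(x):
    # Single-level Haar DWT: split a vector into approximation (low-pass)
    # and detail (high-pass) coefficients. In the paper these filters are
    # learnable; here they are fixed for illustration.
    approx = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return approx, detail

def soft_threshold(coeffs, t):
    # Wavelet-domain sparsification: shrink coefficients toward zero,
    # zeroing out those with magnitude below the threshold t.
    return [max(abs(c) - t, 0.0) * (1.0 if c >= 0 else -1.0) for c in coeffs]

def haar_reconstruct(approx, detail):
    # Inverse single-level Haar DWT (exact round-trip when t = 0).
    x = []
    for a, d in zip(approx, detail):
        x.extend([a + d, a - d])
    return x

def wavelet_sparsify_token(token, threshold=0.1):
    # Decompose -> sparsify detail coefficients -> reconstruct, as applied
    # per prompt token (illustrative; the actual method operates on a
    # sparse subset of prompt embeddings with learnable filters).
    approx, detail = haar_decompose(token)
    detail = soft_threshold(detail, threshold)
    return haar_reconstruct(approx, detail)
```

With threshold 0 the reconstruction is exact; a larger threshold suppresses fine-scale detail, leaving only the smoothed approximation, which is the sparsification effect the paper exploits to localize subtle synthetic artifacts.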
Datasets
Deepfake-Eval-2024, SpoofCeleb
Model(s)
XLSR (as a frozen backbone), Mamba (back-end classifier), Learnable Wavelet Transforms, Prompt Tuning
Author countries
Finland, Japan, China, Taiwan, Canada