Authors: Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti
Published: 2026-04-30 21:32:40+00:00
Comment: Accepted to ICML 2026
AI Summary
This paper introduces Alethia, a novel foundational audio encoder designed for various voice deepfake detection and localization tasks. It proposes a new pretraining recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. Alethia significantly outperforms state-of-the-art speech foundation models across 56 benchmark datasets, demonstrating superior robustness to real-world perturbations and zero-shot generalization to unseen domains like singing deepfakes.
Abstract
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks with 56 benchmark datasets, and find that Alethia significantly outperforms state-of-the-art SFMs, with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
Key findings
Alethia consistently achieved lower Equal Error Rates (EER) and higher accuracy across 56 benchmark datasets compared to existing state-of-the-art speech foundation models, especially on challenging in-the-wild and perturbed deepfakes. It demonstrated superior zero-shot generalization to unseen domains, such as singing voice deepfakes and partially fake speech localization. Ablation studies confirmed that both the generative flow-matching objective and the bottleneck architecture were crucial for effectively capturing deepfake artifacts.
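For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of one common way to approximate it by sweeping thresholds over the observed scores. It is not the paper's evaluation code; the function name `equal_error_rate` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the system considers the audio
    more likely to be bona fide (genuine).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not results from the paper).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(bonafide, spoof))
```

A lower EER means the bona fide and spoofed score distributions overlap less, which is why it is reported alongside accuracy.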
Approach
The approach pretrains an audio encoder, named Alethia, with a dual objective. It combines bottleneck masked embedding prediction, in which the encoder predicts continuous multi-layer embeddings produced by a frozen teacher model, with flow-matching based spectrogram reconstruction, which recovers the original (unmasked) spectrogram. This recipe is designed to capture discriminative deepfake artifacts more effectively than traditional discrete token prediction.
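The summary does not give the exact loss, so the following is only a sketch of how such a dual objective is commonly written, using the standard linear-path conditional flow-matching formulation. All symbols are assumptions introduced here for illustration: $x_1$ is the clean spectrogram, $x_0$ is Gaussian noise, $h$ is the encoder output for the masked input, $z_i$ is the teacher embedding at frame $i$, $\mathcal{M}$ is the set of masked frames, $g_\phi$ is a prediction head, $v_\theta$ is the velocity network, and $\lambda$ is a hypothetical weighting term. The squared-error distance in the embedding term is likewise an assumption.

```latex
% Masked embedding prediction against a frozen teacher (distance assumed to be L2)
\mathcal{L}_{\text{emb}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}}
    \left\lVert g_\phi(h_i) - z_i \right\rVert_2^2

% Linear interpolation path between noise x_0 and the clean spectrogram x_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad
x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% Flow-matching loss: regress the velocity x_1 - x_0, conditioned on h
\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, x_0,\, x_1}
    \left\lVert v_\theta(x_t, t, h) - (x_1 - x_0) \right\rVert_2^2

% Combined pretraining objective (lambda is a hypothetical weight)
\mathcal{L} = \mathcal{L}_{\text{emb}} + \lambda\, \mathcal{L}_{\text{FM}}
```

Under this reading, the first term keeps the targets continuous rather than quantized into discrete tokens, while the second forces the encoder output $h$ to retain enough detail to regenerate the spectrogram.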
Datasets
For pretraining, 19k hours of quality-controlled speech data were used, comprising self-curated data (CommonVoice augmented with text-to-speech (TTS) and voice conversion (VC)) and public deepfake data from ASVspoof5 (train/dev), MLAAD, M-AILABS, TITW-hard, SpoofCeleb (train/dev), and ShiftySpeech. For evaluation, 56 benchmark datasets were used across five tasks: Speech Deepfake Detection (SDD; 50 datasets, including the ASVspoof series, Deepfake-Eval-2024, and Codecfake), Singing Voice Deepfake Detection (SVDD; CtrSVDD), Partially Fake Speech Localization (PFSL; PartialSpoof, Half-Truth, and LlamaPartialSpoof), Source Tracing (ST; ASVspoof5-ST), and Audio-Visual Deepfake Detection (AVDD; FakeAVCeleb and PolyGlotFake).
Model(s)
The proposed model is Alethia, developed in two sizes: Alethia-Base (400M parameters) and Alethia-Large (1B parameters). Its architecture consists of a 7-layer CNN feature extractor followed by a 24-layer (Base) or 48-layer (Large) transformer encoder. WavLM-Large and Wav2vec-XLSR-1B were used as frozen teacher models during pretraining. For downstream tasks, a 2-layer MLP classifier is attached to the Alethia encoder. Baselines for comparison included HuBERT-Large, WavLM-Large, Wav2vec-XLSR-300M, and Wav2vec-XLSR-1B.
Author countries
Canada