Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching

Authors: Xuechen Liu, Xin Wang, Junichi Yamagishi

Published: 2025-09-26 00:55:45+00:00

AI Summary

This paper proposes a training-free retrieval-augmented framework for detecting zero-day audio deepfakes, addressing the challenge of novel synthesis methods unseen during training. The framework leverages knowledge representations and voice profile matching through retrieval and ensemble methods. It achieves performance comparable to supervised and fine-tuned baselines on the DeepFake-Eval-2024 benchmark without requiring additional model training.

Abstract

Modern audio deepfake detectors built on foundation models and large training datasets achieve promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that the models have not seen in their existing training data. Conventional approaches fine-tune the detector, which can be problematic when a prompt response is needed. This paper proposes a training-free retrieval-augmented framework for zero-day audio deepfake detection that leverages knowledge representations and voice profile matching. Within this framework, we propose simple yet effective retrieval and ensemble methods that reach performance comparable to supervised baselines and their fine-tuned counterparts on the DeepFake-Eval-2024 benchmark, without any additional model training. We also conduct an ablation on voice profile attributes, and demonstrate the cross-database generalizability of the framework by introducing simple, training-free fusion strategies.


Key findings
The retrieval augmentation methods, particularly with the majority voting and ratio-based scoring ensembles, are effective against zero-day attacks, achieving performance comparable to fine-tuned models without additional training. Hybrid retrieval, which combines countermeasure (CM) and profile features, shows robust performance, with voice quality being a more impactful profile attribute. However, the framework's training-free approach, while strong in-domain, can be affected by domain mismatch in cross-database scenarios.
Approach
The proposed approach is a training-free retrieval-augmented framework built around a knowledge database that stores, for each entry, a CM feature vector, a profile feature vector, a ground-truth label, and a prediction score. For a query, it retrieves the k nearest neighbors by cosine similarity over CM, profile, or hybrid features, then aggregates the retrieved entries with an ensemble method (Majority Voting, Ratio-based Scoring, or Score-level Averaging) to produce the final prediction.
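
The retrieve-then-aggregate flow described here can be illustrated with a minimal sketch. It is not the authors' implementation: the entry layout (keys "cm", "profile", "label", "score"), the function names, the use of plain concatenation for hybrid features, and the exact form of each ensemble rule are all assumptions made for illustration, as is the convention that a higher score means "more likely fake". The toy vectors are invented and carry no experimental meaning.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def features(entry, key):
    """Select the feature vector used for retrieval.

    Assumption: "hybrid" is modeled as simple concatenation of the CM and
    profile vectors; the source does not specify how the two are combined.
    """
    if key == "hybrid":
        return entry["cm"] + entry["profile"]
    return entry[key]


def retrieve(query, database, k, key="cm"):
    """Return the k database entries most similar to the query by cosine similarity."""
    q = features(query, key)
    ranked = sorted(database, key=lambda e: cosine(q, features(e, key)), reverse=True)
    return ranked[:k]


def majority_vote(neighbors):
    """Assumed rule: predict the most frequent ground-truth label among the neighbors."""
    return Counter(n["label"] for n in neighbors).most_common(1)[0][0]


def ratio_score(neighbors):
    """Assumed rule: fraction of retrieved neighbors whose label is "fake"."""
    return sum(n["label"] == "fake" for n in neighbors) / len(neighbors)


def score_average(neighbors):
    """Assumed rule: mean of the stored prediction scores of the neighbors."""
    return sum(n["score"] for n in neighbors) / len(neighbors)


# Hypothetical toy knowledge database; values are made up for illustration.
database = [
    {"cm": [1.0, 0.0], "profile": [0.9, 0.1], "label": "fake", "score": 0.9},
    {"cm": [0.9, 0.1], "profile": [0.8, 0.2], "label": "fake", "score": 0.7},
    {"cm": [0.0, 1.0], "profile": [0.1, 0.9], "label": "real", "score": 0.2},
    {"cm": [0.1, 0.9], "profile": [0.2, 0.8], "label": "real", "score": 0.1},
]
query = {"cm": [0.8, 0.2], "profile": [0.7, 0.3]}

neighbors = retrieve(query, database, k=3, key="hybrid")
print(majority_vote(neighbors), ratio_score(neighbors), score_average(neighbors))
```

Because nothing is trained, covering a newly observed synthesis method in this sketch amounts to appending labeled entries to the database. An odd k avoids ties in the binary majority vote.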
Datasets
DeepFake-Eval-2024 (DE2024), AI4T
Model(s)
Wav2Vec 2.0 (SSL-based CM frontend) with an MLP backend classifier; Vox-Profile (for profile feature extraction)
Author countries
Japan