Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection

Authors: Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian

Published: 2025-08-13 17:54:55+00:00

AI Summary

This paper introduces Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset designed for robust audio deepfake detection, capturing identity-consistent transcripts, environmental/adversarial noise, and state-of-the-art voice cloning. Experiments reveal significant vulnerabilities in 22 recent audio deepfake detectors when tested on P$^{2}$V, showing up to 43% performance degradation for models trained on existing benchmarks, while P$^{2}$V-trained models maintain robustness and generalize effectively.

Abstract

Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset capturing three critical aspects of malicious deepfakes: (1) identity-consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state-of-the-art voice cloning (2020-2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trained on current datasets lose 43% performance when tested on P$^{2}$V, with performance measured as the mean of F1 score on deepfake audio, AUC, and 1-EER. Simple adversarial perturbations induce up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20-30%. In contrast, P$^{2}$V-trained models maintain robustness against these attacks while generalizing to existing datasets, establishing a new benchmark for robust audio deepfake detection. P$^{2}$V will be publicly released upon acceptance by a conference/journal.
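
The composite performance measure named in the abstract (the mean of F1 on deepfake audio, AUC, and 1-EER) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the label convention (1 = deepfake, 0 = bona fide), the 0.5 decision threshold for F1, and the linear interpolation used to locate the EER are all choices made for this example.

```python
def f1_deepfake(y_true, y_pred):
    """F1 score with the deepfake class (label 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_points(y_true, scores):
    """ROC curve as (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Samples with tied scores move across the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_from_points(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((f1 - f0) * (t0 + t1) / 2
               for (f0, t0), (f1, t1) in zip(points, points[1:]))


def eer_from_points(points):
    """Equal error rate: the point where FPR equals FNR (= 1 - TPR)."""
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        d0 = f0 - (1 - t0)  # FPR - FNR at the segment start
        d1 = f1 - (1 - t1)  # FPR - FNR at the segment end
        if d1 >= 0:
            w = -d0 / (d1 - d0)  # linear interpolation to the crossing
            return f0 + w * (f1 - f0)
    return 1.0  # not reached: the curve always ends at FPR - FNR = 1


def composite_score(y_true, scores, threshold=0.5):
    """Mean of F1 (deepfake class), AUC, and 1 - EER.

    `threshold` is an assumption of this sketch; it only affects F1.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    points = roc_points(y_true, scores)
    f1 = f1_deepfake(y_true, y_pred)
    auc = auc_from_points(points)
    eer = eer_from_points(points)
    return (f1 + auc + (1 - eer)) / 3
```

Calling `composite_score(labels, detector_scores)` on a held-out set gives a single number in [0, 1]. Averaging the three terms means a detector cannot score well by doing well on only one of them, for example by ranking samples correctly (high AUC) while using a poorly calibrated threshold (low F1).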


Key findings

Models trained on current datasets exhibit alarming vulnerabilities, losing up to 43% performance when tested on P$^{2}$V. Simple adversarial perturbations lead to up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20-30%. In contrast, models trained on P$^{2}$V maintain robustness against these attacks and generalize well to existing datasets, establishing P$^{2}$V as a new benchmark for robust detection.

Approach

The authors introduce P$^{2}$V, a dataset generated to include three critical aspects of malicious deepfakes: identity-consistent transcripts via Large Language Models (LLMs), diverse environmental and adversarial noise, and audio generated using ten state-of-the-art voice cloning methods (2020-2025). This dataset serves as a benchmark for developing robust audio deepfake detection systems.
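
As an illustration of the environmental-noise aspect, one common way to perturb a clean clip is to add a background recording scaled to a target signal-to-noise ratio (SNR). The sketch below is an assumption about how such mixing could be done, not the authors' pipeline: this summary does not state the SNR levels or the mixing procedure used for P$^{2}$V, and `mix_at_snr` is a hypothetical helper operating on plain lists of samples.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the result has the requested SNR in dB.

    Hypothetical helper for illustration. Both inputs are lists of float
    samples at the same sampling rate; the noise is tiled or trimmed to
    the length of the speech clip.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then trim.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(speech)]

    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        return list(speech)  # silent noise: nothing to add

    # Choose the scale so that p_speech / (scale^2 * p_noise) = 10^(snr_db/10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values give louder noise relative to the speech. A detector evaluated on clips mixed at several SNR values shows how quickly its score degrades as conditions move away from clean studio audio.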

Datasets

Perturbed Public Voices (P$^{2}$V), In-The-Wild (ITW), ESC-50, EchoThief Impulse Response Library

Model(s)

RawNet3, LCNN, MesoNet, SpecRNet. Features used include Linear Frequency Cepstral Coefficients (LFCC), Mel-Frequency Cepstral Coefficients (MFCC), and Whisper features.

Author countries

USA, Israel