Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection

Authors: Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian

Published: 2025-08-13 17:54:55+00:00

AI Summary

The paper introduces Perturbed Public Voices (P²V), a dataset for robust audio deepfake detection that combines identity-consistent transcripts generated by LLMs, environmental and adversarial noise, and state-of-the-art voice cloning techniques. Experiments show that detectors trained on existing benchmarks degrade sharply on P²V, while models trained on P²V remain robust to adversarial perturbations and generalize to existing datasets.

Abstract

Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset capturing three critical aspects of malicious deepfakes: (1) identity-consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state-of-the-art voice cloning (2020-2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trained on current datasets lose 43% performance when tested on P$^{2}$V, with performance measured as the mean of F1 score on deepfake audio, AUC, and 1-EER. Simple adversarial perturbations induce up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20-30%. In contrast, P$^{2}$V-trained models maintain robustness against these attacks while generalizing to existing datasets, establishing a new benchmark for robust audio deepfake detection. P$^{2}$V will be publicly released upon acceptance by a conference/journal.
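
The composite score named in the abstract (the mean of F1 score on deepfake audio, AUC, and 1-EER) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' evaluation code: the function names, the 0.5 decision threshold, the convention that label 1 means deepfake, and the linear interpolation used to locate the EER are all assumptions.

```python
from typing import List, Sequence, Tuple


def f1_deepfake(labels: Sequence[int], scores: Sequence[float],
                threshold: float = 0.5) -> float:
    """F1 for the deepfake class (label 1), using an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """ROC curve as (false positive rate, true positive rate) points."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one real and one deepfake sample")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((f1 - f0) * (t0 + t1) / 2
               for (f0, t0), (f1, t1) in zip(points, points[1:]))


def eer(points: Sequence[Tuple[float, float]]) -> float:
    """Equal error rate: where FPR equals FNR (= 1 - TPR) on the ROC curve."""
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        d0, d1 = f0 + t0 - 1, f1 + t1 - 1  # FPR - FNR at each endpoint
        if d0 <= 0 <= d1:
            if d1 == d0:
                return f0
            alpha = -d0 / (d1 - d0)  # linear interpolation (an assumption)
            return f0 + alpha * (f1 - f0)
    raise ValueError("ROC curve must span (0, 0) to (1, 1)")


def composite_score(labels: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """Mean of deepfake-class F1, AUC, and 1 - EER."""
    pts = roc_points(labels, scores)
    return (f1_deepfake(labels, scores, threshold) + auc(pts) + (1 - eer(pts))) / 3


if __name__ == "__main__":
    y = [1, 1, 1, 0, 0, 0]                 # 1 = deepfake, 0 = real
    s = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]     # hypothetical detector scores
    print(round(composite_score(y, s), 4))
```

Averaging a threshold-dependent metric (F1) with two threshold-free ones (AUC, EER) means the composite still depends on the chosen operating point, so the threshold should be reported alongside the score.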

Key findings

Models trained on existing datasets lost 43% of their performance when tested on P²V, where performance is the mean of F1 score on deepfake audio, AUC, and 1-EER. Simple adversarial perturbations caused up to 16% performance degradation, while advanced cloning techniques reduced detectability by 20-30%. In contrast, P²V-trained models remained robust to these attacks and generalized to existing datasets.

Approach

The authors created the P²V dataset by combining identity-consistent transcripts generated by LLMs with audio synthesized using state-of-the-art voice cloning methods (2020-2025), then applying environmental and adversarial perturbations to simulate real-world conditions. They then evaluated 22 recent audio deepfake detectors on this dataset.
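
Two common perturbations of the kind described above are mixing background noise into speech at a chosen signal-to-noise ratio and convolving speech with a room impulse response. The sketch below is a generic illustration in plain Python, not the authors' pipeline: the function names, the `snr_db` parameter, the noise tiling, and the choice to truncate the reverberated output to the input length are all assumptions. The listed ESC-50 and EchoThief resources would be natural sources of noise clips and impulse responses, though this page does not say how they were used.

```python
import math
from typing import List, Sequence


def rms(x: Sequence[float]) -> float:
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add noise to speech so the result has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Tile the noise clip, then trim it to the speech length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    tiled = (list(noise) * reps)[:len(speech)]
    noise_rms = rms(tiled)
    if noise_rms == 0:
        return list(speech)  # silent noise: nothing to add
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise gain.
    gain = rms(speech) / (10 ** (snr_db / 20)) / noise_rms
    return [s + gain * n for s, n in zip(speech, tiled)]


def apply_impulse_response(signal: Sequence[float],
                           ir: Sequence[float]) -> List[float]:
    """Direct-form convolution with an impulse response, cut to input length."""
    n = len(signal)
    out = [0.0] * n
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            if i + j < n:
                out[i + j] += s * h
    return out


if __name__ == "__main__":
    # Hypothetical toy signals standing in for real audio samples.
    speech = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
    noise = [((i * 37) % 11 - 5) / 5 for i in range(50)]
    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    reverberant = apply_impulse_response(noisy, [1.0, 0.0, 0.3, 0.0, 0.1])
    print(len(noisy), len(reverberant))
```

The direct convolution is quadratic in signal length and is shown only for clarity; real audio would normally use an FFT-based convolution from a numerical library.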

Datasets

Perturbed Public Voices (P²V), In-The-Wild (ITW), ESC-50, EchoThief Impulse Response Library

Model(s)

RawNet3, LCNN, MesoNet, SpecRNet

Author countries

USA, Israel
