Where are we in audio deepfake detection? A systematic analysis over generative and detection models

Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei

Published: 2024-10-06 01:03:42+00:00

AI Summary

This paper introduces SONAR, a synthetic AI-Audio Detection Framework and Benchmark for comprehensively evaluating the detection of cutting-edge AI-synthesized audio. SONAR includes a novel evaluation dataset drawn from 9 diverse audio synthesis platforms and is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based detection systems. Through extensive experiments, the authors identify generalization limitations of existing methods and highlight the superior performance of foundation models.

Abstract

Recent advances in Text-to-Speech (TTS) and Voice-Conversion (VC) using generative Artificial Intelligence (AI) technology have made it possible to generate high-quality and realistic human-like audio. This poses growing challenges in distinguishing AI-synthesized speech from the genuine human voice and could raise concerns about misuse for impersonation, fraud, spreading misinformation, and scams. However, existing detection methods for AI-synthesized audio have not kept pace and often fail to generalize across diverse datasets. In this paper, we introduce SONAR, a synthetic AI-Audio Detection Framework and Benchmark, aiming to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset sourced from 9 diverse audio synthesis platforms, including leading TTS providers and state-of-the-art TTS models. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems. Through extensive experiments, (1) we reveal the limitations of existing detection methods and demonstrate that foundation models exhibit stronger generalization capabilities, likely due to their model size and the scale and quality of pretraining data. (2) Speech foundation models demonstrate robust cross-lingual generalization capabilities, maintaining strong performance across diverse languages despite being fine-tuned solely on English speech data. This finding also suggests that the primary challenges in audio deepfake detection are more closely tied to the realism and quality of synthetic audio rather than language-specific characteristics. (3) We explore the effectiveness and efficiency of few-shot fine-tuning in improving generalization, highlighting its potential for tailored applications, such as personalized detection systems for specific entities or individuals.


Key findings
Foundation models generalize significantly better across diverse datasets and languages than traditional detection methods, which the authors attribute to model size and the scale and quality of pretraining data. The cross-lingual generalization of speech foundation models suggests that the primary challenges in audio deepfake detection stem more from the realism and quality of synthetic audio than from language-specific characteristics. Few-shot fine-tuning proves effective and efficient for improving generalization on challenging subsets, showing potential for personalized detection systems tailored to specific entities or individuals.
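The few-shot fine-tuning finding can be illustrated schematically: freeze a pretrained speech encoder and fit only a lightweight classification head on a handful of labeled clips. The sketch below is not the paper's code; it uses pure numpy with randomly generated vectors standing in for frozen foundation-model embeddings, and all function names are illustrative assumptions.

```python
import numpy as np

def train_head(embeddings, labels, lr=0.1, steps=500):
    """Fit a logistic-regression head on frozen embeddings via gradient descent.

    embeddings: (n, d) array of hypothetical foundation-model features
                (the encoder itself stays frozen in few-shot tuning).
    labels:     (n,) array, 1 = synthetic speech, 0 = genuine speech.
    """
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = embeddings @ w + b
        p = 1.0 / (1.0 + np.exp(-np.clip(logits, -30, 30)))  # sigmoid, clipped for stability
        grad = p - labels                     # gradient of cross-entropy w.r.t. logits
        w -= lr * embeddings.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def predict(embeddings, w, b):
    """Label a clip as synthetic (1) when the head's logit is positive."""
    return (embeddings @ w + b > 0).astype(int)
```

Only `w` and `b` are updated, so even a few labeled examples per target speaker or platform suffice, which is what makes the personalized-detection use case cheap.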
Approach
The authors introduce SONAR, a comprehensive framework built around a new, diverse evaluation dataset of AI-synthesized audio sourced from 9 platforms, including leading TTS providers and state-of-the-art TTS models. They uniformly benchmark a range of traditional and foundation-model-based deepfake detectors, analyzing their generalization across datasets and languages as well as under few-shot fine-tuning.
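Benchmarks in this area conventionally compare detectors by Equal Error Rate (EER), the operating point where the false-alarm rate on genuine speech equals the miss rate on synthetic speech. The paper's evaluation code is not reproduced here; the following minimal numpy sketch (function name and higher-score-means-fake convention are illustrative assumptions) shows how EER can be computed from raw detector scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER for a deepfake detector.

    scores: per-clip detector scores, higher = more likely synthetic (assumed convention).
    labels: 1 = synthetic, 0 = genuine.
    """
    order = np.argsort(scores)[::-1]          # sweep thresholds from high to low
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)                    # synthetic clips correctly flagged so far
    fp = np.cumsum(1 - labels)                # genuine clips wrongly flagged so far
    fnr = 1 - tp / n_pos                      # miss rate on synthetic speech
    fpr = fp / n_neg                          # false-alarm rate on genuine speech
    idx = np.argmin(np.abs(fnr - fpr))        # threshold where the two rates cross
    return (fnr[idx] + fpr[idx]) / 2
```

A perfectly separating detector yields an EER of 0, while chance-level scoring yields 0.5; reporting EER per evaluation subset is how cross-dataset generalization gaps like those in this paper are typically quantified.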
Datasets
SONAR (a novel evaluation dataset compiled from OpenAI, xTTS, AudioGen, Seed-TTS, VALL-E, PromptTTS2, NaturalSpeech3, VoiceBox, FlashSpeech, and real speech from LibriTTS), WaveFake, LibriSeVoc, In-the-Wild, MLAAD, ASVspoof 2019 (VC subset).
Model(s)
AASIST, RawGAT-ST, RawNet2, Spectrogram+ResNet, LFCC-LCNN, Wav2Vec2, Wav2Vec2-BERT, HuBERT, CLAP, Whisper-tiny, Whisper-base, Whisper-small, Whisper-medium, Whisper-large.
Author countries
USA