Are audio DeepFake detection models polyglots?

Authors: Bartłomiej Marek, Piotr Kawa, Piotr Syga

Published: 2024-12-23 19:32:53+00:00

Comment: Keywords: Audio DeepFakes, DeepFake detection, multilingual audio DeepFakes

AI Summary

This paper benchmarks multilingual audio DeepFake detection, evaluating various adaptation strategies on models primarily trained with English datasets. It investigates the generalizability of these models to non-English languages and compares intra-linguistic and cross-linguistic adaptation approaches. The study highlights significant variations in detection efficacy across languages and underscores the critical importance of even limited target-language data for effective DeepFake detection.

Abstract

Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that restricting the training data to English degrades detection efficacy, underscoring the importance of data in the target language.


Key findings
Audio DeepFake detection efficacy varies significantly across languages: with English-trained models, some languages (French, Polish, Italian, Ukrainian) were, surprisingly, detected more reliably than English itself, while others (Russian, Spanish) proved more challenging. Intra-linguistic adaptation was the most effective strategy: even a small amount of target-language data significantly improved detection accuracy, and this approach consistently outperformed augmenting the training set with larger, more diverse multilingual data that excluded the target language.
Approach
The authors benchmark multilingual audio DeepFake detection by evaluating several models under different training and adaptation strategies: English-trained models as a baseline, models trained from scratch on target languages, and English pre-trained models fine-tuned on one or more languages (both intra-linguistic and cross-linguistic adaptation). These settings are used to assess generalization and to quantify the impact of even limited target-language data.
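The core adaptation idea — start from a detector trained on English data, then fine-tune it on a small target-language set — can be sketched with a deliberately toy stand-in (a 1-D logistic classifier instead of the paper's AASIST-based detectors; all features, labels, and hyperparameters below are synthetic assumptions for illustration only):

```python
import math

def sgd_logistic(data, w=0.0, b=0.0, lr=0.1, epochs=200):
    """Train a 1-D logistic classifier with plain SGD.

    Toy stand-in for training (or, when w and b are passed in,
    fine-tuning) a DeepFake detector.
    """
    for _ in range(epochs):
        for x, y in data:  # x: scalar "feature", y: 0 = bona fide, 1 = fake
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    """Probability that sample x is a fake under the toy model."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Synthetic "English" training set: bona fide near -1, fakes near +1.
english = [(-1.2, 0), (-1.0, 0), (-0.8, 0), (0.8, 1), (1.0, 1), (1.2, 1)]
w_en, b_en = sgd_logistic(english)

# A synthetic "target language" whose bona fide audio falls where
# English fakes did, so the English-only model misclassifies it.
target_bona_fide = 0.5
base_score = predict(w_en, b_en, target_bona_fide)  # > 0.5: flagged as fake

# Intra-linguistic adaptation: fine-tune on a handful of target-language
# samples, starting from the English weights rather than from scratch.
target = [(0.4, 0), (0.5, 0), (2.5, 1), (2.6, 1)]
w_ft, b_ft = sgd_logistic(target, w=w_en, b=b_en)
ft_score = predict(w_ft, b_ft, target_bona_fide)    # < 0.5: bona fide
```

The analogy to the paper's finding is only qualitative, but it shows the mechanism: a few in-language samples shift the decision boundary in a way that adding more out-of-language data cannot.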
Datasets
ASVspoof2019 LA, Multi-Language Audio Anti-spoofing Dataset (MLAAD) v3, M-AILABS Speech Dataset
Model(s)
W2V+AASIST, Whisper+AASIST, LFCC+AASIST, LFCC+MesoNet, RawGAT-ST
Author countries
Germany, Poland