Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

Authors: Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen

Published: 2025-08-06 07:11:36+00:00

Comment: Accepted at Interspeech SPSC 2025 - 5th Symposium on Security and Privacy in Speech Communication (Oral)

AI Summary

This paper introduces the first benchmark for multilingual speech deepfake source tracing, investigating both mono- and cross-lingual scenarios. It comparatively analyzes DSP- and SSL-based modeling, evaluating how SSL representations fine-tuned on different languages impact cross-lingual generalization performance. The work also assesses generalization to unseen languages and speakers, providing initial insights into the challenges of identifying speech generation models when training and inference languages differ.

Abstract

Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.


Key findings

The study found that in monolingual scenarios, SSL front-ends fine-tuned on language-specific data achieved the best performance. However, LFCC features combined with ResNet or ECAPA-TDNN backends demonstrated superior cross-lingual generalization. Cross-lingual generalization was stronger within the same language family, yet substantial performance variations persisted across language pairs, and no consistent performance gap was observed between seen and unseen pseudo-speakers.

Approach

The authors establish the first multilingual benchmark (MCL-MLAAD) for speech deepfake source tracing, featuring mono- and cross-lingual protocols that include generalization to unseen languages and speakers. They compare DSP-based models (LFCC front-ends with ResNet18, AASIST, or ECAPA-TDNN backends) against SSL-based models (XLS-R-300M, wav2vec2.0 Large LV-60, and language-specific fine-tuned XLS-R variants, each paired with an AASIST backend), evaluating their ability to identify the source generative model.
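To make the DSP front-end concrete: LFCC (linear-frequency cepstral coefficients) follows the MFCC pipeline but replaces the mel-spaced filterbank with linearly spaced triangular filters, which retain more high-frequency detail where vocoder artifacts often live. Below is a minimal NumPy sketch of that pipeline; the function name and parameter defaults are illustrative and do not reproduce the paper's exact feature configuration.

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, n_fft=512, hop=160, n_filters=20, n_coeffs=20):
    """LFCC sketch: framed power spectrum -> linear triangular
    filterbank -> log -> DCT. Returns (n_frames, n_coeffs)."""
    # Frame the signal, window, and take each frame's power spectrum.
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames)                      # (n_frames, n_fft//2 + 1)

    # Triangular filters spaced linearly in frequency (vs. the mel
    # scale used by MFCCs) over the positive-frequency bins.
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        for b in range(n_bins):
            if lo <= b <= center:
                fbank[i, b] = (b - lo) / (center - lo)
            elif center < b <= hi:
                fbank[i, b] = (hi - b) / (hi - center)

    # Log filterbank energies, then decorrelate with a DCT.
    log_energy = np.log(power @ fbank.T + 1e-10)  # (n_frames, n_filters)
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_coeffs]
```

In the benchmark these frame-level features would feed a backend classifier (ResNet18, AASIST, or ECAPA-TDNN) trained to predict which generative model produced the utterance; the SSL alternative simply swaps this hand-crafted front-end for learned wav2vec2/XLS-R representations.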

Datasets

MCL-MLAAD (a refined version of MLAAD v5); MUSAN (for noise perturbations).

Model(s)

LFCC-ResNet18, LFCC-AASIST, LFCC-ECAPA-TDNN, XLS-R-300M-AASIST, wav2vec2.0 Large LV-60-AASIST, and language-specific fine-tuned XLS-R (large-xlsr-53-en/de/fr/it/pl/ru)-AASIST.

Author countries

Finland, China, Australia, Singapore