Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study

Authors: Alabi Ahmed, Vandana Janeja, Sanjay Purushotham

Published: 2026-01-30 20:38:10+00:00

Comment: This work was presented at the 2025 IEEE International Conference on Data Mining (ICDM 2025), November 12-15, 2025, Washington, DC, USA

AI Summary

This paper addresses the underexplored threat of multi-speaker conversational audio deepfakes by proposing a conceptual taxonomy and introducing a new dataset, the Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD). MsCADD comprises 2,830 real and fully synthetic two-speaker conversations generated using VITS and SoundStorm-based NotebookLM models. The authors benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset to provide a foundation for future research in this challenging area.
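
As a minimal, hedged sketch of the kind of front end an LFCC-LCNN baseline typically uses (the sample rate and number of coefficients below are assumptions, not details taken from the paper), linear-frequency cepstral coefficients can be extracted with torchaudio:

import torch
import torchaudio

# Assumption: 16 kHz mono audio and 60 LFCCs, a common anti-spoofing setup;
# the paper's actual front-end configuration may differ.
def extract_lfcc(path: str, target_sr: int = 16000, n_lfcc: int = 60) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)               # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)      # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    lfcc = torchaudio.transforms.LFCC(sample_rate=target_sr, n_lfcc=n_lfcc)
    return lfcc(waveform)                              # (1, n_lfcc, frames)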

Abstract

The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to text-to-speech (TTS) deepfakes. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baseline models provide a useful benchmark; however, they also highlight a significant gap in multi-speaker deepfake research: reliably detecting synthetic voices under varied conversational dynamics remains challenging. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, a highly underexplored area of research and a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.
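
The following minimal sketch (not from the paper) shows how the reported metrics could be computed with scikit-learn, assuming binary labels where 1 marks a fully synthetic conversation and 0 a real one; the arrays are hypothetical predictions used only for illustration:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical detector outputs; 1 = synthetic conversation, 0 = real.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("TPR (recall on synthetic):", tp / (tp + fn))
print("TNR (recall on real):", tn / (tn + fp))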


Key findings
The pilot study reveals that while modern end-to-end models like RawNet2 and self-supervised models like Wav2Vec 2.0 outperform traditional convolutional baselines such as LFCC-LCNN, significant challenges remain. Specifically, all models struggle with reliably detecting synthetic voices under varied conversational dynamics, highlighting the need for detection strategies that leverage conversational structure for improved performance in multi-speaker deepfake scenarios.
Approach
The authors define a taxonomy for multi-speaker conversational audio deepfakes. They then create a novel dataset, MsCADD, consisting of real and fully synthetic two-speaker conversations generated with VITS and SoundStorm-based models. Finally, they benchmark three established neural models (LFCC-LCNN, RawNet2, Wav2Vec 2.0) on MsCADD to evaluate their performance in detecting these deepfakes.
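
As an illustrative sketch only (the backbone choice, pooling, and classification head are assumptions; the paper's exact fine-tuning setup is not described here), a Wav2Vec 2.0 baseline could pool self-supervised frame features and feed them to a small binary real-vs-synthetic head:

import torch
import torchaudio

# Pretrained Wav2Vec 2.0 backbone from torchaudio, used here as an assumed
# stand-in for the self-supervised baseline; expects 16 kHz mono waveforms.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
backbone = bundle.get_model().eval()

# Hypothetical binary head: mean-pool frame features, then real vs. synthetic.
head = torch.nn.Linear(768, 2)

def classify(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, samples) sampled at bundle.sample_rate (16 kHz)."""
    with torch.no_grad():
        features, _ = backbone.extract_features(waveform)
    pooled = features[-1].mean(dim=1)        # (1, 768) utterance embedding
    return head(pooled).softmax(dim=-1)      # probabilities over {real, fake}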
Datasets
Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD), English Conversation Corpus
Model(s)
LFCC-LCNN, RawNet2, Wav2Vec 2.0
Author countries
United States