Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges

Authors: Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik

Published: 2025-06-30 23:41:04+00:00

AI Summary

This paper introduces a comprehensive methodology for collecting, curating, and generating high-quality synthetic speech data for ten public figures, addressing the challenges of maintaining voice authenticity. It details an automated pipeline for bonafide speech sample collection, featuring transcription-based segmentation that significantly enhances synthetic speech quality. The resulting 'Famous Figures' dataset demonstrates superior naturalness with a NISQA-TTS score of 3.69 and achieves a 61.9% human misclassification rate, indicating high realism.

Abstract

Recent advances in speech synthesis have introduced unprecedented challenges in maintaining voice authenticity, particularly concerning public figures who are frequent targets of impersonation attacks. This paper presents a comprehensive methodology for collecting, curating, and generating synthetic speech data for political figures and a detailed analysis of challenges encountered. We introduce a systematic approach incorporating an automated pipeline for collecting high-quality bonafide speech samples, featuring transcription-based segmentation that significantly improves synthetic speech quality. We experimented with various synthesis approaches, from single-speaker to zero-shot synthesis, and documented the evolution of our methodology. The resulting dataset comprises bonafide and synthetic speech samples from ten public figures, demonstrating superior quality with a NISQA-TTS naturalness score of 3.69 and the highest human misclassification rate of 61.9%.


Key findings
The developed 'Famous Figures' dataset achieved superior synthetic speech quality, evidenced by a NISQA-TTS naturalness score of 3.69, which is higher than that of comparable 'in the wild' datasets. Human evaluation revealed a high misclassification rate of 61.9% for the synthetic speech samples, indicating strong realism and difficulty of detection for human listeners. The systematic data collection and transcription-based segmentation significantly improved the overall quality of the generated synthetic speech.
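As an illustration of the headline metric, the 61.9% human misclassification rate can be read as the fraction of listener judgments that assign the wrong label to a clip. A minimal sketch of that computation, with a hypothetical label scheme and trial data not taken from the paper:

```python
# Hypothetical sketch: how a human-evaluation misclassification rate
# (like the reported 61.9%) could be computed from listener judgments.
# The label names and trial data below are illustrative assumptions.

def misclassification_rate(judgments):
    """judgments: list of (true_label, listener_label) pairs,
    where each label is 'bonafide' or 'synthetic'."""
    if not judgments:
        return 0.0
    wrong = sum(1 for truth, guess in judgments if truth != guess)
    return wrong / len(judgments)

# Example: 5 synthetic clips, 3 of which listeners judged bonafide
trials = [
    ("synthetic", "bonafide"),
    ("synthetic", "bonafide"),
    ("synthetic", "synthetic"),
    ("synthetic", "bonafide"),
    ("synthetic", "synthetic"),
]
print(f"{misclassification_rate(trials):.1%}")  # 60.0%
```

A higher rate on synthetic clips means listeners more often mistook generated speech for genuine recordings.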
Approach
The authors developed an automated pipeline to collect high-quality bonafide speech samples from public figures, employing transcription-based segmentation to improve synthetic speech quality. They generated synthetic speech using various text-to-speech (TTS) approaches, including speaker-specific training (e.g., StyleTTS2), few-shot fine-tuning (e.g., XTTSv2, StyleTTS2), and zero-shot synthesis (e.g., F5TTS, E2TTS, FishSpeech).
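The transcription-based segmentation step can be sketched as follows: given timestamped segments from an ASR system (such as Whisper), nearby fragments are merged and only utterances in a TTS-friendly duration range are kept. The segment format, thresholds, and merging rule here are assumptions for illustration, not the paper's exact pipeline:

```python
# Illustrative sketch of transcription-based segmentation. Input is a
# list of ASR segments (dicts with 'start', 'end', 'text' in seconds),
# as produced by tools like Whisper. Thresholds are assumed values.

def segment_for_tts(segments, min_dur=2.0, max_dur=15.0, max_gap=0.3):
    """Merge adjacent segments separated by short pauses, then keep
    only utterances whose duration suits TTS training data."""
    merged = []
    for seg in segments:
        if (merged
                and seg["start"] - merged[-1]["end"] <= max_gap
                and seg["end"] - merged[-1]["start"] <= max_dur):
            # Short pause: extend the previous utterance
            merged[-1]["end"] = seg["end"]
            merged[-1]["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))
    # Drop clips too short or too long for clean synthesis training
    return [s for s in merged
            if min_dur <= s["end"] - s["start"] <= max_dur]

segs = [
    {"start": 0.0, "end": 1.5, "text": "Hello"},
    {"start": 1.6, "end": 4.0, "text": "everyone"},
    {"start": 10.0, "end": 10.5, "text": "uh"},  # too short, dropped
]
print(segment_for_tts(segs))  # one merged utterance: "Hello everyone"
```

The resulting time ranges would then be used to cut the source audio into clean, sentence-level clips before synthesis training.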
Datasets
Famous Figures (created by the authors)
Model(s)
UNKNOWN
Author countries
USA