Descriptor: Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD-SVDSR)

Authors: Rahul Vijaykumar, Ajan Ahmed, John Parker, Dinesh Pendyala, Aidan Collins, Stephanie Schuckers, Masudul H. Imtiaz

Published: 2025-09-30 19:46:50+00:00

AI Summary

This paper introduces ELAD-SVDSR, a novel extended-length audio dataset designed for synthetic voice detection and speaker recognition. It comprises 45-minute audio recordings from 36 participants, captured with five different microphones, along with 20 generated deepfake voices. The dataset aims to facilitate the creation of high-quality deepfakes and the development of robust detection systems.

Abstract

This paper introduces the Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD-SVDSR), a resource specifically designed to facilitate the creation of high-quality deepfakes and support the development of detection systems trained against them. The dataset comprises 45-minute audio recordings from 36 participants, each reading various newspaper articles under controlled conditions, captured via five microphones of differing quality. By focusing on extended-duration audio, ELAD-SVDSR captures a richer range of speech attributes, such as pitch contours, intonation patterns, and nuanced delivery, enabling models to generate more realistic and coherent synthetic voices. In turn, this approach allows for the creation of robust deepfakes that can serve as challenging examples in datasets used to train and evaluate synthetic voice detection methods. As part of this effort, 20 deepfake voices have already been created and added to the dataset to showcase its potential. Anonymized metadata on speaker demographics accompanies the dataset. ELAD-SVDSR is expected to spur significant advancements in audio forensics, biometric security, and voice authentication systems.


Key findings
The ELAD-SVDSR dataset provides extended-duration audio (45 min per speaker) from 36 demographically diverse participants, recorded via five distinct microphones, and includes 20 deepfake voices. The recordings exhibit high signal-to-noise ratios (mean 57.41 dB), reflecting high recording quality. The generated deepfakes achieve higher normalized VeriSpeak match scores (37.2%) than deepfakes created from other common speech datasets, indicating their potential as challenging examples for detection systems.
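The reported mean SNR of 57.41 dB is a logarithmic power ratio. As a minimal sketch (the paper's exact SNR estimation procedure is not described in this summary), the conventional decibel computation from mean signal and noise power looks like:

```python
import math

def snr_db(signal: list[float], noise: list[float]) -> float:
    """SNR in decibels: 10 * log10(mean signal power / mean noise power)."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

# A signal whose amplitude is 10x the noise floor gives 20 dB;
# a mean of 57.41 dB therefore implies a far cleaner recording.
print(snr_db([10.0, -10.0, 10.0, -10.0], [1.0, -1.0, 1.0, -1.0]))  # → 20.0
```

Every additional 10 dB corresponds to a tenfold increase in the signal-to-noise power ratio.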
Approach
The authors created a new dataset by recording 36 participants reading newspaper articles for 45 minutes each, using five different microphones under controlled conditions. They then preprocessed this data and generated 20 deepfake voices using Tortoise TTS, which were included in the dataset, along with anonymized speaker demographics.
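Voice-cloning systems such as Tortoise TTS typically condition on short reference clips of the target speaker, so a 45-minute recording would be segmented before cloning. As a hypothetical sketch (the paper's actual preprocessing pipeline, clip length, and parameters are not specified in this summary), fixed-length clip extraction might look like:

```python
def split_into_clips(samples, sample_rate, clip_seconds=10.0):
    """Split a long mono recording into fixed-length clips,
    dropping any short trailing remainder."""
    clip_len = int(sample_rate * clip_seconds)
    return [samples[i:i + clip_len]
            for i in range(0, len(samples) - clip_len + 1, clip_len)]

# A 45-minute recording (toy 100 Hz rate to keep the example small)
# yields 270 ten-second conditioning clips.
clips = split_into_clips([0.0] * (45 * 60 * 100), sample_rate=100)
print(len(clips))  # → 270
```

In practice one would also filter out clips dominated by silence or page-turn noise before feeding them to the synthesizer.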
Datasets
ELAD-SVDSR (Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition)
Model(s)
UNKNOWN
Author countries
USA