Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models
Authors: Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
Published: 2026-01-28 11:39:40+00:00
Comment: This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-AD011016076)
AI Summary
This work evaluates three advanced Text-to-Speech (TTS) models (Dia2, Maya1, and MeloTTS) against several audio deepfake detection frameworks to assess how robust current detectors are to modern synthetic speech. Detector performance varies significantly across TTS architectures, with LLM-based synthesis posing a particular challenge for single-paradigm detectors. The study concludes that integrated, multi-view detection strategies are necessary for robust performance against the evolving landscape of audio deepfake threats, as sketched below.
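The multi-view idea can be made concrete with a minimal sketch. The view names and score-fusion rule below are illustrative assumptions, not the paper's actual implementation: each single-paradigm detector is assumed to emit one spoof score per clip, and the fused score averages min-max-normalized views so that no single analysis level dominates.

```python
import numpy as np

def minmax_norm(scores: np.ndarray) -> np.ndarray:
    """Rescale one detector's raw scores to [0, 1] so views are comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

def fuse_views(view_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Average normalized scores from complementary analysis levels
    (e.g. semantic, structural, signal-level) into one decision score."""
    normed = [minmax_norm(s) for s in view_scores.values()]
    return np.mean(normed, axis=0)

# Hypothetical per-view scores for 4 clips (higher = more likely spoofed).
scores = {
    "semantic":   np.array([0.10, 0.80, 0.30, 0.95]),
    "structural": np.array([0.20, 0.60, 0.40, 0.90]),
    "signal":     np.array([0.05, 0.70, 0.55, 0.85]),
}
print(fuse_views(scores))  # one fused spoof score per clip
```

Simple mean fusion is only one possible rule; weighted or learned fusion would fit the same interface.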
Abstract
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS) representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: detectors effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
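The comparative protocol the abstract describes can be sketched as follows. This is a minimal sketch under stated assumptions: `detector.score` is a hypothetical stand-in for the detection frameworks (the abstract names no API), the equal error rate (EER) is a standard metric for this task rather than one the abstract specifies, and the clip collections are assumed to be provided.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2)

def evaluate(detector, real_clips: list, fake_clips: list) -> float:
    """Score real and synthetic clips with one detector and return its EER.
    Assumes detector.score(clip) returns a float, higher = more likely spoofed."""
    scores = np.array([detector.score(c) for c in real_clips + fake_clips])
    labels = np.array([0] * len(real_clips) + [1] * len(fake_clips))
    return equal_error_rate(labels, scores)

# One EER per (detector, TTS model) pair, mirroring the paper's comparison.
# `detectors`, `real_clips`, and `fakes_by_tts` are assumed inputs:
# results = {
#     (name, tts): evaluate(det, real_clips, clips)
#     for name, det in detectors.items()
#     for tts, clips in fakes_by_tts.items()
# }
```

Reading the resulting grid row-wise shows which detectors generalize across generative mechanisms and column-wise which TTS architectures (per the abstract, notably LLM-based synthesis) evade single-paradigm detectors.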