Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models

Authors: Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda

Published: 2026-01-28 11:39:40+00:00

Comment: This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-AD011016076)

AI Summary

This work evaluates three advanced Text-to-Speech (TTS) models (Dia2, Maya1, and MeloTTS) against four audio deepfake detection frameworks to assess the detectors' robustness against modern synthetic speech. It finds that detector performance varies significantly across TTS architectures, with LLM-based synthesis posing a particular challenge for single-paradigm detectors. The study highlights the necessity of integrated, multi-view detection strategies for robust performance against the evolving landscape of audio deepfake threats.

Abstract

Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures, respectively. A corpus of 12,000 synthetic audio samples was generated using the DailyDialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.


Key findings
Detector performance varies significantly depending on the underlying TTS generative mechanism; models effective against one architecture may fail against others, especially LLM-based synthesis (Maya1). Semantic detectors (Whisper-MesoNet) show vulnerabilities to LLM-based TTS but excel against non-autoregressive systems, while structural/hierarchical detectors (XLS-R-SLS, SSL-AASIST) are more effective against autoregressive artifacts. A proprietary multi-view detection approach (UncovAI) demonstrated near-perfect separation across all evaluated modern TTS models, highlighting the need for integrated strategies.
Approach
The authors generated a corpus of 12,000 synthetic audio samples using three state-of-the-art TTS models (Dia2, Maya1, MeloTTS) and the DailyDialog dataset. These samples were then evaluated against four detection frameworks: a semantic approach (Whisper-MesoNet), two structural/hierarchical approaches (SSL-AASIST and XLS-R-SLS), and a proprietary multi-view detection model (UncovAI).
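The summary above does not name the evaluation metric, but audio deepfake detectors are conventionally scored with the Equal Error Rate (EER): the operating point where the false-acceptance rate on spoofed audio equals the false-rejection rate on bona fide audio. As a minimal sketch of how such per-detector scores might be compared (the score distributions below are synthetic toy data, not the paper's results):

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal Error Rate: threshold where the false-acceptance rate (spoof
    scored above threshold) equals the false-rejection rate (bona fide
    scored below threshold). Higher scores mean 'more likely bona fide'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest crossing of the two curves
    return float((far[idx] + frr[idx]) / 2)

# Toy detector: bona fide audio tends to score higher than synthetic audio.
rng = np.random.default_rng(0)
bonafide = rng.normal(1.0, 0.5, 1000)   # hypothetical real-speech scores
spoof = rng.normal(-1.0, 0.5, 1000)     # hypothetical TTS-sample scores
eer = compute_eer(bonafide, spoof)
print(f"EER: {eer:.3f}")  # well-separated distributions give a low EER
```

A "near-perfect separation across all evaluated models," as reported for the multi-view detector, would correspond to an EER close to zero for each TTS system.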
Datasets
DailyDialog, UncovAI TTS synthetic and real multilingual text-to-speech dataset
Model(s)
Whisper-MesoNet, SSL-AASIST (wav2vec 2.0 XLS-R + AASIST), XLS-R-SLS (XLS-R-300M with Sensitive Layer Selection), UncovAI Detector
Author countries
UNKNOWN