Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Authors: Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

Published: 2025-11-14 03:00:04+00:00

AI Summary

The paper introduces HARMGEN, a suite of five multi-modal attacks designed to bypass the safety mechanisms of Large Audio-Language Models (LALMs) used for Text-to-Speech (TTS). These attacks exploit semantic obfuscation (Concat, Shuffle) and audio-modality vulnerabilities (Read, Spell, Phoneme) to generate high-fidelity audio containing explicitly harmful linguistic content. The study demonstrates that HARMGEN substantially reduces refusal rates across commercial TTS systems and exposes critical vulnerabilities in current reactive and proactive defense mechanisms.

Abstract

Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.


Key findings
HARMGEN attacks are highly effective, compelling LALM-based TTS models that initially refused most toxic prompts to synthesize up to 100% of the harmful content. Text-modality attacks (Concat/Shuffle) generally achieved the largest reduction in refusal rates. Furthermore, state-of-the-art deepfake audio detectors (AASIST2) and reactive transcribe-then-moderate defenses largely fail against the generated high-fidelity adversarial audio, though proactive moderation by TTS providers detects between 57% and 93% of attack instances.
Approach
The authors propose HARMGEN, split into two families. The first family uses semantic concealment (Concat, Shuffle): harmful input text is broken up or reordered so that no single request triggers moderation, and the spoken audio is then reassembled. The second family uses multi-modal audio exploits (Read, Spell, Phoneme), in which the harmful word is covertly supplied to the LALM as an auxiliary audio clip (read aloud, spelled out, or rendered as phoneme sounds) while the text prompt remains benign, leading the model to synthesize the full toxic sentence.
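To make the semantic-concealment idea concrete, here is a minimal, hypothetical sketch of a Concat-style split (not the paper's actual implementation; the function name, chunking strategy, and example strings are illustrative only): a flagged word is divided into benign-looking fragments so that no single TTS request contains it, and the resulting audio clips would be concatenated afterward.

```python
# Hypothetical sketch of Concat-style semantic obfuscation.
# Assumption: the defense is a per-request text filter that matches
# whole flagged words; splitting the word across requests evades it.

def concat_obfuscate(sentence: str, flagged_word: str, chunk_size: int = 2):
    """Split `flagged_word` into short fragments and return the list of
    per-request text segments; none of them contains the full word."""
    fragments = [flagged_word[i:i + chunk_size]
                 for i in range(0, len(flagged_word), chunk_size)]
    prefix, _, suffix = sentence.partition(flagged_word)
    # Each element is one benign-looking TTS request; concatenating the
    # synthesized clips would reconstruct the full harmful utterance.
    return [prefix] + fragments + [suffix]

segments = concat_obfuscate("please say badword now", "badword")
print(segments)  # ['please say ', 'ba', 'dw', 'or', 'd', ' now']
```

The audio-modality attacks (Read, Spell, Phoneme) follow the same separation-of-channels principle, but hide the harmful token in an auxiliary audio input rather than in split text.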
Datasets
Ethos, Mul-ZH, Self, LibriSpeech
Model(s)
AASIST2
Author countries
USA, China, Hong Kong