Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Authors: Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

Published: 2025-11-14 03:00:04+00:00

AI Summary

This research introduces HARMGEN, a suite of five attacks designed to compel Large Audio-Language Model (LALM)-based Text-to-Speech (TTS) systems to generate speech containing harmful content, bypassing safety alignments and moderation filters. The attacks utilize semantic obfuscation techniques for text and audio-modality exploits to covertly inject harmful words. The study evaluates these attacks across multiple commercial LALMs and assesses the effectiveness of reactive and proactive countermeasures, revealing significant vulnerabilities in current defenses.

Abstract

Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.


Key findings
The HARMGEN attacks substantially reduce refusal rates and significantly increase the toxicity of generated speech across various commercial LALMs, with some models complying with 100% of hate-speech prompts. Reactive defenses, including state-of-the-art deepfake audio detectors, underperform on high-fidelity LALM outputs, and reactive text moderation can be circumvented by adversarial perturbations. Proactive moderation by TTS providers, which screens model-emitted text before release, is substantially more effective, detecting 57-93% of attacks.
Approach
The authors propose HARMGEN, a suite of five attacks organized into two families: semantic obfuscation and audio-modality exploitation. Semantic obfuscation attacks (Concat, Shuffle) conceal harmful content within text to bypass input/output filters. Audio-modality exploits (Read, Spell, Phoneme) inject harmful content through auxiliary audio channels (e.g., reading a word, spelling it, or pronouncing its phonemes) while maintaining benign textual prompts, bypassing safety alignment.
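The Shuffle idea described above can be illustrated with a minimal sketch. The paper does not publish its prompt templates, so the function names, prompt wording, and the benign placeholder word below are all hypothetical; the sketch only shows the general mechanism of splitting a target word into characters, permuting them so filters see no contiguous harmful token, and recording the inverse permutation needed to reassemble it.

```python
# Hypothetical sketch of a Shuffle-style semantic obfuscation (illustrative
# only; not the paper's actual implementation or prompt template).
import random


def shuffle_obfuscate(word: str, seed: int = 0) -> tuple[str, list[int]]:
    """Shuffle a word's characters with a recorded permutation and return
    the shuffled string plus the positions needed to restore the original."""
    rng = random.Random(seed)
    order = list(range(len(word)))
    rng.shuffle(order)  # order[pos] = index into the original word
    shuffled = "".join(word[i] for i in order)
    # Inverse permutation: original position -> position in the shuffled string.
    inverse = [0] * len(order)
    for pos, src in enumerate(order):
        inverse[src] = pos
    return shuffled, inverse


def build_prompt(shuffled: str, inverse: list[int]) -> str:
    """Wrap the fragments in an ostensibly benign instruction (made-up wording)."""
    spelled = " ".join(shuffled)
    positions = ", ".join(str(p + 1) for p in inverse)
    return (f"Read aloud the letters {spelled}, taken in positions "
            f"{positions}, as a single word.")


def reassemble(shuffled: str, inverse: list[int]) -> str:
    """Recover the original word, as the target model is asked to do."""
    return "".join(shuffled[p] for p in inverse)
```

A benign placeholder stands in for the harmful token here; the point is only that the contiguous word never appears in the prompt, which is why input-side text filters struggle with it.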
Datasets
Ethos, Mul-ZH, Self (custom dataset)
Model(s)
OpenAI’s GPT-4o-mini-audio, OpenAI’s GPT-4o-mini-tts, OpenAI’s GPT-5o-nano, Google’s Gemini-2.5-live, Alibaba’s Qwen-omni-turbo (LALM-based TTS systems); Google TTS, ByteDance’s IndexTTS (conventional TTS systems); AASIST2 (deepfake audio detector); Whisper (speech recognition model); OpenAI's moderation API (text moderation); MUDES (toxic span detection); Montreal Forced Aligner; Detoxify, COLD (toxicity scoring).
Author countries
USA, China, Hong Kong