ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Authors: Aurosweta Mahapatra, Ismail Rasim Ulgen, Kong Aik Lee, Nicholas Andrews, Berrak Sisman

Published: 2026-04-14 18:56:13+00:00

Comment: Submitted to Interspeech 2026

AI Summary

ProSDD is a two-stage framework for speech deepfake detection designed to improve generalization against expressive and emotional spoofing attacks. It enriches model embeddings by first learning speaker-conditioned prosodic variation from real speech via supervised masked prediction, then jointly optimizing this objective with spoof classification. The framework significantly outperforms baselines on challenging emotional datasets while maintaining strong performance on standard benchmarks.

Abstract

Speech deepfake detection (SDD) systems perform well on standard benchmark datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.


Key findings
ProSDD consistently reduces Equal Error Rates (EER) on expressive and emotional datasets, achieving up to 50% relative reduction on EmoFake and EmoSpoof-TTS, and significant improvements on ASVspoof 2024. The two-stage approach, particularly the real-only prosodic pretraining in Stage I, is critical for improved generalization across distribution shifts and attack types, while maintaining competitive performance on traditional benchmarks.
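For readers unfamiliar with the metric, the Equal Error Rate (EER) quoted above is the operating point where the false-acceptance rate (spoofed speech accepted as bona fide) equals the false-rejection rate (bona fide speech flagged as spoofed). A minimal, dependency-free sketch of how EER is computed from detector scores (a threshold sweep; production toolkits use interpolated ROC curves instead):

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate via a sweep over candidate thresholds.

    Convention assumed here: higher score = more likely bona fide.
    Returns the midpoint of FAR and FRR at the threshold where
    their gap is smallest (a common simple approximation).
    """
    thresholds = sorted(bonafide_scores + spoof_scores)
    best = None
    for t in thresholds:
        # False acceptance: spoofed utterance scored at/above threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide utterance scored below threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2
```

A "50% relative reduction" then simply means the new EER is half the baseline's, e.g. a drop from 25.43% to 16.14% is a (25.43 - 16.14) / 25.43 ≈ 36.5% relative reduction.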
Approach
ProSDD employs a two-stage training strategy. Stage I fine-tunes a pretrained SSL backbone (XLS-R) on bona fide speech only, learning structured speaker-conditioned prosodic representations through a supervised masked prediction objective. Stage II initializes from the Stage I weights and jointly optimizes spoof classification with the same prosodic masked prediction objective as an auxiliary task, using both real and fake speech.
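The masked prediction objective described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the mask ratio, the loss form (plain MSE over masked frames), and the joint-loss weight `lam` are all assumptions, and the real model predicts speaker-conditioned prosodic targets (pitch, voice activity, energy) from SSL features rather than scalars:

```python
import random

def make_mask(num_frames, mask_ratio=0.5, seed=0):
    """Randomly select which prosodic frames to mask (ratio is an assumption)."""
    rng = random.Random(seed)
    k = max(1, int(num_frames * mask_ratio))
    masked_idx = set(rng.sample(range(num_frames), k))
    return [i in masked_idx for i in range(num_frames)]

def masked_prosody_loss(pred, target, mask):
    """MSE restricted to masked frames: the model is scored only on
    the prosodic values it had to reconstruct, not on visible ones."""
    masked = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    if not masked:
        return 0.0
    return sum((p - t) ** 2 for p, t in masked) / len(masked)

def stage2_loss(cls_loss, pred, target, mask, lam=0.1):
    """Stage II joint objective: spoof classification plus the Stage I
    prosodic term as an auxiliary loss (weighting is hypothetical)."""
    return cls_loss + lam * masked_prosody_loss(pred, target, mask)
```

In Stage I only the prosodic term is optimized (on real speech); Stage II adds the classification term and trains on both classes, so the embeddings keep encoding natural prosodic variability while learning to separate bona fide from spoofed speech.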
Datasets
LibriSpeech train-clean-100, LibriSpeech dev, ASVspoof 2019 LA train/dev, ASVspoof 2024 train/dev, ASVspoof 2019 LA, ASVspoof 2021 LA, EmoFake, EmoSpoof-TTS, ASVspoof 2024 Track 1
Model(s)
ProSDD (framework), XLS-R (SSL backbone), ECAPA-TDNN (speaker embedding model), prosody encoder
Author countries
USA, Hong Kong