The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

Authors: Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

Published: 2025-12-12 02:49:18+00:00

AI Summary

The paper proposes "The Affective Bridge" (EmoBridge), a novel training framework that unifies diverse feature representations (acoustic descriptors, automatic speech recognition features, and speaker verification features) for speech deepfake detection. EmoBridge uses emotion recognition as a bridging task, applying continual learning to transfer and align existing features into a more robust and interpretable feature space. The method consistently improves detection performance, with substantial gains in accuracy and EER on the FakeOrReal and In-the-Wild datasets.

Abstract

Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select a different feature set, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a unique human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements in accuracy, with gains of up to approximately 6% and 2%, respectively, and in equal error rate (EER), with reductions of up to about 4% and 1%, respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.


Key findings
The EmoBridge strategy consistently improved performance, achieving up to a 6% increase in accuracy and a 4% reduction in EER on the FakeOrReal dataset, and up to a 2% accuracy increase on the In-the-Wild dataset. The integration of affective cues proved most beneficial for deep-learned raw features and for deepfakes derived from real human speech, suggesting that emotion-related features are a strong discriminative cue against current generation methods. Performance on ASVspoof2019 was comparable to baseline methods.
Approach
The EmoBridge framework uses a continual learning approach where pre-trained feature encoders (from models like Whisper, SpeechT5, or WavLM) are fine-tuned on a general emotion recognition task using a diverse set of emotion datasets. This process fuses affective cues into the encoder representations, which are then frozen and used to extract input features for a final deepfake detection classifier, such as a Support Vector Machine (SVM).
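The two-stage pipeline described above can be sketched in code. The following Python example illustrates the flow with WavLM as the encoder and an SVM as the final detector; the class names, hyper-parameters, emotion-label taxonomy, and data-loading details are illustrative assumptions rather than the authors' released implementation.

# Minimal sketch of the two-stage EmoBridge-style pipeline described above.
# Stage 1: fine-tune a pre-trained speech encoder (here WavLM, one of the
# encoders named in the paper) on emotion recognition so affective cues are
# fused into its representations.
# Stage 2: freeze the encoder and train a lightweight deepfake classifier
# (an SVM, as in the paper) on the frozen emotion-informed features.
# Dataset loading and hyper-parameters are assumptions for illustration.

import torch
import torch.nn as nn
from transformers import WavLMModel
from sklearn.svm import SVC


class EmotionFineTuner(nn.Module):
    """WavLM encoder plus a linear head for the bridging emotion-recognition task."""

    def __init__(self, num_emotions: int = 7):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        # waveforms: (batch, samples) at 16 kHz
        hidden = self.encoder(waveforms).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)                         # temporal mean pooling
        return self.head(pooled)                            # emotion logits


def finetune_on_emotion(model, emotion_loader, epochs: int = 3, lr: float = 1e-5):
    """Stage 1: continual-learning step that injects affective cues into the encoder."""
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for waveforms, emotion_labels in emotion_loader:
            optimiser.zero_grad()
            loss = criterion(model(waveforms), emotion_labels)
            loss.backward()
            optimiser.step()
    return model


@torch.no_grad()
def extract_frozen_features(model, waveforms: torch.Tensor) -> torch.Tensor:
    """Stage 2 input: utterance embeddings from the frozen, emotion-informed encoder."""
    model.eval()
    hidden = model.encoder(waveforms).last_hidden_state
    return hidden.mean(dim=1)                               # (B, H)


def train_deepfake_svm(features, labels) -> SVC:
    """Final detector: SVM over frozen features (bona fide vs. spoofed).

    `features` should be a NumPy array, e.g.
    extract_frozen_features(model, waveforms).cpu().numpy().
    """
    clf = SVC(kernel="rbf")
    clf.fit(features, labels)
    return clf

Freezing the encoder after the emotion fine-tuning stage mirrors the paper's continual-learning design: the affective cues are baked into the representation once, and only the lightweight downstream classifier is trained on the deepfake labels.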
Datasets
ASVspoof2019 LA, FakeOrReal (FoR), In-the-Wild (ITW), Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), CREMA-D, RAVDESS, Emotional Speech Dataset (ESD).
Model(s)
openSMILE, Whisper, SpeechT5, WavLM, SVM, HuBERT.
Author countries
UK, Germany