Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection

Authors: Jinhua Zhang, Zhenqi Jia, Rui Liu

Published: 2026-01-20 11:01:26+00:00

Comment: Accepted by ICASSP 2026

AI Summary

This paper proposes EAI-ADD, a novel audio deepfake detection framework that uses cross-level emotion-acoustic inconsistency as the primary detection signal. It addresses a limitation of prior methods, which either treat acoustic and emotional features in isolation or rely on correlation metrics that overlook subtle desynchronization between the two streams and smooth out abrupt discontinuities in spoofed speech. EAI-ADD projects emotional and acoustic representations into a comparable space, then progressively integrates frame-level and utterance-level emotion features with acoustic features to capture inconsistencies across temporal granularities.
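
The idea can be illustrated with a minimal sketch: emotion and acoustic frame features are projected into a shared space, and inconsistency is scored at both the frame and utterance level. This is not the authors' implementation; the dimensions, module names, and cosine-based scoring below are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of cross-level emotion-acoustic
# inconsistency scoring. Feature dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InconsistencySketch(nn.Module):
    def __init__(self, emo_dim=768, ac_dim=160, shared_dim=128):
        super().__init__()
        # Project both streams into a comparable (shared) space.
        self.emo_proj = nn.Linear(emo_dim, shared_dim)
        self.ac_proj = nn.Linear(ac_dim, shared_dim)
        # Classify from the two inconsistency scores (bonafide vs. spoof).
        self.classifier = nn.Linear(2, 2)

    def forward(self, emo_frames, ac_frames):
        # emo_frames: (B, T, emo_dim); ac_frames: (B, T, ac_dim)
        e = F.normalize(self.emo_proj(emo_frames), dim=-1)
        a = F.normalize(self.ac_proj(ac_frames), dim=-1)

        # Frame-level inconsistency: 1 - cosine similarity, averaged over frames.
        frame_incons = (1.0 - (e * a).sum(-1)).mean(dim=1)   # (B,)

        # Utterance-level inconsistency: compare mean-pooled representations.
        e_utt = F.normalize(e.mean(dim=1), dim=-1)
        a_utt = F.normalize(a.mean(dim=1), dim=-1)
        utt_incons = 1.0 - (e_utt * a_utt).sum(-1)            # (B,)

        scores = torch.stack([frame_incons, utt_incons], dim=-1)
        return self.classifier(scores)                        # (B, 2) logits
```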

Abstract

Audio Deepfake Detection (ADD) aims to distinguish spoofed speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between them and smooth out abrupt discontinuities. To address these issues, we propose EAI-ADD, which treats cross-level emotion-acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space. Then we progressively integrate frame-level and utterance-level emotion features with acoustic features to capture cross-level emotion-acoustic inconsistencies across different temporal granularities. Experimental results on the ASVspoof 2019LA and 2021LA datasets demonstrate that the proposed EAI-ADD outperforms baselines, providing a more effective solution for audio anti-spoofing detection.


Key findings
EAI-ADD consistently outperforms strong baselines on both ASVspoof 2019LA and 2021LA datasets, achieving a min t-DCF of 0.0110 and an EER of 0.34% on 2019LA. The method demonstrates improved generalization to unseen spoofing conditions, primarily due to its explicit modeling of emotion-acoustic inconsistency and cross-level graph relations, which highlight abnormal emotion variations and mismatches.
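
For reference, the EER quoted above is the operating point where the false-acceptance and false-rejection rates coincide. A standard way to estimate it from detection scores (not the authors' evaluation script) is sketched below; the function name and threshold sweep are illustrative.

```python
# Standard EER estimation from higher-is-more-bonafide detection scores.
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)          # sweep thresholds over observed scores
    labels = labels[order]
    n_bona, n_spoof = labels.sum(), (1 - labels).sum()
    frr = np.cumsum(labels) / n_bona            # bonafide rejected below threshold
    far = 1.0 - np.cumsum(1 - labels) / n_spoof  # spoof accepted above threshold
    idx = np.argmin(np.abs(frr - far))           # point where the two rates meet
    return float((frr[idx] + far[idx]) / 2.0)


# Example: compute_eer(np.array([2.1, 1.8, 3.0]), np.array([-1.0, 0.2]))
```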
Approach
EAI-ADD detects audio deepfakes by modeling the inconsistency between emotional dynamics and acoustic patterns. It employs an Emotion-Acoustic Alignment Module (EAAM) to project features into a unified space and an Emotion-Acoustic Inconsistency Modeling Module (EAIMM) that uses an Emotional Variation Amplification Loss (EVAL) and a Hierarchical Inconsistency Graph (HIG) to capture cross-level inconsistencies.
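
The paper's exact EVAL formulation and HIG construction are not reproduced in this summary. The sketch below only illustrates the stated intent of the loss, amplifying abrupt frame-to-frame emotion variation as a spoofing cue; the function name, margin term, and label convention are assumptions.

```python
# Illustrative variation-amplification-style loss (not the paper's EVAL).
import torch


def variation_amplification_loss(emo_frames, labels, margin=0.2):
    """emo_frames: (B, T, D) frame-level emotion embeddings.
    labels: (B,) with 1 = bonafide, 0 = spoof."""
    # Mean frame-to-frame change of the emotion trajectory.
    delta = (emo_frames[:, 1:] - emo_frames[:, :-1]).norm(dim=-1).mean(dim=1)
    bona = labels.float()
    # Penalize variation for bonafide; push spoof variation above a margin,
    # so abrupt emotion changes become a more salient detection signal.
    loss = bona * delta + (1 - bona) * torch.relu(margin - delta)
    return loss.mean()
```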
Datasets
ASVspoof 2019LA, ASVspoof 2021LA
Model(s)
WavLM (fine-tuned with AASIST), Emotion2vec, SincNet, Graph Attention Networks (GAT)
Author countries
China