Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Authors: Miao Liu, Fangda Wei, Jing Wang, Xinyuan Qian

Published: 2026-04-14 12:20:07+00:00

Comment: Submitted to ACMMM 2026

AI Summary

This paper introduces the task of Listening Deepfake Detection (LDD), which targets manipulated listening reactions rather than speaking-centric forgeries. To address it, the authors present ListenForge, the first dataset designed specifically for LDD, and propose MANet, a Motion-aware and Audio-guided Network. MANet captures subtle motion inconsistencies in listener videos and leverages the speaker's audio semantics, outperforming existing speaking deepfake detection models.

Abstract

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.


Key findings
Existing Speaking Deepfake Detection (SDD) models perform poorly on the LDD task, highlighting the distinct nature of listening forgeries. The proposed MANet significantly outperforms all baseline and retrained SDD models on the ListenForge dataset, achieving high AUC and accuracy (ACC). Ablation studies confirm the effectiveness of both the Motion-Aware Module, which captures visual artifacts, and the Audio-Guided Module, which exploits cross-modal consistency, underscoring the importance of their tailored design.
Approach
The proposed MANet employs a Motion-Aware Module (MAM) that captures subtle temporal inconsistencies and facial-expression cues in listener videos by computing temporal differences between frames and applying attention over them. An Audio-Guided Module (AGM) then fuses these visual features with the speaker's audio semantics, using the speaker's audio as a guiding query over the listener's visual responses, so that contextual inconsistencies in the interaction can be detected (a minimal sketch of this two-stage structure follows).
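The PyTorch sketch below illustrates this frame-difference-plus-attention pipeline with audio-guided cross-attention fusion, based only on the description above. The class names (MotionAwareModule, AudioGuidedModule, MANetSketch), feature dimensions, head count, and the exact attention formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionAwareModule(nn.Module):
    """Sketch: temporal frame differences followed by self-attention over motion cues."""
    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats):                          # (B, T, D) per-frame visual features
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]     # (B, T-1, D) temporal differences
        attended, _ = self.attn(diffs, diffs, diffs)         # attend over motion inconsistencies
        return self.norm(attended + diffs)                   # (B, T-1, D) motion-aware features

class AudioGuidedModule(nn.Module):
    """Sketch: speaker audio as query, listener motion features as key/value."""
    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, audio_feats, motion_feats):            # (B, Ta, D), (B, Tv, D)
        fused, _ = self.cross_attn(audio_feats, motion_feats, motion_feats)
        return self.norm(fused + audio_feats)                # (B, Ta, D) fused representation

class MANetSketch(nn.Module):
    """Toy pipeline: motion module -> audio-guided fusion -> real/fake logit."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mam = MotionAwareModule(feat_dim)
        self.agm = AudioGuidedModule(feat_dim)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats, audio_feats):
        motion = self.mam(frame_feats)
        fused = self.agm(audio_feats, motion)
        return self.head(fused.mean(dim=1))                  # (B, 1) forgery score

# Toy usage with random tensors standing in for ResNet / Wav2vec 2.0 features.
model = MANetSketch()
score = model(torch.randn(2, 16, 512), torch.randn(2, 24, 512))
print(score.shape)  # torch.Size([2, 1])
```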
Datasets
ListenForge (constructed from ViCo and NoXi corpora), FaceForensics++ (for comparative analysis of SDD models).
Model(s)
MANet (Motion-aware and Audio-guided Network), ResNet (for visual feature extraction, pretrained on ImageNet1K_V1), Wav2vec 2.0 (for audio feature extraction, pretrained on LibriSpeech).
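As a rough illustration of how such pretrained backbones could be loaded for feature extraction with torchvision and torchaudio: the ResNet depth (resnet18) and the WAV2VEC2_BASE bundle are assumptions, since the summary only names the backbones and their pretraining data (ImageNet1K_V1 weights, LibriSpeech).

```python
import torch
import torchaudio
from torchvision.models import resnet18, ResNet18_Weights

# Visual backbone: ResNet with ImageNet1K_V1 weights, classifier head removed.
# The depth (resnet18) is an assumption; the summary only says "ResNet".
visual_backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
visual_backbone.fc = torch.nn.Identity()           # keep 512-d pooled features
visual_backbone.eval()

# Audio backbone: Wav2vec 2.0 base, self-supervised pretraining on LibriSpeech.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
audio_backbone = bundle.get_model().eval()

with torch.no_grad():
    frames = torch.randn(16, 3, 224, 224)          # 16 listener frames (toy input)
    frame_feats = visual_backbone(frames)          # (16, 512) per-frame features

    waveform = torch.randn(1, bundle.sample_rate)  # 1 s of speaker audio (toy input)
    audio_feats, _ = audio_backbone.extract_features(waveform)
    print(frame_feats.shape, audio_feats[-1].shape)  # last layer: (1, T', 768)
```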
Author countries
China