MSCT: Differential Cross-Modal Attention for Deepfake Detection
Authors: Fangda Wei, Miao Liu, Yingxue Wang, Jing Wang, Shenghui Zhao, Nan Li
Published: 2026-04-09 02:56:16+00:00
Comment: Accpeted by ICASSP2026
AI Summary
This paper proposes a Multi-Scale Cross-modal Transformer Encoder (MSCT) for audio-visual deepfake detection, addressing issues of insufficient feature extraction and modal alignment deviation in traditional methods. The MSCT integrates multi-scale self-attention to capture adjacent embedding features and differential cross-modal attention to enhance multi-modal feature fusion. Experiments on the FakeAVCeleb dataset validate the effectiveness of the proposed architecture.
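The multi-scale self-attention described above is not specified in detail here; a minimal sketch of one common realization, assuming "multi-scale" means pooling adjacent embeddings over several non-overlapping window sizes before attending over the combined token set (all function names and the choice of mean pooling are illustrative assumptions, not the authors' exact design):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_tokens(x, scales=(1, 2, 4)):
    """Pool adjacent embeddings at several window sizes and concatenate.

    x: (T, D) sequence of frame/patch embeddings. For each scale s,
    average every s adjacent embeddings (non-overlapping windows),
    producing progressively coarser tokens that summarize neighborhoods.
    """
    outs = []
    for s in scales:
        t = (x.shape[0] // s) * s          # drop a ragged tail, if any
        outs.append(x[:t].reshape(-1, s, x.shape[1]).mean(axis=1))
    return np.concatenate(outs, axis=0)    # (T + T//2 + T//4, D) for defaults

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over the token set.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

# Usage: 8 embeddings of dimension 4 -> 8 + 4 + 2 = 14 multi-scale tokens.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
tokens = multi_scale_tokens(x)
W = rng.standard_normal((4, 4))
out = self_attention(tokens, W, W, W)
```

Attending over the pooled tokens lets fine-grained artifacts and neighborhood-level inconsistencies interact in a single attention map.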
Abstract
Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, since such traces arise from inconsistencies between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention module that integrates the features of adjacent embeddings and a differential cross-modal attention module that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
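The abstract does not give the exact form of the differential cross-modal attention; a minimal sketch under the assumption that it follows the differential-attention idea (subtracting a second, independently parameterized attention map to cancel common-mode noise), with queries drawn from one modality and keys/values from the other. All parameter names, the subtraction coefficient `lam`, and the single-head form are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_cross_modal_attention(q_mod, kv_mod, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential cross-modal attention (assumed formulation).

    q_mod:  (Tq, D) tokens of the query modality (e.g. audio).
    kv_mod: (Tk, D) tokens of the other modality (e.g. video).
    Two attention maps are computed with separate projections; the
    second, scaled by lam, is subtracted from the first before
    aggregating the values, suppressing attention noise shared by both.
    """
    Q1, K1 = q_mod @ Wq1, kv_mod @ Wk1
    Q2, K2 = q_mod @ Wq2, kv_mod @ Wk2
    V = kv_mod @ Wv
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V  # (Tq, D) fused features

# Usage: 5 audio tokens attending over 7 video tokens, dimension 8.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))
video = rng.standard_normal((7, 8))
Ws = [rng.standard_normal((8, 8)) for _ in range(5)]
fused = diff_cross_modal_attention(audio, video, *Ws)
```

A symmetric block with the roles of the modalities swapped would yield video features conditioned on audio, and the two outputs could then be combined for classification.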