MSCT: Differential Cross-Modal Attention for Deepfake Detection

Authors: Fangda Wei, Miao Liu, Yingxue Wang, Jing Wang, Shenghui Zhao, Nan Li

Published: 2026-04-09 02:56:16+00:00

Comment: Accepted by ICASSP 2026

AI Summary

This paper proposes a Multi-Scale Cross-modal Transformer Encoder (MSCT) for audio-visual deepfake detection, addressing issues of insufficient feature extraction and modal alignment deviation in traditional methods. The MSCT integrates multi-scale self-attention to capture adjacent embedding features and differential cross-modal attention to enhance multi-modal feature fusion. Experiments on the FakeAVCeleb dataset validate the effectiveness of the proposed architecture.

Abstract

Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention module that integrates the features of adjacent embeddings and a differential cross-modal attention module that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.


Key findings
The proposed MSCT achieved competitive performance on the FakeAVCeleb dataset, demonstrating a classification accuracy of 98.75% and an AUC score of 98.83%. Ablation studies confirmed that both the differential cross-modal attention (DCA) and multi-scale self-attention (MSSA) modules contribute to the model's performance improvement, with DCA providing a more significant gain.
Approach
The authors propose a Multi-Scale Cross-modal Transformer Encoder (MSCT) that integrates two novel attention modules. It uses a multi-scale self-attention module to extract multi-scale temporal features by adaptively integrating information from adjacent embeddings, and a differential cross-modal attention module that leverages attention matrix differences to better focus on forgery cues and improve compatibility with cross-modal alignment loss.
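The differential cross-modal attention idea can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: it follows the common differential-attention recipe (subtracting two softmax attention maps computed from split query/key halves, weighted by a hypothetical scalar `lam`), with audio-side queries attending over video-side keys and values. The function name, shapes, and `lam` are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_cross_attention(q_audio, k_video, v_video, lam=0.5):
    """Hypothetical sketch of differential cross-modal attention.

    q_audio: (T_a, 2d) audio queries; k_video: (T_v, 2d) video keys;
    v_video: (T_v, d_v) video values. Queries and keys are split into
    two halves; the difference of the two resulting attention maps
    re-weights the video values, which is intended to cancel common-mode
    attention noise and emphasize modality inconsistencies (forgery cues).
    """
    d = q_audio.shape[-1] // 2
    q1, q2 = q_audio[..., :d], q_audio[..., d:]
    k1, k2 = k_video[..., :d], k_video[..., d:]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    attn = a1 - lam * a2          # differential attention map
    return attn @ v_video         # (T_a, d_v) fused audio-queried features
```

In a full model, a symmetric branch would use video queries over audio keys/values, and the fused features would feed the transformer encoder and the cross-modal alignment loss.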
Datasets
FakeAVCeleb
Model(s)
Multi-scale Cross-modal Transformer Encoder (MSCT), Transformer Encoder, Res2Net (adapted visual pre-encoder), Wavelet convolution module, Convolutional Block Attention Module (CBAM)
Author countries
China