X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Authors: Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh

Published: 2026-03-09 15:18:42+00:00

AI Summary

This paper proposes X-AVDT, a robust and generalizable deepfake detector that leverages internal audio-visual signals from generative models, accessed via DDIM inversion. X-AVDT extracts a video composite capturing inversion-induced discrepancies and an audio-visual cross-attention feature reflecting modality alignment. The research also introduces MMDF, a new multimodal deepfake dataset, demonstrating X-AVDT's leading performance and strong generalization to external benchmarks and unseen generators.
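The summary mentions two mechanics worth unpacking: deterministic DDIM inversion (mapping a video's latents back toward noise through the generator) and a composite built from inversion-induced discrepancies. As a minimal numpy sketch of the general technique, not the paper's implementation, one DDIM inversion step (the eta = 0 update run toward higher noise) and a simple per-pixel residual might look like this; `eps_pred` stands in for the U-Net's noise prediction, and the cumulative signal rates `alpha_t`, `alpha_next` are assumed inputs:

```python
import numpy as np

def ddim_inversion_step(x_t, eps_pred, alpha_t, alpha_next):
    """One deterministic DDIM step (eta = 0) run in the inverting direction:
    move a latent from cumulative signal rate alpha_t to alpha_next.
    eps_pred is the model's noise estimate at this step (assumed given)."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    return np.sqrt(alpha_next) * x0_pred + np.sqrt(1.0 - alpha_next) * eps_pred

def inversion_residual(x, x_rec):
    # Per-pixel discrepancy between a frame and its invert-then-reconstruct
    # round trip; the paper's "video composite" is built from such signals.
    return np.abs(x - x_rec)
```

Note that with `alpha_next == alpha_t` the step is the identity, a quick sanity check that the update is self-consistent.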

Abstract

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
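The abstract's key insight is that the generator's internal cross-attention layers align visual tokens with audio tokens, and that this alignment map itself is a detection cue. A minimal sketch of standard scaled dot-product cross-attention with visual queries attending over audio keys/values follows; the function name and shapes are illustrative assumptions, not the paper's actual layer:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def av_cross_attention(visual_q, audio_k, audio_v):
    """Visual tokens (queries, shape (Nv, d)) attend over audio tokens
    (keys/values, shape (Na, d)), as in audio-conditioned diffusion U-Nets.
    Returns the attended features and the (Nv, Na) alignment map."""
    d = visual_q.shape[-1]
    attn = softmax(visual_q @ audio_k.T / np.sqrt(d))
    return attn @ audio_v, attn
```

In the detector described here, it is the alignment map (the `attn` matrix) rather than the attended output that carries the speech-motion correspondence cue.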


Key findings
X-AVDT achieved leading performance on the MMDF dataset and generalized strongly to external benchmarks and unseen deepfake generators, outperforming existing methods with an average accuracy improvement of 13.1%. The audio-conditioned cross-attention features were consistently the most informative, and combining the video composite with the AV cross-attention feature proved crucial for detection reliability and robustness to perturbations.
Approach
X-AVDT probes generator-internal audio-visual signals by performing DDIM inversion on input videos. It extracts two complementary cues: a video composite revealing inversion-induced discrepancies and an audio-visual cross-attention feature reflecting modality alignment enforced during generation. These features are then fused and processed by a detection network trained with a combined binary cross-entropy and triplet loss.
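The training objective combines binary cross-entropy with a triplet loss. A minimal numpy sketch of that combination follows; the weighting `lam` and margin values are assumptions for illustration, as the paper's exact hyperparameters are not given here:

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    # Binary cross-entropy on predicted fake probabilities p vs labels y.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Margin-based triplet loss on embeddings: pull same-class pairs
    # together, push real/fake pairs apart by at least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

def detection_loss(p, y, anchor, positive, negative, lam=1.0, margin=0.2):
    # Combined objective; lam (assumed) weights the triplet term.
    return bce_loss(p, y) + lam * triplet_loss(anchor, positive, negative, margin)
```

The triplet term shapes the embedding space so that real and fake samples separate even when the classifier head is uncertain, which is a common motivation for pairing it with BCE.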
Datasets
MMDF (Multi-modal, Multi-generator DeepFake dataset), FakeAVCeleb, FaceForensics++, DeepSpeak v1.0, KoDF, Deepfake-Eval-2024
Model(s)
Audio-conditioned Latent Diffusion Model (Hallo, initialized from Stable Diffusion), wav2vec 2.0 (for audio embeddings), 3D U-Net (for diffusion process), 3D ResNeXt (for feature encoders and fusion decoder)
Author countries
South Korea