Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization

Authors: Qingcao Li, Miao He, Liang Yi, Qing Wen, Yitao Zhang, Hongshuo Jin, Peng Cheng, Zhongjie Ba, Li Lu, Kui Ren

Published: 2026-01-30 13:47:42+00:00

Comment: The 3rd Place, IJCAI 2025 Workshop on Deepfake Detection, Localization, and Interpretability

AI Summary

This paper proposes a two-stage system for multimodal video deepfake detection and localization, developed for Track 2 of the DDL Challenge. It integrates an audio deepfake detection and localization module with an image-based detection and localization module. A multimodal score fusion strategy is then employed to effectively combine the outputs from both modalities, leveraging their complementary information to enhance detection robustness and localization accuracy.

Abstract

This paper presents a system for detecting fake audio-visual content (i.e., video deepfake), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework, comprising unimodal detection and multimodal score fusion. Specifically, it incorporates an audio deepfake detection module and an audio localization module to analyze and pinpoint manipulated segments in the audio stream. In parallel, an image-based deepfake detection and localization module is employed to process the visual modality. To effectively leverage complementary information across different modalities, we further propose a multimodal score fusion strategy that integrates the outputs from both audio and visual modules. Guided by a detailed analysis of the training and evaluation dataset, we explore and evaluate several score calculation and fusion strategies to improve system robustness. Overall, the final fusion-based system achieves an AUC of 0.87, an AP of 0.55, and an AR of 0.23 on the challenge test set, resulting in a final score of 0.5528.


Key findings
The final fusion-based system achieved an AUC of 0.87, an AP of 0.55, and an AR of 0.23 on the challenge test set. Multimodal fusion markedly improved video-level authenticity verification, yielding a 7% increase in AUC over audio-only detection. For localization, the interval-wise fusion strategy also outperformed single-modality baselines, particularly when paired with a retrained audio localization model.
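The summary does not specify the exact interval-wise fusion rule. One simple variant, sketched here as an illustration (the function name and the union rule are assumptions, not the authors' method), merges the fake segments predicted independently by the audio and visual localizers into a single set of non-overlapping intervals:

```python
def union_intervals(audio_ivals, visual_ivals):
    """Merge predicted fake segments (start, end) from both modalities
    into a sorted list of non-overlapping intervals (their union)."""
    ivals = sorted(audio_ivals + visual_ivals)
    merged = []
    for start, end in ivals:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: audio flags [0.0, 1.0] and [3.0, 4.0], visual flags [0.5, 2.0].
print(union_intervals([(0.0, 1.0), (3.0, 4.0)], [(0.5, 2.0)]))
# → [(0.0, 2.0), (3.0, 4.0)]
```

A union rule favors recall (any modality can flag a segment); an intersection rule would instead favor precision. Which trade-off the authors chose is not stated here.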
Approach
The system utilizes a two-stage framework, starting with unimodal detection and localization for both audio and visual streams independently. For audio, it includes a detection module and a boundary-aware localization module; for visual, an image-based detection and localization module. The outputs from these unimodal modules are then integrated via distinct multimodal score fusion strategies for overall video detection and temporal localization.
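For video-level detection, the score fusion step can be pictured as combining the per-video fake scores from the two unimodal detectors. The weighting scheme below is a minimal illustrative sketch, not the paper's actual strategy (the weight `w_audio` is an assumed free parameter):

```python
def fuse_scores(audio_score: float, visual_score: float,
                w_audio: float = 0.5) -> float:
    """Convex combination of per-video fake scores from the two modalities.

    audio_score, visual_score: scores in [0, 1] from the unimodal detectors.
    w_audio: assumed fusion weight for the audio modality (hypothetical).
    """
    return w_audio * audio_score + (1.0 - w_audio) * visual_score

# A video scored 0.8 by the audio detector and 0.4 by the visual detector
# fuses to 0.6 under equal weighting.
print(fuse_scores(0.8, 0.4))
```

In practice such weights would be tuned on the validation set; the paper reports exploring several score calculation and fusion strategies rather than a single fixed rule.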
Datasets
DDL Challenge (training, validation, and test sets), MUSAN (for audio augmentation), PartialSpoof (for pre-training audio localization model).
Model(s)
Audio deepfake detection: Wav2Vec2.0-AASIST (XLS-R front-end with an AASIST back-end). Audio localization: Boundary-aware Attention Mechanism (BAM) with WavLM-Large. Image detection/localization: Xception.
Author countries
China