DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection

Authors: Kangran Zhao, Yupeng Chen, Xiaoyu Zhang, Yize Chen, Weinan Guan, Baicheng Chen, Chengzhe Sun, Soumyya Kanti Datta, Qingshan Liu, Siwei Lyu, Baoyuan Wu

Published: 2025-10-26 10:40:52+00:00

AI Summary

This paper introduces Mega-MMDF, a new large-scale, high-quality multimodal deepfake dataset containing 1.1 million forged audiovisual samples generated via 21 diverse forgery pipelines. Building on this dataset, the authors develop DeepfakeBench-MM, the first unified benchmark standardizing training and evaluation protocols for multimodal deepfake detection (MM-DFD). Comprehensive experiments using DeepfakeBench-MM uncover critical insights into cross-dataset generalization, modality bias, and the effectiveness of finetuning in MM-DFD models.

Abstract

The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinders deeper exploration. To address these challenges, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructures for advancing multimodal deepfake detection.


Key findings
The benchmark reveals that advanced multimodal detectors achieve only marginal gains over strong baselines and suffer from poor cross-dataset generalization. Finetuning the second-stage backbones in two-phase training consistently improves performance, contrary to common practice. Furthermore, detectors show a consistent bias toward the visual modality; modality masking during training is proposed as a promising mitigation strategy.
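The modality-masking idea can be sketched roughly as follows. This is an illustrative toy implementation, not the paper's actual code: the function name, masking probability, and zero-fill strategy are assumptions; the core idea is that randomly suppressing one modality during training forces the detector to exploit the other.

```python
import random

def mask_modalities(audio_feat, visual_feat, p_mask=0.3, rng=random):
    """Hypothetical sketch of modality masking during training.

    With probability p_mask, replace either the audio or the visual
    features with zeros, so the detector cannot rely solely on the
    (typically dominant) visual modality.
    """
    if rng.random() < p_mask:
        if rng.random() < 0.5:
            audio_feat = [0.0] * len(audio_feat)    # drop audio modality
        else:
            visual_feat = [0.0] * len(visual_feat)  # drop visual modality
    return audio_feat, visual_feat
```

In a real pipeline this would operate on feature tensors inside the training loop, with the masking probability tuned so the model still sees both modalities together most of the time.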
Approach
The research addresses the lack of standardized infrastructure by first building the Mega-MMDF dataset with high diversity, combining 28 forgery methods (10 audio, 12 visual, and 6 audio-driven face reenactment) into 21 forgery pipelines and integrating a quality control mechanism. The authors then establish DeepfakeBench-MM, a unified benchmark supporting 5 datasets and 11 detectors, which standardizes data preprocessing and evaluation protocols for fair comparison and analysis.
Datasets
Mega-MMDF, FakeAVCeleb, LAV-DF, AVDeepfake1M, IDForge
Model(s)
Baseline (C3D-ResNet18, SE-ResNet18), Ensemble, AVTS, MRDF, AVFF, MDS, FRADE, AVAD, AVH, Qwen2.5-Omni (MLLM), VideoLLaMA2 (MLLM)
Author countries
China, USA