Identity-Driven Multimedia Forgery Detection via Reference Assistance

Authors: Junhao Xu, Jingjing Chen, Xue Song, Feng Han, Haijun Shan, Yugang Jiang

Published: 2024-01-22 08:59:09+00:00

AI Summary

This paper introduces IDForge, an identity-driven multimedia forgery dataset featuring 249,138 video shots of 54 celebrities, with fake shots covering 9 types of manipulation across visual, audio, and textual modalities, along with a reference set of 214,438 real video shots. The authors also propose the Reference-assisted Multimodal Forgery Detection Network (R-MFDN), which detects deepfake videos by leveraging identity information and cross-modal inconsistencies.

Abstract

Recent advancements in deepfake techniques have paved the way for generating various media forgeries. In response to the potential hazards of these media forgeries, many researchers are exploring detection methods, increasing the demand for high-quality media forgery datasets. Despite this, existing datasets have certain limitations. Firstly, most datasets focus on manipulating the visual modality and usually lack diversity, as only a few forgery approaches are considered. Secondly, the quality of the media is often inadequate in clarity and naturalness, and the size of the datasets is also limited. Thirdly, it is commonly observed that real-world forgeries are motivated by identity, yet the identity information of the individuals portrayed in these forgeries remains under-explored in existing datasets. For detection, identity information could be an essential clue to boost performance. Moreover, official media concerning relevant identities on the Internet can serve as prior knowledge, aiding both the audience and forgery detectors in determining the true identity. Therefore, we propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots sourced from 324 wild videos of 54 celebrities collected from the Internet. The fake video shots involve 9 types of manipulation across visual, audio, and textual modalities. Additionally, IDForge provides an extra 214,438 real video shots as a reference set for the 54 celebrities. Correspondingly, we propose the Reference-assisted Multimodal Forgery Detection Network (R-MFDN), aiming at the detection of deepfake videos. Through extensive experiments on the proposed dataset, we demonstrate the effectiveness of R-MFDN on the multimedia detection task.


Key findings
The proposed R-MFDN achieves state-of-the-art performance on the IDForge dataset, demonstrating the effectiveness of combining multimodal and identity information. Ablation studies show that identity-aware contrastive learning and cross-modal contrastive learning each contribute significant improvements. R-MFDN outperforms the baselines on both binary and multi-label deepfake detection, and it also shows superior performance on the FakeAVCeleb dataset.
Approach
R-MFDN uses modality-specific encoders to extract visual, audio, and textual features, followed by a progressive multimodal feature fusion module. It adds two contrastive objectives to enhance feature learning: identity-aware contrastive learning, which uses a reference set of real videos for identity comparison, and cross-modal contrastive learning, which captures inconsistencies between modalities.
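To make the two contrastive objectives more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: this summary does not give the loss functions, so an InfoNCE-style loss stands in for cross-modal contrastive learning and a simple margin loss stands in for identity-aware contrastive learning. The function names, the temperature, the margin, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, temperature=0.1):
    """InfoNCE-style stand-in for a cross-modal contrastive loss.

    anchors[i] (e.g. a visual embedding) should match candidates[i]
    (e.g. the audio embedding of the same clip) and mismatch every
    other candidates[j]. The temperature value is an assumption.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def identity_margin_loss(video, reference, is_real, margin=0.5):
    """Margin-based stand-in for an identity-aware contrastive loss.

    A real video is pulled toward the reference embedding of its
    identity; a fake one is penalized only while its similarity to
    the reference stays above the (assumed) margin.
    """
    similarity = cosine(video, reference)
    if is_real:
        return 1.0 - similarity
    return max(0.0, similarity - margin)


# Toy embeddings: aligned audio matches the visual stream, swapped does not.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]
```

With these toy inputs, `info_nce(visual, audio_aligned)` is smaller than `info_nce(visual, audio_swapped)`, which is the behavior a cross-modal consistency objective is meant to reward. In a real training setup both terms would be computed on learned encoder outputs and combined with a classification loss, and how they are weighted is not specified in this summary.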
Datasets
IDForge (proposed), FakeAVCeleb, VoxCeleb2 (used to build the reference set for FakeAVCeleb)
Model(s)
Reference-assisted Multimodal Forgery Detection Network (R-MFDN), which includes transformer-based encoders for visual and audio features, and BERT for textual features. The paper also mentions VideoMAE and Xception for feature extraction in visualizations.
Author countries
China