Towards General Visual-Linguistic Face Forgery Detection

Authors: Ke Sun, Shen Chen, Taiping Yao, Haozhe Yang, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

Published: 2023-07-31 10:22:33+00:00

AI Summary

This paper introduces Visual-Linguistic Face Forgery Detection (VLFFD) to enhance deepfake detection by incorporating fine-grained, sentence-level prompts as supervision, addressing the limitations of coarse binary labels. The approach utilizes a Prompt Forgery Image Generator (PFIG) to create mixed forgery images with semantic annotations and a Coarse-and-Fine Co-training framework (C2F) to jointly train multimodal encoders, leading to improved generalization and interpretability.

Abstract

Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address these issues, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via a Prompt Forgery Image Generator (PFIG). Then, the model is jointly trained on the fine-grained mixed data and the coarse-grained original data with the Coarse-and-Fine Co-training framework (C2F), enabling it to gain better generalization and interpretability. Experiments show that the proposed method improves existing detection models on several challenging benchmarks. Furthermore, we have integrated our method with multimodal large models, achieving noteworthy results that demonstrate the potential of our approach. This integration not only enhances the performance of our VLFFD paradigm but also underscores the versatility and adaptability of our method when combined with advanced multimodal technologies, highlighting its potential in tackling the evolving challenges of deepfake detection.


Key findings
The VLFFD method significantly outperforms state-of-the-art detectors on challenging cross-dataset and cross-manipulation benchmarks, demonstrating improved generalization and interpretability. Leveraging fine-grained linguistic supervision and multimodal learning notably boosts performance, with PFIG-generated data increasing AUC by 3% and CLIP-based encoders showing strong advantages. The approach also successfully enhances multimodal Large Language Models, enabling them to provide detailed reasoning for deepfake detection.
Approach
The Visual-Linguistic Face Forgery Detection (VLFFD) approach uses a Prompt Forgery Image Generator (PFIG) to generate mixed forgery images paired with fine-grained, sentence-level prompts that describe the forgery region and type. It then employs a Coarse-and-Fine Co-training framework (C2F) to jointly train the image and text encoders on both the fine-grained generated data and the coarse-grained original data, using multimodal contrastive learning to improve generalization and interpretability.
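The co-training idea can be illustrated with a minimal NumPy sketch. It is not the paper's exact objective: it assumes a CLIP-style symmetric contrastive loss on the fine-grained image/prompt pairs and a plain binary cross-entropy on the coarse real/fake labels, combined with a hypothetical weighting factor `weight`. All function names, the temperature value, and the assumption that each prompt in a batch matches only its own image are illustrative choices, not details taken from the source.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of img_emb is paired with row i of txt_emb."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def binary_cross_entropy(logits, labels):
    """Stable BCE on raw logits; labels are 1.0 for fake and 0.0 for real."""
    return np.mean(np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

def co_training_loss(fine_img, fine_txt, coarse_logits, coarse_labels, weight=1.0):
    """Hypothetical combined objective: fine-grained contrastive term
    plus a weighted coarse-grained binary term."""
    fine_term = contrastive_loss(fine_img, fine_txt)
    coarse_term = binary_cross_entropy(coarse_logits, coarse_labels)
    return fine_term + weight * coarse_term

# Toy batch: random embeddings stand in for encoder outputs.
rng = np.random.default_rng(0)
fine_img = rng.normal(size=(4, 16))
fine_txt = fine_img + 0.1 * rng.normal(size=(4, 16))  # roughly aligned pairs
coarse_logits = np.array([2.0, -1.5, 0.3, -0.7])
coarse_labels = np.array([1.0, 0.0, 1.0, 0.0])
total = co_training_loss(fine_img, fine_txt, coarse_logits, coarse_labels)
```

In this sketch the contrastive term pulls each image embedding toward its own prompt and away from the other prompts in the batch, while the binary term keeps the usual real/fake signal from the original data. A real implementation would compute both terms on encoder outputs inside an autodiff framework rather than on fixed arrays.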
Datasets
FaceForensics++, DFDC-P, DFD, Celeb-DF, Wild-Deepfake
Model(s)
CLIP (with ViT-L as image encoder), Xception, EfficientNet-B4 (EN-B4), ViT-B, MiniGPT-4 (for LLM integration)
Author countries
China