Towards General Visual-Linguistic Face Forgery Detection

Authors: Ke Sun, Shen Chen, Taiping Yao, Haozhe Yang, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

Published: 2023-07-31 10:22:33+00:00

AI Summary

This paper proposes Visual-Linguistic Face Forgery Detection (VLFFD), a novel paradigm using fine-grained sentence-level prompts as annotations for deepfake detection. VLFFD generates mixed forgery images with corresponding prompts and uses a Coarse-and-Fine Co-training framework to improve generalization and interpretability, outperforming existing methods on various benchmarks.

Abstract

Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address these issues, in this paper we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via the Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain better generalization and interpretability. The experiments show that the proposed method improves existing detection models on several challenging benchmarks. Furthermore, we have integrated our method with multimodal large models, achieving noteworthy results that demonstrate the potential of our approach. This integration not only enhances the performance of our VLFFD paradigm but also underscores the versatility and adaptability of our method when combined with advanced multimodal technologies, highlighting its potential in tackling the evolving challenges of deepfake detection.
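The abstract does not spell out how PFIG builds a mixed forgery image and its prompt, so the following is only a rough sketch of one plausible reading: copy chosen facial regions from a forged image into a real one and write a sentence naming those regions. The region boxes, the prompt wording, and the names `REGIONS` and `mix_forgery` are all illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values

# Hypothetical region boxes as (top, left, bottom, right) on a 4x4 toy image.
# A real pipeline would presumably derive these from facial landmarks.
REGIONS: Dict[str, Tuple[int, int, int, int]] = {
    "eyes": (0, 0, 1, 4),
    "nose": (1, 1, 3, 3),
    "mouth": (3, 0, 4, 4),
}


def mix_forgery(real: Image, fake: Image, parts: List[str]) -> Tuple[Image, str]:
    """Paste the named regions of `fake` into a copy of `real` and describe them."""
    mixed = [row[:] for row in real]  # copy so the real image is left untouched
    for part in parts:
        top, left, bottom, right = REGIONS[part]
        for r in range(top, bottom):
            for c in range(left, right):
                mixed[r][c] = fake[r][c]
    # Assumed prompt template; the paper's actual wording may differ.
    if parts:
        prompt = "This is a fake face. The forged parts are: " + ", ".join(parts) + "."
    else:
        prompt = "This is a real face."
    return mixed, prompt
```

Calling `mix_forgery(real, fake, ["mouth"])` on two 4x4 toy images would replace only the bottom row and return a prompt that names the mouth, which is the kind of image-sentence pair the fine-grained branch is described as training on.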

Key findings

VLFFD significantly outperforms state-of-the-art methods on multiple deepfake detection benchmarks, demonstrating improved generalization and interpretability. Integration with MiniGPT-4 further enhances performance and provides detailed reasoning for its classifications. The fine-grained language information and multimodal learning framework are shown to be crucial for this improvement.

Approach

VLFFD builds on a pre-trained multimodal model (CLIP). Its Prompt Forgery Image Generator (PFIG) produces mixed forgery images paired with fine-grained text annotations. These mixed data are then trained jointly with the coarse-grained original data under the Coarse-and-Fine Co-training (C2F) framework to improve generalization and interpretability.
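As a hedged illustration of what joint coarse-and-fine training could look like with CLIP-style embeddings, the sketch below scores each image embedding against a set of prompt embeddings by cosine similarity and applies a cross-entropy loss, once with coarse prompts (for example "real" and "fake") and once with fine-grained prompts. The weighted-sum objective, the temperature value, and the names `prompt_cross_entropy`, `co_training_loss`, and `fine_weight` are assumptions made for this example; the summary does not state the paper's actual loss.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_cross_entropy(image_emb: Vector, prompt_embs: List[Vector],
                         target: int, temperature: float = 0.07) -> float:
    """Cross-entropy of the softmax over image-to-prompt similarities."""
    logits = [cosine(image_emb, p) / temperature for p in prompt_embs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]


def co_training_loss(coarse_batch: List[Tuple[Vector, int]],
                     fine_batch: List[Tuple[Vector, int]],
                     coarse_prompts: List[Vector],
                     fine_prompts: List[Vector],
                     fine_weight: float = 1.0) -> float:
    """Assumed objective: mean coarse loss plus weighted mean fine loss."""
    coarse = sum(prompt_cross_entropy(e, coarse_prompts, t)
                 for e, t in coarse_batch) / len(coarse_batch)
    fine = sum(prompt_cross_entropy(e, fine_prompts, t)
               for e, t in fine_batch) / len(fine_batch)
    return coarse + fine_weight * fine
```

In this sketch each batch item is an `(image_embedding, index_of_matching_prompt)` pair; in practice the embeddings would come from the CLIP image and text encoders, and both branches would share those encoders so that the fine-grained supervision shapes the same representation used for the coarse real/fake decision.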

Datasets

FaceForensics++, DFDC-P, DFD, Celeb-DF, Wild-Deepfake

Model(s)

CLIP (ViT-L as image encoder), MiniGPT-4 (for integration with multimodal LLMs)

Author countries

China
