Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning


Authors: Zhiyuan Yan, Yandan Zhao, Shen Chen, Mingyi Guo, Xinghe Fu, Taiping Yao, Shouhong Ding, Li Yuan

Published: 2024-08-30 07:49:57+00:00

AI Summary

This paper addresses challenges in deepfake video detection by proposing a novel Video-level Blending (VB) data synthesis method to capture the Facial Feature Drift (FFD) artifact and a lightweight Spatiotemporal Adapter (StA) to efficiently integrate spatial and temporal features from pre-trained image models.

Abstract

Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture both spatial and temporal features jointly and efficiently. StA uses a two-stream 3D-Conv design with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods and show that our approach generalizes well to previously unseen forgery videos, even those produced by the latest generation methods.
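The abstract describes VB as blending each original frame with a warped version of itself, frame by frame. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several pieces are assumptions made for illustration: the warp is replaced by a small per-frame translation (np.roll), the blended region is a feathered rectangle rather than a landmark-based facial region, frames are grayscale arrays of shape (T, H, W), and the names soft_box_mask and video_level_blend are hypothetical.

```python
import numpy as np


def soft_box_mask(height, width, box, feather=2):
    """Mask that is 1 inside box = (top, bottom, left, right) and fades
    linearly to 0 over `feather` pixels at the box edges."""
    top, bottom, left, right = box
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    # Depth inside the box along each axis (non-positive outside the box).
    dy = np.minimum(rows - top + 1, bottom - rows)
    dx = np.minimum(cols - left + 1, right - cols)
    return np.clip(np.minimum(dy, dx) / feather, 0.0, 1.0)


def video_level_blend(frames, box, shifts=None, max_shift=2, seed=0):
    """Blend every frame with a warped copy of itself inside `box`.

    frames: array of shape (T, H, W).
    shifts: optional list of (dy, dx) translations, one per frame. If None,
            an independent random shift is drawn for each frame, so the
            region drifts inconsistently over time.
    """
    frames = np.asarray(frames, dtype=np.float64)
    rng = np.random.default_rng(seed)
    if shifts is None:
        shifts = rng.integers(-max_shift, max_shift + 1, size=(len(frames), 2))
    mask = soft_box_mask(frames.shape[1], frames.shape[2], box)
    blended = np.empty_like(frames)
    for t, (frame, (dy, dx)) in enumerate(zip(frames, shifts)):
        # Assumption: a translation stands in for the paper's warp.
        warped = np.roll(frame, shift=(int(dy), int(dx)), axis=(0, 1))
        blended[t] = mask * warped + (1.0 - mask) * frame
    return blended, mask


# Usage: a static clip gains frame-to-frame drift only inside the masked box.
frame = np.random.default_rng(42).random((16, 16))
clip = np.tile(frame, (4, 1, 1))
demo_shifts = [(0, 0), (1, 0), (0, 1), (-1, -1)]
vb_clip, vb_mask = video_level_blend(clip, box=(4, 12, 4, 12), shifts=demo_shifts)
```

Because the input clip is static, any difference between consecutive output frames comes from the blending alone, which makes the simulated drift easy to inspect.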

Key findings

According to the authors' experiments, VB and StA improve deepfake detection accuracy and generalization across multiple datasets and forgery techniques. The approach outperforms prior state-of-the-art methods in both cross-dataset and cross-manipulation evaluations.

Approach

The authors propose a Video-level Blending (VB) data augmentation technique to simulate Facial Feature Drift (FFD), a subtle temporal inconsistency in deepfakes. They also introduce a Spatiotemporal Adapter (StA) that enhances pre-trained image models with temporal reasoning capabilities by adding only a small number of learnable parameters.
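To make the adapter idea concrete, here is a forward-pass-only NumPy sketch of a two-stream adapter in the spirit of the StA described in the abstract. It is a sketch under stated assumptions, not the paper's architecture: the bottleneck projections, the depthwise convolutions, the kernel shapes (1, k, k) for the spatial stream and (k, 1, 1) for the temporal stream, the ReLU, the zero-initialized up-projection, and the class name SpatiotemporalAdapterSketch are all illustrative choices. Biases and training are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def depthwise_conv3d(x, kernel):
    """Zero-padded, same-size depthwise 3D cross-correlation.

    x: (C, T, H, W) features; kernel: (C, kt, kh, kw) with odd sizes.
    """
    _, kt, kh, kw = kernel.shape
    pad = ((0, 0), (kt // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2)
    windows = sliding_window_view(np.pad(x, pad), (kt, kh, kw), axis=(1, 2, 3))
    # windows has shape (C, T, H, W, kt, kh, kw); reduce over the kernel axes.
    return np.einsum("cthwijk,cijk->cthw", windows, kernel)


class SpatiotemporalAdapterSketch:
    """Residual bottleneck adapter with separate spatial and temporal streams."""

    def __init__(self, dim, bottleneck=8, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (bottleneck, dim))
        # Assumed kernel shapes: spatial-only and temporal-only.
        self.spatial = rng.normal(0.0, 0.02, (bottleneck, 1, k, k))
        self.temporal = rng.normal(0.0, 0.02, (bottleneck, k, 1, 1))
        # Zero init (an assumption): the adapter starts as an identity mapping.
        self.up = np.zeros((dim, bottleneck))

    def __call__(self, x):
        # x: (dim, T, H, W) features from a frozen image backbone.
        z = np.einsum("bd,dthw->bthw", self.down, x)
        z = depthwise_conv3d(z, self.spatial) + depthwise_conv3d(z, self.temporal)
        z = np.maximum(z, 0.0)
        return x + np.einsum("db,bthw->dthw", self.up, z)

    def num_params(self):
        weights = (self.down, self.spatial, self.temporal, self.up)
        return sum(w.size for w in weights)


# Usage: adapt a (dim=16, T=4, H=5, W=5) feature volume.
adapter = SpatiotemporalAdapterSketch(dim=16, bottleneck=4)
features = np.random.default_rng(1).normal(size=(16, 4, 5, 5))
adapted = adapter(features)
```

In this sketch only the adapter weights would be trained while the backbone stays frozen, and the bottleneck keeps their count small relative to a full layer of width dim, which is one plausible reading of "a small number of learnable parameters".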

Datasets

FaceForensics++ (FF++) (c23 version), Celeb-DF-v2 (CDF-v2), DeepfakeDetection (DFD), Deepfake Detection Challenge (DFDC), Deepfake Detection Challenge Preview (DFDCP), DeeperForensics (DFo), WildDeepfake (WDF), FFIW, DF40

Model(s)

CLIP ViT-L/14, ResNet-34; other architectures are mentioned but not identified as primary models

Author countries

China
