Video Transformer for Deepfake Detection with Incremental Learning

Authors: Sohail A. Khan, Hang Dai

Published: 2021-08-11 16:22:56+00:00

AI Summary

This paper proposes a video transformer with incremental learning for deepfake detection. A 3D face reconstruction method generates a UV texture map from each input face image, and the model extracts features from both the aligned face images and their UV texture maps. Incremental learning improves the model's generalization ability, and the model achieves state-of-the-art performance on several public deepfake datasets.

Abstract

Deepfake face forgeries are widespread on the internet, which raises severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink, and mouth-movement information that cannot be perceived in the UV texture image, so we use both face images and their UV texture maps to extract the image features. We present an incremental learning strategy that fine-tunes the proposed model on a smaller amount of data and achieves better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer with incremental learning achieves state-of-the-art performance in the deepfake video detection task, with enhanced feature learning from the sequenced data.


Key findings

The proposed method achieves state-of-the-art performance on several deepfake detection benchmarks. The combination of UV texture maps and the video transformer significantly improves feature learning. Incremental learning enhances generalization, allowing the model to perform well on new datasets with limited additional training data.

Approach

The approach uses a video transformer architecture with a pre-trained XceptionNet backbone to extract features from both aligned face images and their corresponding UV texture maps. Incremental learning is employed to fine-tune the model on multiple datasets sequentially, enhancing its generalization capabilities.
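
The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame feature vectors are hypothetical stand-ins for backbone outputs, the two streams are fused by simple concatenation, attention uses a single head with identity projections, and only a small linear head is fine-tuned. All function names (fuse_streams, self_attention, video_embedding, predict_fake_prob, incremental_finetune) are made up for this sketch.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(face_feats, uv_feats):
    # Assumption: one token per frame, face features followed by UV-texture features.
    if len(face_feats) != len(uv_feats):
        raise ValueError("both streams need one feature vector per frame")
    return [f + u for f, u in zip(face_feats, uv_feats)]

def self_attention(tokens):
    # Single-head scaled dot-product attention with identity Q/K/V projections
    # (a simplification; a real transformer learns these projections).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def video_embedding(face_feats, uv_feats):
    # Attend across frames, then mean-pool to a single vector per video.
    attended = self_attention(fuse_streams(face_feats, uv_feats))
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(len(attended[0]))]

def predict_fake_prob(embedding, head):
    # Logistic head: probability that the video is fake.
    z = sum(w * x for w, x in zip(head["w"], embedding)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def incremental_finetune(head, datasets, lr=0.1, epochs=5):
    # Visit the datasets one after another, always continuing from the
    # current weights instead of re-initializing them.
    for samples in datasets:
        for _ in range(epochs):
            for embedding, label in samples:
                err = predict_fake_prob(embedding, head) - label  # log-loss gradient w.r.t. z
                head["w"] = [w - lr * err * x for w, x in zip(head["w"], embedding)]
                head["b"] -= lr * err
    return head
```

In this sketch, each element of datasets is a list of (embedding, label) pairs with label 1.0 for fake and 0.0 for real. Passing the datasets in order, for example a large first set followed by smaller subsets of later ones, mirrors the sequential fine-tuning idea; which parameters are updated at each stage in the paper is not stated in this summary.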

Datasets

FaceForensics++, DeepFake Detection Challenge (DFDC), DeepFake Detection (DFD)

Model(s)

Video Transformer with XceptionNet backbone

Author countries

United Arab Emirates