Self-Supervised Graph Transformer for Deepfake Detection

Authors: Aminollah Khormali, Jiann-Shiun Yuan

Published: 2023-07-27 17:22:41+00:00

AI Summary

This paper proposes a deepfake detection framework using a self-supervised graph Transformer. The framework leverages contrastive learning for pre-training a vision Transformer feature extractor, followed by graph convolutional networks and a Transformer discriminator for improved generalization and explainability.

Abstract

Deepfake detection methods have shown promising results in recognizing forgeries within a given dataset, where training and testing take place on the in-distribution dataset. However, their performance deteriorates significantly when presented with unseen samples. As a result, a reliable deepfake detection system must remain impartial to forgery types, appearance, and quality for guaranteed generalizable detection performance. Despite various attempts to enhance cross-dataset generalization, the problem remains challenging, particularly when testing against common post-processing perturbations, such as video compression or blur. Hence, this study introduces a deepfake detection framework, leveraging a self-supervised pre-training model that delivers exceptional generalization ability, withstanding common corruptions and enabling feature explainability. The framework comprises three key components: a feature extractor based on vision Transformer architecture that is pre-trained via self-supervised contrastive learning methodology, a graph convolution network coupled with a Transformer discriminator, and a graph Transformer relevancy map that provides a better understanding of manipulated regions and further explains the model's decision. To assess the effectiveness of the proposed framework, several challenging experiments are conducted, including in-data distribution performance, cross-dataset, cross-manipulation generalization, and robustness against common post-production perturbations. The results achieved demonstrate the remarkable effectiveness of the proposed deepfake detection framework, surpassing the current state-of-the-art approaches.

Key findings

The proposed method outperforms state-of-the-art approaches in cross-dataset generalization and robustness against common post-processing perturbations. The self-supervised pre-training significantly improves generalization ability. The graph Transformer architecture effectively captures complex interdependencies within the image.

Approach

The approach uses a self-supervised contrastive learning method to pre-train a vision Transformer for feature extraction. These features are then used to construct a graph representation of the image, which is fed into a graph convolutional network and a Transformer discriminator for deepfake classification. A graph Transformer relevancy map provides explainability.
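The graph stage of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the graph is built, so the cosine-similarity k-nearest-neighbour construction, the value of k, the random stand-in patch features, and every function name below are assumptions made for illustration. The ViT feature extractor, the Transformer discriminator, and the relevancy map are omitted. The propagation step is the standard graph convolution ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

```python
import math
import random


def cosine_knn_graph(features, k):
    """Build a symmetric kNN adjacency matrix with self-loops.

    Assumption: nodes are image patches and edges link each patch to its
    k most cosine-similar patches. The source does not specify this.
    """
    n = len(features)
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    sim = [
        [
            sum(a * b for a, b in zip(features[i], features[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other nodes by similarity to node i and keep the top k.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in others[:k]:
            adj[i][j] = adj[j][i] = 1.0  # undirected edge
        adj[i][i] = 1.0  # self-loop, i.e. the "+ I" term
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization: D^{-1/2} A D^{-1/2}."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # >= 1 because of the self-loops
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    out = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(h):
    """Average node embeddings into a single graph-level vector."""
    return [sum(col) / len(h) for col in zip(*h)]


random.seed(0)
# Stand-in for ViT patch embeddings: 6 patches, 4 dimensions each (hypothetical sizes).
features = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
# Untrained, randomly initialized layer weights mapping 4 -> 3 dimensions.
weights = [[random.gauss(0.0, 0.5) for _ in range(3)] for _ in range(4)]

adj = cosine_knn_graph(features, k=2)
a_hat = normalize_adjacency(adj)
node_embeddings = gcn_layer(a_hat, features, weights)
graph_embedding = mean_pool(node_embeddings)
```

In a full system, the node embeddings (or the pooled vector) would be passed to the discriminator for the real-versus-fake decision. Mean pooling is used here only to keep the sketch short.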

Datasets

FaceForensics++, Celeb-DF (V2), WildDeepfake, DeeperForensics, FaceShifter, DeepFake Detection Challenge (DFDC)

Model(s)

Vision Transformer (ViT), Graph Convolutional Network (GCN), Transformer discriminator, ResNet50 (for comparison)

Author countries

USA
