Self-Supervised Graph Transformer for Deepfake Detection

Authors: Aminollah Khormali, Jiann-Shiun Yuan

Published: 2023-07-27 17:22:41+00:00

AI Summary

This study introduces a deepfake detection framework that uses a self-supervised pre-trained model to generalize to unseen samples and remain robust to post-processing perturbations. The framework integrates a Vision Transformer feature extractor, a graph convolution network, and a Transformer discriminator. A graph Transformer relevancy map adds explainability by highlighting manipulated regions. The authors report state-of-the-art performance across several challenging deepfake detection scenarios.

Abstract

Deepfake detection methods have shown promising results in recognizing forgeries within a given dataset, where training and testing take place on the same data distribution. However, their performance deteriorates significantly when presented with unseen samples. As a result, a reliable deepfake detection system must remain impartial to forgery types, appearance, and quality for guaranteed generalizable detection performance. Despite various attempts to enhance cross-dataset generalization, the problem remains challenging, particularly when testing against common post-processing perturbations, such as video compression or blur. Hence, this study introduces a deepfake detection framework, leveraging a self-supervised pre-training model that delivers exceptional generalization ability, withstanding common corruptions and enabling feature explainability. The framework comprises three key components: a feature extractor based on vision Transformer architecture that is pre-trained via self-supervised contrastive learning methodology, a graph convolution network coupled with a Transformer discriminator, and a graph Transformer relevancy map that provides a better understanding of manipulated regions and further explains the model's decision. To assess the effectiveness of the proposed framework, several challenging experiments are conducted, covering in-distribution performance, cross-dataset and cross-manipulation generalization, and robustness against common post-production perturbations. The results achieved demonstrate the remarkable effectiveness of the proposed deepfake detection framework, surpassing the current state-of-the-art approaches.


Key findings
The proposed framework achieved exceptional in-dataset detection accuracy and significantly improved cross-dataset and cross-manipulation generalization, outperforming state-of-the-art methods with an average AUC of 90.8% for cross-dataset generalization. It also demonstrated high resilience to common post-processing perturbations, such as compression, blur, and noise, achieving a 96.2% average AUC across various corruption types.
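For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of how an AUC and an "average AUC" can be computed. It uses the standard rank interpretation of ROC AUC (the probability that a randomly chosen fake scores higher than a randomly chosen real sample, with ties counted as one half). The function names `roc_auc` and `average_auc`, the label convention (1 = fake, 0 = real), and the unweighted mean over datasets are assumptions made for illustration; the summary does not state how the authors averaged their results.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 (assumed convention: 1 = fake, 0 = real).
    scores: iterable of floats, higher means "more likely fake".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_auc(per_dataset):
    """Unweighted mean of per-dataset AUCs (an assumed averaging scheme).

    per_dataset: dict mapping a dataset name to a (labels, scores) pair.
    Returns (mean_auc, dict of individual AUCs).
    """
    aucs = {name: roc_auc(y, s) for name, (y, s) in per_dataset.items()}
    return sum(aucs.values()) / len(aucs), aucs
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a sorting-based implementation would be the usual choice for full test sets.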
Approach
The framework utilizes a self-supervised contrastive learning methodology to pre-train a Vision Transformer as a feature extractor, extracting high-level visual representations. These features form nodes in a graph, which is then processed by a graph convolution network coupled with a Transformer discriminator for classification. A graph Transformer relevancy map provides insights into manipulated regions.
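The graph step described above can be illustrated with a small sketch of a single graph convolution over patch features. It applies the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W), where X holds one feature vector per node and W is a weight matrix. This is not the authors' implementation: the summary does not say how edges are defined, so the 4-connected patch grid in `grid_adjacency` is an assumption, as are all function names, the toy feature values, and the use of plain Python lists in place of a tensor library.

```python
import math


def grid_adjacency(rows, cols):
    """Adjacency matrix linking each patch to its 4-connected neighbours.

    Assumption for illustration: the source does not specify the edge structure.
    """
    n = rows * cols
    adj = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, with self-loops added before normalizing."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(node_features, norm_adj, weight):
    """One graph convolution: aggregate neighbours, project, then apply ReLU."""
    propagated = matmul(norm_adj, node_features)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy usage: a 2x2 grid of patches with 3-dim features projected to 2 dims.
# Real ViT patch embeddings would be much larger; these values are made up.
toy_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
toy_weight = [[0.5, -0.5], [0.25, 0.5], [-0.25, 0.5]]
toy_output = gcn_layer(toy_features, normalize_adjacency(grid_adjacency(2, 2)), toy_weight)
```

In the framework as summarized, node features would come from the pre-trained Vision Transformer and the graph output would feed the Transformer discriminator; both stages are omitted here to keep the sketch self-contained.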
Datasets
FaceForensics++, Celeb-DF (V2), WildDeepfake, DeeperForensics, FaceShifter, DeepFake Detection Challenge (DFDC)
Model(s)
Vision Transformer (ViT), Graph Convolutional Network (GCN), Transformer (for discriminator)
Author countries
USA