FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

Authors: Dat Nguyen, Marcella Astrid, Enjie Ghorbel, Djamila Aouada

Published: 2024-10-29 11:36:49+00:00

AI Summary

FakeFormer addresses the suboptimal performance of Vision Transformers (ViTs) in deepfake detection by enhancing their ability to model localized forgery artifacts. It introduces an explicit attention learning mechanism, L2-Att, guided by artifact-vulnerable patches. FakeFormer achieves state-of-the-art generalization and computational efficiency across diverse datasets, without requiring extensive training data.

Abstract

Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance compared to Convolutional Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit suboptimal performance when detecting facial forgeries. Our analysis reveals that, compared to CNNs, ViTs struggle to model the localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning mechanism guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.


Key findings
FakeFormer achieves state-of-the-art deepfake detection performance, generalizing better to unseen datasets at a lower computational cost than existing methods. It does so without large-scale training datasets and shows improved robustness against various unseen perturbations.
Approach
FakeFormer extends Vision Transformers (ViTs) by incorporating a Learning-based Local Attention (L2-Att) module. This module explicitly guides the network to focus on artifact-vulnerable patches, which are identified using blending-based data synthesis techniques. This allows ViTs to effectively capture subtle local inconsistencies characteristic of deepfakes.
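The idea of guiding attention with artifact-vulnerable patches can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not give the exact definition of patch vulnerability or the form of the L2-Att objective. The sketch assumes that a blending-based synthesis step yields a soft blending mask m in [0, 1], scores each pixel with the boundary weight 4m(1 - m) (a weighting borrowed from blending-boundary detection, used here purely as an assumption), and averages that score over non-overlapping ViT-sized patches. The function names `patch_vulnerability` and `attention_guidance_loss` are hypothetical, and mean squared error is an assumed choice of loss.

```python
import numpy as np


def patch_vulnerability(mask: np.ndarray, patch_size: int) -> np.ndarray:
    """Score each non-overlapping patch by how much blending boundary it contains.

    mask: soft blending mask of shape (H, W) with values in [0, 1]
          (assumed to come from a blending-based synthesis step).
    Returns an array of shape (H // patch_size, W // patch_size).
    """
    h, w = mask.shape
    if h % patch_size or w % patch_size:
        raise ValueError("mask dimensions must be divisible by patch_size")

    # Assumed per-pixel score: peaks at 1 where m = 0.5 (the blending
    # boundary) and is 0 where the pixel is purely source or purely target.
    boundary = 4.0 * mask * (1.0 - mask)

    # Group pixels into a (grid_h, patch, grid_w, patch) layout and average
    # inside each patch to get one score per ViT patch.
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = boundary.reshape(grid_h, patch_size, grid_w, patch_size)
    return patches.mean(axis=(1, 3))


def attention_guidance_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between a predicted attention map and the target.

    MSE is an assumption made for this sketch, not the paper's stated loss.
    """
    return float(np.mean((predicted - target) ** 2))
```

In a training loop, the patch scores would serve as a per-patch target for an auxiliary attention head, so that patches overlapping the blending boundary receive the strongest supervision signal. How the paper weights this term against the classification loss is not stated in this summary.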
Datasets
FaceForensics++ (FF++), Celeb-DF (CDF1, CDF2), WildDeepfake (DFW), DeepFakeDetection (DFD), Deepfake Detection Challenge Preview (DFDCP), Deepfake Detection Challenge (DFDC)
Model(s)
FakeFormer (based on Vision Transformers), FakeSwin (based on Swin Transformers)
Author countries
Luxembourg, Tunisia