DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training

Authors: Saksham Kumar, Ashish Singh, Srinivasarao Thota, Sunil Kumar Singh, Chandan Kumar

Published: 2025-11-15 05:55:09+00:00

AI Summary

The paper introduces DeiTFake, a deepfake detection model leveraging a DeiT-based transformer and a novel two-stage progressive training strategy. This curriculum learning approach applies initial transfer learning with standard augmentations, followed by fine-tuning using advanced affine and deepfake-specific augmentations to boost robustness. DeiTFake achieved 99.22% accuracy and 0.9997 AUROC on the OpenForensics dataset, setting a new state-of-the-art benchmark.

Abstract

Deepfakes are major threats to the integrity of digital media. We propose DeiTFake, which pairs a DeiT-based transformer with a novel two-stage progressive training strategy of increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT's knowledge-distillation design helps capture subtle manipulation artifacts, increasing the robustness of the detector. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997 after stage two, outperforming the latest OpenForensics baselines. We analyze augmentation impact and training schedules, and provide practical benchmarks for facial deepfake detection.


Key findings
DeiTFake reached a peak accuracy of 99.22% and an AUROC of 0.9997 on the OpenForensics test set, significantly outperforming prior baselines. The ablation study confirmed that the two-stage progressive training structure, specifically the introduction of complex affine transformations in Stage II, measurably improved the model's generalization and geometric robustness.
Approach
The approach utilizes a pre-trained DeiT Vision Transformer and employs a dual-phase, progressively optimized training framework. Stage I performs standard transfer learning; Stage II fine-tunes the model under progressively more complex data augmentation, introducing advanced geometric transforms (RandomPerspective, ElasticTransform, and related affine warps) together with photometric ColorJitter to enhance robustness to manipulation artifacts.
Datasets
OpenForensics (used for primary training and evaluation), ImageNet (used for pre-trained weights of the DeiT backbone).
Model(s)
DeiT-base-patch16-224 (Data-Efficient Image Transformer/Vision Transformer).
Author countries
India