Towards Generalizable Deepfake Image Detection with Vision Transformers

Authors: Kaliki V Srinanda, M Manvith Prabhu, Hemanth K Mogilipalem, Jayavarapu S Abhinai, Vaibhav Santhosh, Aryan Herur, Deepu Vijayasenan

Published: 2026-04-19 10:59:17+00:00

Comment: 5 pages, 9 figures, SP Cup - ICASSP 2025

AI Summary

This paper introduces a generalizable method for deepfake image detection using an ensemble of fine-tuned vision transformers, including DINOv2, AIMv2, and OpenCLIP's ViT-L/14. The approach leverages the challenging and diverse DF-Wild dataset to create a robust detector. Experimental results demonstrate that the ensemble significantly outperforms individual models and strong CNN baselines, setting a new state-of-the-art on the DF-Wild test set.

Abstract

Detecting deepfake images has become increasingly challenging due to the rapid evolution of modern generative models and the poor generalization capability of existing methods. In this paper, we use an ensemble of fine-tuned vision transformers, namely DINOv2, AIMv2, and OpenCLIP's ViT-L/14, to create a generalizable method for detecting deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025 because it contains a challenging and diverse set of manipulations and generation techniques. We began our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms the individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% in AUC and 8% in EER. This was the winning solution for the SP Cup, presented at ICASSP 2025.


Key findings
The ensemble of vision transformers achieved an AUC of 96.77% and an EER of 9% on the DF-Wild test set, significantly surpassing both the individual models and strong CNN baselines. The proposed method also outperformed the state-of-the-art algorithm Effort, improving AUC by 7.05% and lowering EER by 8%, which highlights its improved generalization across diverse deepfake manipulations.
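
For reference, here is a minimal sketch of how AUC and EER are typically computed from detector scores, using scikit-learn's ROC utilities; the function name compute_auc_eer is illustrative and not from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def compute_auc_eer(labels, scores):
    """AUC and Equal Error Rate from binary labels (1 = fake) and fake-probability scores."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is the operating point where the false-positive rate
    # equals the false-negative rate; take the closest crossing.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return auc, eer
```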
Approach
The authors propose an ensemble of fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP's ViT-L/14) to detect deepfake images. These pre-trained models are adapted for binary classification by adding dense layers and fine-tuned on the DF-Wild dataset. The final prediction is obtained by averaging the sigmoid outputs of the individual models in the ensemble.
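
A minimal sketch of this design is shown below, assuming each pre-trained backbone returns a per-image feature vector. The head width (256) and module names are assumptions for illustration, since the summary only specifies added dense layers and averaging of sigmoid outputs:

```python
import torch
import torch.nn as nn

class ViTBinaryHead(nn.Module):
    """A ViT backbone (e.g., DINOv2, AIMv2, or OpenCLIP ViT-L/14) with a
    dense head for binary real-vs-fake classification.
    `backbone` is any module mapping images to (batch, feat_dim) features."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),  # head width is an illustrative choice
            nn.ReLU(),
            nn.Linear(256, 1),         # single logit: fake vs. real
        )

    def forward(self, x):
        return self.head(self.backbone(x))

@torch.no_grad()
def ensemble_predict(models, images):
    """Average the sigmoid outputs of the individual fine-tuned models,
    as described in the approach. Models should be in eval mode."""
    probs = [torch.sigmoid(m(images)) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)  # (batch, 1) fake probability
```

At inference time, each image is scored by every fine-tuned transformer and the mean sigmoid probability is thresholded (or fed to ROC analysis) to decide real versus fake.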
Datasets
DF-Wild dataset (a compilation of Celeb-DF-v1 and v2, FaceForensics++, DeepfakeDetection, FaceShifter, UADFV, Deepfake Detection Challenge Preview, and Deepfake Detection Challenge)
Model(s)
DINOv2, AIMv2, and OpenCLIP ViT-L/14 (fine-tuned and ensembled)
Author countries
India