AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures

Authors: Sifatullah Sheikh Urmi, Kirtonia Nuzath Tabassum Arthi, Md Al-Imran

Published: 2026-01-03 20:44:50+00:00

AI Summary

This paper investigates the effectiveness of four AI models, including three CNNs and a Vision Transformer, for classifying real versus fake face images (deepfake detection). Utilizing preprocessing and augmentation, the study compares the robustness of these architectures on a large dataset. The proposed Vision Fake Detection Network (VFDNET), based on the Vision Transformer, achieved the highest accuracy, demonstrating reliable deepfake detection capabilities.

Abstract

The increasing use of AI-generated deepfakes creates major challenges in maintaining digital authenticity. Four AI-based models, consisting of three CNNs and one Vision Transformer, were evaluated using large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET demonstrated superior accuracy, while MobileNetV3 showed efficient performance, demonstrating AI's capabilities for dependable deepfake detection.


Key findings
The Vision Transformer-based VFDNET achieved the best performance, with the highest accuracy at 99.13% and an F1-Score of 99.00%. MobileNetV3 ranked second with 98.00% accuracy. The results indicate that transformer-based and lightweight architectures are more effective for deepfake image detection than traditional, heavier CNNs like ResNet50.
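For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and F1-Score are typically computed for a binary real-versus-fake classifier. The function name `binary_metrics`, the label convention (1 = fake, 0 = real), and the toy labels are assumptions made for illustration; they are not taken from the paper, and the numbers below are not the paper's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Assumption: `positive` marks the "fake" class; the paper does not
    state which class it treats as positive.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only (hypothetical data).
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    preds = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(truth, preds))
```

Accuracy counts all correct predictions, while F1 balances precision and recall on the positive class, which is why the two reported figures can differ slightly.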
Approach
The approach involves evaluating four models—a custom CNN (DFCNET), two pre-trained CNNs (MobileNetV3, ResNet50), and a Vision Transformer (VFDNET)—on a large dataset of real and fake face images. All data underwent resizing and normalization, and robust data augmentation techniques (like AutoAugment) were applied to improve generalization for binary classification.
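The preprocessing steps named above (resizing and normalization, followed by augmentation) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the helper names, the nearest-neighbour resize, and the mean and standard deviation values are hypothetical, and a simple random horizontal flip stands in for AutoAugment, which applies a learned policy of several image operations.

```python
import random


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an image stored as rows of (r, g, b) tuples."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def normalize(image, mean, std):
    """Scale 0-255 pixel values to [0, 1], then standardize each channel.

    Assumption: `mean` and `std` are per-channel values chosen by the user;
    the paper does not list the values it used.
    """
    return [
        [tuple((c / 255.0 - m) / s for c, m, s in zip(px, mean, std))
         for px in row]
        for row in image
    ]


def random_hflip(image, p=0.5, rng=random):
    """Flip the image left-to-right with probability p.

    Stand-in augmentation only; this is not AutoAugment.
    """
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


if __name__ == "__main__":
    # A hypothetical 2x2 RGB image.
    img = [[(255, 0, 255), (0, 0, 0)],
           [(128, 128, 128), (255, 255, 255)]]
    resized = resize_nearest(img, 4, 4)
    augmented = random_hflip(resized, p=0.5)
    tensor_like = normalize(augmented, (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    print(len(tensor_like), len(tensor_like[0]))
```

Augmentation is normally applied only to training images, while resizing and normalization are applied to training and evaluation images alike so that the model sees consistently scaled inputs.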
Datasets
140K Real and Fake Faces Dataset (real faces sourced from Flickr; fake faces generated by StyleGAN).
Model(s)
DFCNET, MobileNetV3, ResNet50, VFDNET (Vision Transformer based).
Author countries
Bangladesh