A Novel Unified Approach to Deepfake Detection

Authors: Lord Sen, Shyamapada Mukherjee

Published: 2026-01-06 19:30:53+00:00

AI Summary

This paper presents a novel unified architecture for deepfake detection in images and videos, combining spatial- and frequency-domain analysis through a cross-attention mechanism. The architecture uses backbones such as Swin Transformer or EfficientNet-B4 together with BERT for feature extraction, and integrates a parallel module that detects blood flow under the skin. The proposed method achieves state-of-the-art results, including 99.80% and 99.88% AUC on the FF++ and Celeb-DF datasets, respectively, and generalizes well across domains.

Abstract

Advancements in the field of AI are increasingly giving rise to various threats. One of the most prominent is the synthesis and misuse of deepfakes. To sustain trust in the digital age, detecting and tagging deepfakes is essential. In this paper, a novel architecture for deepfake detection in images and videos is presented. The architecture uses cross-attention between spatial- and frequency-domain features, along with a blood detection module, to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results surpassing the state of the art: 99.80% and 99.88% AUC on FF++ and Celeb-DF, respectively, using Swin Transformer and BERT, and 99.55% and 99.38% using EfficientNet-B4 and BERT. The approach also generalizes well, achieving strong cross-dataset results.


Key findings
The proposed architecture achieved high performance, reaching 99.80% AUC on FF++ and 99.88% AUC on Celeb-DF with the Swin Transformer and BERT combination. The model also demonstrated strong cross-dataset generalization, scoring 94.01% AUC on Celeb-DF when trained solely on FF++, indicating robustness to unseen generation techniques.
Approach
The approach uses a dual-stream feature encoder that processes the input image's spatial features and its frequency-domain features (obtained via the discrete Fourier transform, DFT). These features are fused using a Cross-Stream Attention Fusion (CSAF) module. A parallel stream performs cross-attention-based detection of blood flow beneath the skin, and its resulting classification probability is combined with the main stream's output via a weighted average to produce the final deepfake prediction, as in the sketch below.
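The paper does not include code; the following PyTorch sketch illustrates one plausible reading of the described pipeline. The stand-in encoders, the embedding size, the module name CrossStreamAttentionFusion, and the fusion weight alpha are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the dual-stream pipeline described above.
# Backbones, dimensions, and the fusion weight `alpha` are assumptions.
import torch
import torch.nn as nn


class CrossStreamAttentionFusion(nn.Module):
    """Fuses spatial and frequency tokens via cross-attention (CSAF-style)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        # Spatial tokens attend to frequency tokens; residual + norm.
        fused, _ = self.attn(query=spatial, key=freq, value=freq)
        return self.norm(spatial + fused)


class DualStreamDetector(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for the spatial backbone (e.g. Swin/EfficientNet-B4)
        # and the frequency-feature encoder; real backbones would replace these.
        self.spatial_enc = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.freq_enc = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.fusion = CrossStreamAttentionFusion(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, img: torch.Tensor, blood_prob: torch.Tensor,
                alpha: float = 0.7) -> torch.Tensor:
        # Frequency branch: log-magnitude of the 2-D DFT of the image.
        spectrum = torch.fft.fft2(img).abs().log1p()
        spatial = self.spatial_enc(img).transpose(1, 2)   # (B, tokens, dim)
        freq = self.freq_enc(spectrum).transpose(1, 2)    # (B, tokens, dim)
        fused = self.fusion(spatial, freq).mean(dim=1)    # pool over tokens
        main_prob = torch.sigmoid(self.head(fused)).squeeze(-1)
        # Weighted average with the parallel blood-detection stream's output.
        return alpha * main_prob + (1 - alpha) * blood_prob


model = DualStreamDetector()
img = torch.randn(2, 3, 224, 224)
blood_prob = torch.rand(2)  # placeholder for the blood-detection stream
print(model(img, blood_prob))  # per-image probability of being fake
```

The blood-detection stream is kept abstract here (a probability input), since the paper treats it as a parallel module whose output is merged only at the final weighted-average step.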
Datasets
FaceForensics++ (FF++), Celeb-DF (CDF), WildDeepfake (WDF), DeepFakeDetection (DFD), DeepFake Detection Challenge (DFDC)
Model(s)
Swin Transformer, EfficientNet-B4, BERT, DistilBERT (compared), MobileNetV3, ResNet32, ResNet50 (general backbones mentioned)
Author countries
India