Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images

Authors: Sami Belguesmia, Mohand Saïd Allili, Assia Hamadene

Published: 2025-10-17 12:16:04+00:00

AI Summary

This paper proposes a robust multi-view architecture for DeepFake detection in natural face images, addressing challenges like pose variations and occlusions. It integrates three specialized encoders—global, middle, and local—to analyze facial features at multiple levels, along with a face orientation encoder. By fusing features from these encoders, the model achieves superior performance in detecting manipulated images, even under challenging conditions, outperforming conventional single-view approaches.

Abstract

DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders: a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions. Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches.


Key findings
The fusion of multiple views significantly enhances detection performance compared to single-view approaches, and incorporating face orientation information further improves classification robustness. The CNN-based variant (ResNet50) showed slightly better performance than the BeiT-based variant. The proposed method achieved a slight but consistent advantage in AUC scores over existing state-of-the-art methods on challenging datasets, demonstrating stronger generalization.
Approach
The authors propose a multi-view architecture with three specialized encoders: a global view for boundary inconsistencies, a middle view for texture/color alignment, and a local view for expressive facial regions (eyes, nose, mouth). Additionally, a face orientation encoder is used to classify face poses. Features from these encoders are fused via a multi-layer perceptron to enhance detection robustness across various viewing angles.
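The fusion pipeline described above can be sketched in a minimal, framework-free form: three view embeddings (global, middle, local crops of the face) plus a pose distribution from the orientation encoder are concatenated and passed through a small MLP. This is an illustrative NumPy sketch only; the encoder functions, feature dimension, pose-class count, and random weights below are assumptions standing in for the paper's trained ResNet50/BeiT and MobileNet components.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 128  # per-view embedding size (assumption; not given in the paper)
N_POSES = 3     # hypothetical pose classes, e.g. frontal / left / right

def encode_view(image, seed):
    """Stand-in for a view encoder (ResNet50 or BeiT in the paper):
    maps an image crop to a fixed-length feature vector."""
    r = np.random.default_rng(seed)
    w = r.standard_normal((image.size, FEAT_DIM)) * 0.01
    return image.reshape(-1) @ w

def encode_orientation(image):
    """Stand-in for the MobileNet-based face orientation encoder:
    returns a softmax distribution over pose classes."""
    r = np.random.default_rng(42)
    logits = image.reshape(-1) @ (r.standard_normal((image.size, N_POSES)) * 0.01)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def mlp_fuse(features, hidden=64):
    """Fuse the concatenated multi-view features with a two-layer MLP,
    producing a real/fake probability via a sigmoid."""
    r = np.random.default_rng(7)
    w1 = r.standard_normal((features.size, hidden)) * 0.01
    w2 = r.standard_normal((hidden, 1)) * 0.01
    h = np.maximum(features @ w1, 0.0)          # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ w2)[0]))   # sigmoid

# Toy "face image" and its three spatial views: the full image (global),
# an inner crop (middle), and a tight central crop (local regions).
face = rng.standard_normal((32, 32))
views = [face, face[4:28, 4:28], face[10:22, 10:22]]

feats = [encode_view(v, seed=i) for i, v in enumerate(views)]
feats.append(encode_orientation(face))
fused = np.concatenate(feats)

p_fake = mlp_fuse(fused)
print(float(p_fake))
```

With untrained random weights the output probability is meaningless; the point is the data flow: each view contributes one embedding, the orientation encoder contributes a pose vector, and only the concatenated representation reaches the fusion MLP.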
Datasets
OpenForensics [19], FaceForensics++ [28]
Model(s)
ResNet50 [32] (for CNN-based view encoders), BeiT [31] (for Transformer-based view encoders), MobileNet (for face orientation encoder), Multi-layer perceptron (MLP) (for feature fusion)
Author countries
Canada