Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images

Authors: Sami Belguesmia, Mohand Saïd Allili, Assia Hamadene

Published: 2025-10-17 12:16:04+00:00

AI Summary

This paper proposes a robust multi-view architecture for DeepFake detection in natural face images, addressing challenges like pose variations and occlusions. It integrates three specialized encoders—global, middle, and local—to analyze facial features at multiple levels, along with a face orientation encoder. By fusing features from these encoders, the model achieves superior performance in detecting manipulated images, even under challenging conditions, outperforming conventional single-view approaches.

Abstract

DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders: a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions. Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches.


Key findings
The fusion of multiple views significantly enhances detection performance compared to single-view approaches, and incorporating face orientation information further improves classification robustness. The CNN-based variant (ResNet50) showed slightly better performance than the BeiT-based variant. The proposed method achieved a slight but consistent advantage in AUC scores over existing state-of-the-art methods on challenging datasets, demonstrating stronger generalization.
Approach
The authors propose a multi-view architecture with three specialized encoders: a global view for boundary inconsistencies, a middle view for texture/color alignment, and a local view for expressive facial regions (eyes, nose, mouth). Additionally, a face orientation encoder is used to classify face poses. Features from these encoders are fused via a multi-layer perceptron to enhance detection robustness across various viewing angles.
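The fusion pipeline described above can be sketched in a minimal, framework-free form: three view embeddings (global, middle, local crops of the face) plus a pose distribution from the orientation encoder are concatenated and passed through a small MLP. This is an illustrative NumPy sketch only; the encoder functions, feature dimension, pose-class count, and random weights below are assumptions standing in for the paper's trained ResNet50/BeiT and MobileNet components.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 128  # per-view embedding size (assumption; not given in the paper)
N_POSES = 3     # hypothetical pose classes, e.g. frontal / left / right

def encode_view(image, seed):
    """Stand-in for a view encoder (ResNet50 or BeiT in the paper):
    maps an image crop to a fixed-length feature vector."""
    r = np.random.default_rng(seed)
    w = r.standard_normal((image.size, FEAT_DIM)) * 0.01
    return image.reshape(-1) @ w

def encode_orientation(image):
    """Stand-in for the MobileNet-based face orientation encoder:
    returns a softmax distribution over pose classes."""
    r = np.random.default_rng(42)
    logits = image.reshape(-1) @ (r.standard_normal((image.size, N_POSES)) * 0.01)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def mlp_fuse(features, hidden=64):
    """Fuse the concatenated multi-view features with a two-layer MLP,
    producing a real/fake probability via a sigmoid."""
    r = np.random.default_rng(7)
    w1 = r.standard_normal((features.size, hidden)) * 0.01
    w2 = r.standard_normal((hidden, 1)) * 0.01
    h = np.maximum(features @ w1, 0.0)          # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ w2)[0]))   # sigmoid

# Toy "face image" and its three spatial views: the full image (global),
# an inner crop (middle), and a tight central crop (local regions).
face = rng.standard_normal((32, 32))
views = [face, face[4:28, 4:28], face[10:22, 10:22]]

feats = [encode_view(v, seed=i) for i, v in enumerate(views)]
feats.append(encode_orientation(face))
fused = np.concatenate(feats)

p_fake = mlp_fuse(fused)
print(float(p_fake))
```

With untrained random weights the output probability is meaningless; the point is the data flow: each view contributes one embedding, the orientation encoder contributes a pose vector, and only the concatenated representation reaches the fusion MLP.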
Datasets
OpenForensics [19], FaceForensics++ [28]
Model(s)
ResNet50 [32] (for CNN-based view encoders), BeiT [31] (for Transformer-based view encoders), MobileNet (for face orientation encoder), Multi-layer perceptron (MLP) (for feature fusion)
Author countries
Canada