Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Authors: Kutub Uddin, Nusrat Tasnim, Byung Tae Oh

Published: 2026-03-27 03:12:24+00:00

AI Summary

The paper proposes Face2Parts, a novel hybrid deepfake detection approach leveraging hierarchical feature representation (HFR). It extracts features from the frame, face, and key facial regions, using a channel-attention mechanism and deep triplet learning to capture inter-dependencies. This method improves deepfake detection and generalization across diverse benchmark datasets and manipulations.

Abstract

Multimedia data, particularly images and videos, is integral to many applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, both amateur and skilled counterfeiters can manipulate such content to create deepfakes, often with slanderous intent. To address this challenge, several forensic methods have been developed to verify the authenticity of the content. The effectiveness of these methods depends on their focus, and challenges arise from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Building on these insights, we propose a novel hybrid approach called Face2Parts, based on hierarchical feature representation (HFR), that exploits coarse-to-fine information to improve deepfake detection. The proposed method extracts features from the frame, the face, and key facial regions (i.e., lips, eyes, and nose) separately to explore coarse-to-fine relationships. This enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in intra-dataset, inter-dataset, and inter-manipulation settings. It achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR. The results demonstrate that our approach generalizes effectively and outperforms existing methods.


Key findings
The Face2Parts method consistently outperforms face-only approaches and most state-of-the-art methods across intra- and inter-dataset evaluations. It achieves high average AUCs (e.g., 98.42% on FF++, 84.07% on DFDC, 100% on WLDR) and demonstrates strong generalization to unseen datasets and manipulation types. The hierarchical feature representation effectively captures discriminative patterns, enhancing detection robustness.
Approach
The Face2Parts approach utilizes Hierarchical Feature Representation (HFR), extracting multi-level features from coarse (frame), medium (face), and fine-grained (lips, eyes, nose) facial regions. It employs a channel-attention mechanism and deep triplet learning to capture inter-dependencies among these regional features, generating discriminative embeddings for classification.
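The combination described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the feature dimension, the number of regions, the squeeze-and-excitation-style gating, and the Euclidean triplet margin are all illustrative assumptions, standing in for the paper's CA-MLP and deep triplet learning over per-region embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8      # per-region feature dimension (illustrative, not from the paper)
REGIONS = 5  # frame, face, lips, eyes, nose

def channel_attention(x, w1, w2):
    """Gate the channels of stacked per-region features (x: (REGIONS, DIM)).

    Squeeze: average across regions; excitation: a small 2-layer MLP
    with a sigmoid gate, applied to every region's feature vector.
    """
    s = x.mean(axis=0)                       # squeeze -> (DIM,)
    h = np.maximum(0.0, w1 @ s)              # excitation MLP, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # sigmoid channel weights
    return (x * gate).reshape(-1)            # re-weight, flatten to one embedding

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss on Euclidean distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Random stand-ins for per-region embeddings of an (anchor, positive, negative) triplet.
w1 = rng.standard_normal((DIM // 2, DIM))
w2 = rng.standard_normal((DIM, DIM // 2))
emb = [channel_attention(rng.standard_normal((REGIONS, DIM)), w1, w2)
       for _ in range(3)]
loss = triplet_loss(*emb)
print(loss)
```

In the full method, the per-region inputs would come from the pretrained backbones listed below rather than random vectors, and the fused embedding would feed the FC classification layer.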
Datasets
FF++ (FaceForensics++), Celeb-DF-V1 (CDF1), Celeb-DF-V2 (CDF2), Deepfake Detection (DFD), Deepfake Detection Challenge (DFDC), DeepfakeTIMIT (DTIM), Presidential Deepfake Dataset (PDD), World Leaders (WLDR).
Model(s)
For feature extraction: CLIP ViT-B/16, ViT-B/32, ViT-L/14, MViT, TVC, ViViT. For learning: Channel-Attention-based Multi-Layer Perceptron (CA-MLP) and deep triplet learning. For classification: Fully Connected (FC) layer.
Author countries
South Korea