M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

Authors: Haotian Wu, Yue Cheng, Shan Bian

Published: 2026-04-16 03:06:23+00:00

AI Summary

This paper introduces M3D-Net, a novel Multi-Modal 3D Facial Feature Reconstruction Network for deepfake detection, addressing the limitations of existing methods that fail to fully exploit multi-modal feature representations. M3D-Net employs an end-to-end dual-stream architecture to reconstruct fine-grained facial geometry and reflectance from single-view RGB images using a self-supervised 3D facial reconstruction module. It enhances detection performance through a 3D Feature Pre-fusion Module (PFM) and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features via attention mechanisms.

Abstract

With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.
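
The "adaptively adjusts multi-scale features" behavior attributed to the PFM could be sketched as a gated sum over backbone stages. The paper publishes no code, so the function name `prefuse_multiscale`, the scalar sigmoid gates, and the nearest-neighbour upsampling below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prefuse_multiscale(features, gates):
    """Hypothetical multi-scale pre-fusion in the spirit of the PFM.

    features: list of (C, H_i, W_i) maps from different backbone stages.
    gates: raw per-scale gate logits (learned in the real network;
           fixed scalars here for illustration).
    """
    target_h, target_w = features[0].shape[1:]
    fused = np.zeros(features[0].shape, dtype=float)
    for feat, g in zip(features, gates):
        _, h, w = feat.shape
        # Nearest-neighbour upsample each scale to the largest resolution
        rows = np.arange(target_h) * h // target_h
        cols = np.arange(target_w) * w // target_w
        up = feat[:, rows][:, :, cols]
        # Adaptively weight this scale with a sigmoid gate, then accumulate
        fused += sigmoid(g) * up
    return fused

# Toy usage: two stages at 8x8 and 4x4 resolution, neutral (zero) gates.
f_high = np.ones((4, 8, 8))
f_low = np.ones((4, 4, 4))
out = prefuse_multiscale([f_high, f_low], gates=[0.0, 0.0])
```

With zero gate logits each scale receives weight 0.5, so the two constant maps fuse back to a constant map at the higher resolution.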


Key findings

M3D-Net achieves state-of-the-art performance in intra-dataset evaluations on FaceForensics++ (c23) and demonstrates strong generalization in cross-resolution testing. It generally outperforms existing methods in cross-dataset evaluations, particularly on Celeb-DF v1, Celeb-DF v2, DeeperForensics-1.0, and DFD. Ablation studies confirm that the EfficientNet backbone, the 3D Feature Pre-fusion Module (PFM), and the attention mechanism in the Multi-modal Fusion Module (MFM) each contribute significantly to detection accuracy.

Approach

M3D-Net uses a dual-stream architecture comprising an RGB branch and a 3D feature reconstruction branch. The 3D branch reconstructs facial depth and albedo from single-view RGB images with a self-supervised module (Unsup3D). The reconstructed 3D features are integrated via a 3D Feature Pre-fusion Module (PFM), while RGB features are processed separately; both streams use an EfficientNet-B4 backbone. Finally, a Multi-modal Fusion Module (MFM) integrates the RGB and 3D features through attention mechanisms for classification.

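
The MFM's attention-based integration of the two streams could be sketched as follows. Since no implementation is released, the function name `attention_fuse`, the pooled-vector inputs, and the modality-level softmax weighting are assumptions made for illustration only; the real module operates on learned projections inside the network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(rgb_feat, feat_3d, w_q, w_k):
    """Hypothetical attention fusion of pooled RGB and 3D feature vectors.

    rgb_feat, feat_3d: (C,) descriptors from the two backbone streams.
    w_q: (C, d) and w_k: (d,) stand-in learned projections that score
    each modality before the softmax re-weighting.
    """
    feats = np.stack([rgb_feat, feat_3d])   # (2, C): one row per modality
    scores = feats @ w_q @ w_k              # (2,): one relevance score each
    weights = softmax(scores)               # attention over the two modalities
    # Re-weight each stream, then concatenate for the classifier head
    fused = np.concatenate([weights[0] * rgb_feat, weights[1] * feat_3d])
    return fused, weights

# Toy usage with random stand-in features and projections.
rng = np.random.default_rng(0)
C = 8
rgb_vec = rng.standard_normal(C)
vec_3d = rng.standard_normal(C)
w_q = rng.standard_normal((C, 4))
w_k = rng.standard_normal(4)
fused, weights = attention_fuse(rgb_vec, vec_3d, w_q, w_k)
```

The softmax guarantees the two modality weights are positive and sum to one, so neither stream is discarded outright; the classifier sees a concatenation whose halves are scaled by learned relevance.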
Datasets

FaceForensics++ (FF++), Deepfake Detection Challenge (DFDC), DeepfakeDetection (DFD), Deepfake Detection Challenge preview (DFDCP), FaceShifter (Fsh), Celeb-DF v1 (CDFv1), Celeb-DF v2 (CDFv2), DeeperForensics-1.0 (DF-1.0)

Model(s)

UNKNOWN

Author countries

China