Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection - The 2024 Global Deepfake Image Detection Challenge

Authors: Kohou Wang, Huan Hu, Xiang Liu, Zezhou Chen, Ping Chen, Zhaoxiang Liu, Shiguo Lian

Published: 2025-09-16 14:06:54+00:00

AI Summary

The Hierarchical Deep Fusion Framework (HDFF) is an ensemble-based deep learning architecture for facial forgery detection. It fuses features from four pre-trained models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT) and achieved a score of 0.96852 on the private leaderboard of the 2024 Global Deepfake Image Detection Challenge.

Abstract

The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT), which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition's private leaderboard, securing 20th place out of 184 teams and demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
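The fusion step described above can be pictured with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of the feature concatenation, assuming timm-style backbones that return pooled feature vectors; the model identifiers, shared 224x224 input, and binary head are illustrative assumptions, not the authors' exact configuration (timm, for instance, ships a Swin Transformer rather than Swin-MLP, so a Swin checkpoint stands in here).

```python
import torch
import torch.nn as nn
import timm

class HDFF(nn.Module):
    """Concatenate pooled features from several backbones, then classify."""
    def __init__(self, backbone_names, num_classes=2):
        super().__init__()
        # num_classes=0 makes timm return pooled feature vectors instead of logits.
        self.backbones = nn.ModuleList(
            [timm.create_model(name, pretrained=True, num_classes=0)
             for name in backbone_names]
        )
        fused_dim = sum(b.num_features for b in self.backbones)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        feats = [backbone(x) for backbone in self.backbones] # one vector per sub-model
        return self.classifier(torch.cat(feats, dim=1))      # fused prediction

# Illustrative timm identifiers; the paper's exact checkpoints may differ.
model = HDFF(["swin_base_patch4_window7_224",
              "coatnet_0_rw_224",
              "tf_efficientnetv2_s",
              "davit_base"])
```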


Key findings

HDFF achieved a score of 0.96852 on the private leaderboard, ranking 20th out of 184 teams. The study highlights full-network fine-tuning and the CosineAnnealingLR learning rate scheduler as key to achieving this performance in deepfake detection.
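As a concrete illustration of the scheduler, here is a minimal, self-contained PyTorch loop using torch.optim.lr_scheduler.CosineAnnealingLR. The stand-in model, optimizer choice, learning rate, and epoch count are assumptions for illustration, not values reported by the paper.

```python
import torch
import torch.nn as nn

# Stand-ins so the snippet runs on its own; substitute the fused model
# and a real data loader in practice.
model = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
num_epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    inputs = torch.randn(8, 16)               # dummy batch
    targets = torch.randint(0, 2, (8,))
    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # follow the cosine decay once per epoch
```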
Approach

HDFF concatenates the feature representations of four diverse pre-trained models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT) and trains a final classifier layer on the fused features. A multi-stage training strategy, moving from selective to comprehensive fine-tuning, is employed to optimize performance.
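One plausible reading of the multi-stage strategy is: first train only the fused classifier with the backbones frozen (selective), then unfreeze everything for full-network fine-tuning (comprehensive). The sketch below assumes the HDFF class from the earlier snippet; the stage boundaries and learning rates are illustrative assumptions.

```python
import torch

def set_backbones_trainable(model, trainable: bool):
    """Freeze or unfreeze every sub-model inside an HDFF-style module."""
    for backbone in model.backbones:
        for param in backbone.parameters():
            param.requires_grad = trainable

# Stage 1 (selective): train only the fusion classifier on frozen features.
set_backbones_trainable(model, False)
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
# ... run the training loop for a few epochs ...

# Stage 2 (comprehensive): unfreeze all four backbones for full-network fine-tuning.
set_backbones_trainable(model, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... continue training, e.g. with CosineAnnealingLR as above ...
```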
Datasets

MultiFFDI dataset, ImageNet-1K (for pre-training)

Model(s)

Swin-MLP, CoAtNet, EfficientNetV2, DaViT

Author countries

China