Morphology-optimized Multi-Scale Fusion: Combining Local Artifacts and Mesoscopic Semantics for Deepfake Detection and Localization

Authors: Chao Shuai, Gaojian Wang, Kun Pan, Tong Wu, Fanli Jin, Haohan Tan, Mengxiang Li, Zhenguang Liu, Feng Lin, Kui Ren

Published: 2025-09-17 07:46:07+00:00

AI Summary

This paper proposes a novel deepfake detection and localization framework that combines local and mesoscopic forgery cues. It uses morphological operations to fuse independently predicted manipulated regions from local and global perspectives, enhancing spatial coherence and suppressing noise. Extensive experiments demonstrate improved accuracy and robustness in forgery localization.

Abstract

While the pursuit of higher accuracy in deepfake detection remains a central goal, there is an increasing demand for precise localization of manipulated regions. Despite the remarkable progress made in classification-based detection, accurately localizing forged areas remains a significant challenge. A common strategy is to incorporate forged region annotations during model training alongside manipulated images. However, such approaches often neglect the complementary nature of local detail and global semantic context, resulting in suboptimal localization performance. Moreover, an often-overlooked aspect is the fusion strategy between local and global predictions. Naively combining the outputs from both branches can amplify noise and errors, thereby undermining the effectiveness of the localization. To address these issues, we propose a novel approach that independently predicts manipulated regions using both local and global perspectives. We employ morphological operations to fuse the outputs, effectively suppressing noise while enhancing spatial coherence. Extensive experiments reveal the effectiveness of each module in improving the accuracy and robustness of forgery localization.


Key findings
The proposed framework significantly outperforms existing methods in deepfake localization, achieving higher F1-score and IoU. The morphological mask fusion strategy proves crucial for improving spatial coherence and suppressing noise in localization predictions. The combination of local and global perspectives leads to more robust and accurate results.
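The reported F1-score and IoU are pixel-level localization metrics. For reference, here is a minimal sketch of how such metrics are typically computed for binary forgery masks; this is an illustration, not the authors' evaluation code:

```python
import numpy as np

def localization_metrics(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-8):
    """Pixel-level F1 and IoU for binary forgery masks (values in {0, 1})."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)

    tp = np.logical_and(pred, gt).sum()    # correctly predicted forged pixels
    fp = np.logical_and(pred, ~gt).sum()   # pristine pixels flagged as forged
    fn = np.logical_and(~pred, gt).sum()   # forged pixels that were missed

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return f1, iou
```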
Approach
The approach uses two separate networks: one focusing on local facial forgery details (LFDL) and another capturing mesoscopic semantics across the entire image (MITL). The outputs of these networks are fused using morphological operations (dilation and erosion) to refine the localization masks, improving accuracy and coherence.
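As a concrete illustration of the fusion step, the sketch below combines two independently predicted forgery-probability maps and cleans the result with morphological closing and opening (dilation and erosion via OpenCV). The threshold, kernel size, and the union rule are assumptions for illustration, not the paper's exact fusion procedure:

```python
import cv2
import numpy as np

def fuse_masks(local_prob: np.ndarray, global_prob: np.ndarray,
               thresh: float = 0.5, kernel_size: int = 5) -> np.ndarray:
    """Fuse per-pixel forgery probabilities from the local and global branches,
    then apply morphological operations to suppress noise and improve spatial
    coherence. Threshold and kernel size are illustrative, not from the paper."""
    # Binarize each branch's prediction independently.
    local_mask = (local_prob >= thresh).astype(np.uint8)
    global_mask = (global_prob >= thresh).astype(np.uint8)

    # One possible combination rule: union of the two predictions.
    fused = cv2.bitwise_or(local_mask, global_mask)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    # Closing (dilation then erosion) fills small holes inside predicted regions.
    fused = cv2.morphologyEx(fused, cv2.MORPH_CLOSE, kernel)
    # Opening (erosion then dilation) removes isolated noisy pixels.
    fused = cv2.morphologyEx(fused, cv2.MORPH_OPEN, kernel)
    return fused
```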
Datasets
Deepfake Detection and Localization Image (DDL-I) dataset, containing over 1.5 million samples with pixel-level annotations, covering 61 deepfake methods across four forgery types.
Model(s)
LFDL uses an Xception backbone, while MITL uses a hybrid Segformer-B3 and ConvNeXt-Tiny backbone. Both models have classification and localization branches and employ techniques such as cross-modality consistency enhancement, attention mechanisms, and multi-scale feature fusion.
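For orientation only, the sketch below lays out a generic two-branch detector in PyTorch, where each branch has a classification head and a localization head and the two masks are fused downstream (e.g., with the morphological step above). The backbones and heads are placeholders, not the paper's LFDL/MITL architectures:

```python
import torch
import torch.nn as nn

class BranchWithHeads(nn.Module):
    """Generic branch: a feature backbone plus classification and localization heads.
    The backbone here is a placeholder; the paper's branches use Xception (LFDL)
    and a hybrid Segformer-B3 / ConvNeXt-Tiny encoder (MITL)."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone                      # yields (B, feat_dim, H', W') feature maps
        self.cls_head = nn.Sequential(                # real/fake classification branch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 2)
        )
        self.loc_head = nn.Sequential(                # per-pixel forgery mask branch
            nn.Conv2d(feat_dim, 1, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, x):
        feats = self.backbone(x)
        logits = self.cls_head(feats)
        mask = self.loc_head(feats)                   # low-resolution mask, upsampled later
        return logits, mask


class TwoBranchDetector(nn.Module):
    """Run the local and global branches independently; their masks are fused
    afterwards, e.g., with the morphological fusion sketched earlier."""
    def __init__(self, local_branch: BranchWithHeads, global_branch: BranchWithHeads):
        super().__init__()
        self.local_branch = local_branch
        self.global_branch = global_branch

    def forward(self, face_crop, full_image):
        local_logits, local_mask = self.local_branch(face_crop)
        global_logits, global_mask = self.global_branch(full_image)
        return (local_logits, local_mask), (global_logits, global_mask)
```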
Author countries
China