LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild

Authors: Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo

Published: 2026-04-04 02:56:11+00:00

Comment: 2nd place (out of 94 teams) in the NTIRE 2026 Robust Deepfake Detection Challenge

AI Summary

LOGER proposes a Local–Global Ensemble framework for robust deepfake detection by combining global-level anomaly detection with local-level forgery trace identification. It leverages heterogeneous vision foundation models at multiple resolutions for global analysis and a Multiple Instance Learning (MIL) top-k aggregation strategy for local patch-level modeling. This approach significantly enhances robustness and generalization across diverse manipulation methods and real-world degradation conditions, achieving 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge.

Abstract

Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal–Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-k aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
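
The evidence-dilution argument for top-k pooling can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the function names, the patch count of 196, the logit values, and the choice k=4 are all hypothetical and chosen only to show how averaging over every patch can hide a small manipulated region that top-k pooling still surfaces.

```python
import heapq
from statistics import fmean


def topk_mil_pool(patch_logits, k):
    """Image-level logit as the mean of the k highest patch logits.

    Illustrative only: the source names a MIL top-k aggregation but does not
    give its exact form, so mean-of-top-k is an assumption made here.
    """
    if not patch_logits:
        raise ValueError("patch_logits must not be empty")
    k = max(1, min(k, len(patch_logits)))  # clamp k to a valid range
    return fmean(heapq.nlargest(k, patch_logits))


def mean_pool(patch_logits):
    """Baseline: average over all patches (global averaging)."""
    return fmean(patch_logits)


# Hypothetical image: 196 patches, only 4 of them look manipulated.
patch_logits = [6.0] * 4 + [-2.0] * 192

topk_logit = topk_mil_pool(patch_logits, k=4)  # 6.0, driven by suspicious patches
mean_logit = mean_pool(patch_logits)           # about -1.84, dominated by normal patches
```

With these made-up numbers, global averaging yields a negative (real-looking) image logit, while top-k pooling keeps the strong response of the four suspicious patches.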


Key findings

LOGER placed 2nd out of 94 teams in the NTIRE 2026 Robust Deepfake Detection Challenge and generalized well across multiple public benchmarks. It is reported to outperform prior state-of-the-art methods and to remain robust under real-world degradations such as JPEG compression, resizing, and blurring. Both the local-global ensemble and logit-space fusion were found to be important to this robustness.

Approach

LOGER is a two-branch ensemble. The global branch runs heterogeneous vision foundation model backbones (DINOv3-H, MetaCLIP2-H) at multiple resolutions to capture holistic anomalies. The local branch uses DINOv3-L for patch-level modeling, with a Multiple Instance Learning (MIL) top-k aggregation strategy that pools only the most suspicious patches and dual-level supervision at the image and patch levels. Predictions from both branches are fused in logit space for the final classification.
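
Logit-space fusion can be sketched as follows. This is a hedged illustration rather than the paper's fusion rule: the equal default weights, the clipping constant `eps`, and the example probabilities are assumptions, and the function names are hypothetical. Each model's fake probability is mapped to a logit, the logits are averaged, and the result is mapped back to a probability.

```python
import math


def logit(p, eps=1e-6):
    """Inverse sigmoid, with clipping so p = 0 or 1 does not overflow."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse_logits(probs, weights=None):
    """Weighted average of per-model logits, returned as a probability.

    Equal weights are an assumption; the source does not state the weights.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    if weights is None:
        weights = [1.0] * len(probs)
    if len(weights) != len(probs):
        raise ValueError("weights and probs must have the same length")
    z = sum(w * logit(p) for w, p in zip(weights, probs)) / sum(weights)
    return sigmoid(z)


# Hypothetical outputs: one branch is very confident, two are near chance.
branch_probs = [0.999, 0.6, 0.55]
fused = fuse_logits(branch_probs)
prob_average = sum(branch_probs) / len(branch_probs)
```

In this example the fused probability is higher than the plain probability average, because a confident prediction carries more weight in logit space than it does after being squashed toward 1 by the sigmoid.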

Datasets

NTIRE 2026 Robust Deepfake Detection Challenge, HydraFake, FaceForensics++, DF40, Celeb-DF, ScaleDF, CDF-v2, DFD, DFDC, DFDCP, WDF

Model(s)

DINOv3-Huge, DINOv3-Large, MetaCLIP2-Huge

Author countries

China