DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Authors: Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li

Published: 2025-10-29 07:35:29+00:00

AI Summary

DeepShield is a novel deepfake video detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. It enhances the CLIP-ViT encoder through Local Patch Guidance (LPG), which captures fine-grained inconsistencies via patch-wise supervision, and Global Forgery Diversification (GFD), which synthesizes diverse forgeries to enhance cross-domain adaptability. DeepShield successfully mitigates overfitting to specific artifacts, achieving superior performance in cross-dataset evaluations.

Abstract

Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
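To make the idea of patch-wise supervision concrete, the following is a minimal sketch, not the authors' implementation. It assumes a per-pixel manipulation mask is available (the Approach section mentions blended videos with manipulation masks) and converts it into one binary label per image patch. The function name `patch_labels`, the `threshold` parameter, and the toy mask are hypothetical. For a ViT-B/16 backbone the patch size would be 16, so a 224x224 frame (an assumed input resolution) would yield a 14x14 label grid.

```python
def patch_labels(mask, patch_size, threshold=0.0):
    """Return a grid of 0/1 labels, one per patch_size x patch_size patch.

    mask: 2D list of floats in [0, 1]; 1.0 marks a fully manipulated pixel.
    A patch is labeled forged (1) when its mean mask value exceeds `threshold`.
    With the default threshold of 0.0, any manipulated pixel marks the patch.
    """
    height, width = len(mask), len(mask[0])
    if height % patch_size or width % patch_size:
        raise ValueError("mask size must be a multiple of patch_size")

    labels = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            # Sum the mask values inside the current patch.
            total = sum(
                mask[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            )
            mean = total / (patch_size * patch_size)
            row.append(1 if mean > threshold else 0)
        labels.append(row)
    return labels


# Toy 4x4 mask with 2x2 patches: only the top-left region is manipulated.
toy_mask = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
toy_labels = patch_labels(toy_mask, patch_size=2)
print(toy_labels)  # [[1, 0], [0, 0]]
```

Labels of this kind could supervise a per-patch classification head alongside the video-level real/fake label. How the paper actually thresholds or weights patches is not stated in this summary.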


Key findings
DeepShield achieved state-of-the-art performance in cross-dataset evaluations, surpassing the previous best methods on the challenging DFDCP and DFDC datasets by 6.9% and 5.8% AUC, respectively. Ablation studies showed that both LPG and GFD are necessary: combining them yields strong feature separation between real and manipulated data and superior generalization to unseen attacks.
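For readers unfamiliar with the metric, AUC is the area under the ROC curve, which equals the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison. It illustrates the metric only and is not the paper's evaluation code; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (fake, real) pairs ranked correctly.

    labels: 1 for fake, 0 for real.
    scores: predicted "fakeness"; higher means more likely fake.
    Tied pairs count as half a correct ranking.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("AUC needs at least one fake and one real sample")

    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Toy example: three of the four (fake, real) pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```

In a cross-dataset setting, the scores would come from a detector evaluated on a dataset it was not trained on. The pairwise loop is quadratic in the number of samples, so a rank-based implementation is preferable for large test sets.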
Approach
The method fine-tunes a CLIP-ViT encoder augmented with an ST-Adapter. LPG uses Spatiotemporal Artifact Modeling (SAM) to generate blended videos with manipulation masks, which provide patch-level supervision for local artifact detection. GFD implements Domain Feature Augmentation (DFA), comprising Domain-Bridging and Boundary-Expanding feature generation, to synthesize diverse forgery representations. The model is trained with a combination of cross-entropy and supervised contrastive losses.
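The combination of cross-entropy and supervised contrastive loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the supervised contrastive term follows the standard formulation (each anchor is pulled toward same-label samples and pushed away from the rest), while the `temperature` value, the `weight` that balances the two terms, and all function names are hypothetical, since the summary does not specify them. Feature vectors are assumed to be non-zero.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[y]
    return total / len(labels)


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def sup_con_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with a positive."""
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-label samples adds nothing
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def total_loss(logits, features, labels, weight=1.0):
    """Cross-entropy plus a weighted supervised contrastive term."""
    return cross_entropy(logits, labels) + weight * sup_con_loss(features, labels)


# Toy batch: two real (label 0) and two fake (label 1) samples.
toy_labels = [0, 0, 1, 1]
toy_logits = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(total_loss(toy_logits, toy_features, toy_labels))
```

In this sketch the contrastive term is small when same-label features cluster together, which matches the feature separation between real and manipulated data that the key findings describe.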
Datasets
FaceForensics++ (FF++), Celeb-DF v2 (CDF), DeepFake Detection Challenge (DFDC), DFDC Preview (DFDCP), Deepfake Detection (DFD).
Model(s)
CLIP ViT-B/16 (with ST-Adapter)
Author countries
China