Exposing and Mitigating Temporal Attack in Deepfake Video Detection

Authors: Zheyuan Gu, Minghao Shao, Zhen Wang, Yusong Wang, Mingkun Xu, Shijie Zhang, Hao Jiang

Published: 2026-05-08 07:53:57+00:00

AI Summary

Existing spatiotemporal deepfake detectors are vulnerable to evasion attacks because they overfit to fragile temporal-spectrum cues instead of learning robust semantic causality. This paper introduces SpInShield, a temporal spectral-invariant defense framework that decouples semantic motion from manipulatable spectral artifacts using a learnable spectral adversary and a shortcut suppression optimization strategy. SpInShield achieves competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

Abstract

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.
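Every result above is reported as AUC, with gaps given in percentage points. As a reading aid, here is a minimal sketch of how AUC can be computed from detector scores and how a percentage-point gap differs from a relative change. The function name `auc` and all scores and labels are made-up illustrations, not values or code from the paper.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability that a random fake clip scores higher
    than a random real clip, counting ties as one half.

    scores: detector outputs, higher means "more likely fake".
    labels: 1 for fake, 0 for real.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # all fake-vs-real pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

# Toy data (hypothetical): the same four clips scored by two detectors.
labels = [0, 0, 1, 1]
auc_a = auc([0.10, 0.40, 0.35, 0.80], labels)   # one misordered pair
auc_b = auc([0.10, 0.20, 0.35, 0.80], labels)   # perfectly ordered

# A percentage-point gap is the plain difference of the two AUCs times 100,
# not the relative change of one with respect to the other.
gap_pp = (auc_b - auc_a) * 100
```

The pairwise formulation is quadratic in the number of clips, so it is only meant for small illustrations; library implementations use sorting instead.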


Key findings
SpInShield achieves state-of-the-art cross-domain deepfake detection with an average AUC of 93.3%, surpassing the second-best competitor by 1.6 percentage points. It is robust to a range of temporal amplitude spectral attacks and to real-world video processing pipelines, with an average AUC gain of over 20 percentage points over strong baselines under synthetic attacks. SpInShield also extracts phase semantics effectively, maintaining 79.6% AUC even when amplitude spectrum information is entirely removed.
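To make the amplitude-removal setting concrete, the following is a minimal sketch of one way to build a phase-only clip. It assumes that "removing amplitude" means setting every temporal-frequency magnitude to 1 while keeping the phase; the summary does not state the exact procedure, so this is an assumption. The function name `phase_only_video` and the (T, H, W, C) layout are hypothetical.

```python
import numpy as np

def phase_only_video(video, eps=1e-8):
    """Return a clip whose temporal spectrum keeps phase but has unit amplitude.

    video: float array shaped (T, H, W, C); the FFT runs over the time axis.
    Assumption: amplitude removal is modeled as normalizing magnitudes to 1.
    """
    spectrum = np.fft.fft(video, axis=0)
    # Dividing by the magnitude leaves a unit-length phasor with the same phase.
    unit_phasors = spectrum / (np.abs(spectrum) + eps)
    # The input is real, so the normalized spectrum stays conjugate-symmetric
    # and the inverse transform is real up to floating-point error.
    return np.fft.ifft(unit_phasors, axis=0).real

# Toy usage on random data standing in for an 8-frame clip.
rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 4, 4, 3))
phase_clip = phase_only_video(clip)
```

A detector that still separates real from fake on such clips cannot be relying on temporal amplitude statistics, which is the property the 79.6% figure is meant to show.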
Approach
SpInShield mitigates temporal attacks with a learnable spectral adversary (LSA) that synthesizes severe amplitude spectral deformations while preserving phase. A Siamese VideoMAE V2 encoder processes both the original and the adversarially perturbed videos. A shortcut suppression optimization strategy, combining symmetric prediction invariance and spectral blindness, compels the encoder to extract robust phase-consistent forensic cues while purging unstable spectral statistics.
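The two ingredients named above can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the fixed per-frequency `gain` vector stands in for the learnable adversary, and a symmetric KL divergence is used as one plausible form of the symmetric prediction invariance term, whose exact definition the summary does not give. The spectral blindness term is not sketched. All function names are hypothetical.

```python
import numpy as np

def perturb_temporal_amplitude(video, gain):
    """Rescale temporal-frequency amplitudes of a clip while keeping phase.

    video: float array shaped (T, ...); the FFT runs over the time axis.
    gain:  positive scale per rFFT bin, length T // 2 + 1 (assumed fixed here;
           in SpInShield the deformation comes from a learned adversary).
    """
    spectrum = np.fft.rfft(video, axis=0)
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    bins = (-1,) + (1,) * (video.ndim - 1)      # broadcast gain over H, W, C
    scaled = amplitude * np.asarray(gain, dtype=float).reshape(bins)
    return np.fft.irfft(scaled * np.exp(1j * phase), n=video.shape[0], axis=0)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def symmetric_kl(logits_clean, logits_perturbed, eps=1e-12):
    """Mean of 0.5 * (KL(p||q) + KL(q||p)) over the batch.

    Assumption: a stand-in for the invariance loss between the two
    Siamese branches; it is zero when both branches predict identically.
    """
    p, q = softmax(logits_clean), softmax(logits_perturbed)
    log_ratio = np.log(p + eps) - np.log(q + eps)
    kl_pq = np.sum(p * log_ratio, axis=-1)
    kl_qp = np.sum(q * -log_ratio, axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

# Toy usage: random data in place of a real clip and real encoder outputs.
rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 4, 4, 3))
attacked = perturb_temporal_amplitude(clip, rng.uniform(0.5, 2.0, size=5))
loss = symmetric_kl(rng.normal(size=(6, 2)), rng.normal(size=(6, 2)))
```

In a training loop, the clean and attacked clips would each pass through the shared encoder and the invariance term would be added to the classification loss, so that predictions cannot depend on the amplitudes the adversary is free to change.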
Datasets
Celeb-DF-v2 (CDF-v2), DeepFakeDetection (DFD), DeeperForensics (DFo), WildDeepFake (WDF), FaceForensics++ (FF++), DiffSwap (on FFHQ), DaGAN (on FFHQ)
Model(s)
ViT-based VideoMAE V2 (backbone encoder)
Author countries
China, USA, Japan