Exploring Spatial-Temporal Features for Deepfake Detection and Localization

Authors: Wu Haiwei, Zhou Jiantao, Zhang Shile, Tian Jinyu

Published: 2022-10-28 03:38:49+00:00

AI Summary

This paper proposes a Spatial-Temporal Deepfake Detection and Localization (ST-DDL) network that simultaneously leverages spatial and temporal features for detecting and localizing forged regions in videos. It introduces an Anchor-Mesh Motion (AMM) algorithm to extract precise facial micro-expression movements and a Fusion Attention (FA) module based on a Transformer architecture to effectively fuse these features. The authors also contribute a new public forgery dataset, ManualFake, which includes videos produced by commercial software and versions transmitted through online social networks.

Abstract

With the continuous research on Deepfake forensics, recent studies have attempted to provide the fine-grained localization of forgeries, in addition to the coarse classification at the video-level. However, the detection and localization performance of existing Deepfake forensic methods still has plenty of room for further improvement. In this work, we propose a Spatial-Temporal Deepfake Detection and Localization (ST-DDL) network that simultaneously explores spatial and temporal features for detecting and localizing forged regions. Specifically, we design a new Anchor-Mesh Motion (AMM) algorithm to extract temporal (motion) features by modeling the precise geometric movements of facial micro-expressions. Compared with traditional motion extraction methods (e.g., optical flow) designed to simulate large-moving objects, our proposed AMM can better capture small-displacement facial features. The temporal features and the spatial features are then fused in a Fusion Attention (FA) module based on a Transformer architecture for the eventual Deepfake forensic tasks. The superiority of our ST-DDL network is verified by experimental comparisons with several state-of-the-art competitors, in terms of both video- and pixel-level detection and localization performance. Furthermore, to impel the future development of Deepfake forensics, we build a public forgery dataset consisting of 6000 videos, with many new features such as using widely-used commercial software (e.g., After Effects) for the production, providing online social networks transmitted versions, and splicing multi-source videos. The source code and dataset are available at https://github.com/HighwayWu/ST-DDL.


Key findings

The ST-DDL network consistently outperforms state-of-the-art methods, achieving up to 8.9% higher video-level F1 and 4.1% higher pixel-level IoU across the testing datasets. Ablation studies confirm that the AMM algorithm provides more distinctive temporal forensic clues, and that the FA module significantly improves performance by effectively fusing spatial and temporal information. The method is also robust against online social network (OSN) transmission, though performance degrades under severe compression.
Approach

The ST-DDL network extracts temporal features using a novel Anchor-Mesh Motion (AMM) algorithm, which models the precise geometric movements of facial micro-expressions by gridding the face into a mesh, using detected landmarks as anchors. These temporal features are then fused with spatial features (from RGB frames) in a Fusion Attention (FA) module, built upon a Transformer architecture, to enhance feature interaction for both video-level detection and pixel-level localization.
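To make the anchor-mesh idea concrete, the following is a deliberately simplified sketch (not the authors' implementation): a mesh is laid over the face region, and each mesh vertex inherits the frame-to-frame displacement of its nearest landmark anchor, yielding a dense small-displacement motion field. The function name, the unit-square face region, and the nearest-anchor assignment are all illustrative assumptions; the real AMM models the geometry far more precisely.

```python
import numpy as np

def amm_motion_field(landmarks_t, landmarks_t1, grid_size=8):
    """Toy anchor-mesh motion sketch (illustrative only, not the
    paper's AMM): each vertex of a mesh over the face copies the
    displacement of its nearest landmark anchor between frames."""
    # Mesh vertices over the unit square (stand-in for the face box).
    ys, xs = np.meshgrid(np.linspace(0, 1, grid_size),
                         np.linspace(0, 1, grid_size), indexing="ij")
    verts = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (V, 2)
    # Displacement of each anchor between consecutive frames.
    disp = landmarks_t1 - landmarks_t                    # (A, 2)
    # Assign each mesh vertex the motion of its nearest anchor.
    d2 = ((verts[:, None, :] - landmarks_t[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                          # (V,)
    motion = disp[nearest]                               # (V, 2)
    return motion.reshape(grid_size, grid_size, 2)
```

With two anchors that both shift by (0.01, 0), every mesh vertex inherits that displacement, so the field is uniform; real micro-expressions would produce a spatially varying field that serves as the temporal forensic clue.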
Datasets

FF++ (for training); DFD, FFIW, ManualFake (for testing). ManualFake includes forgeries from DeepFaceLab (DFL), FSGAN, SimSwap, Reface, After Effects (AE), and OSN-transmitted versions (Facebook, WhatsApp, TikTok, WeChat).
Model(s)

Spatial-Temporal Deepfake Detection and Localization (ST-DDL) network, Anchor-Mesh Motion (AMM) algorithm, Fusion Attention (FA) module (Transformer-based), HRNet (for RGB and motion encoders), RetinaFace (for face detection and landmark calibration), MLP (for classification).
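The Transformer-based fusion in the FA module can be illustrated with a generic single-head cross-attention step, where spatial tokens query temporal (motion) tokens so that each spatial location attends to motion evidence. This is a minimal sketch under stated assumptions (identity projections instead of learned Q/K/V weights, one head, a plain residual add); the paper's FA module is more elaborate.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(spatial, temporal):
    """Illustrative single-head cross-attention fusion (not the
    authors' exact FA module): spatial tokens (Ns, d) query temporal
    tokens (Nt, d); learned projections are omitted for brevity."""
    d = spatial.shape[-1]
    attn = softmax(spatial @ temporal.T / np.sqrt(d))  # (Ns, Nt)
    return spatial + attn @ temporal                   # residual fusion
```

The residual add keeps the spatial stream intact while injecting attended motion features, which is the usual design rationale for attention-based multi-stream fusion.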
Author countries

China