Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Authors: Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela

Published: 2025-12-03 19:00:07+00:00

AI Summary

The paper addresses the challenge of generalizing deepfake video detection to unseen manipulations by focusing on temporal artifacts beyond frame-to-frame instabilities. The authors propose a synthetic video generation method, KiMoI, that introduces subtle kinematic inconsistencies—violations of natural motion dependencies between facial regions—into pristine videos. A deepfake detector trained on this data achieves state-of-the-art generalization results across multiple benchmarks.

Abstract

Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.


Key findings
The hybrid training strategy combining spatial pseudo-fakes with the proposed KiMoI temporal artifacts achieved the highest average AUC, setting a new state of the art in generalizable deepfake video detection. The Landmark Perturbation Network (LPN, described under Approach below) generates semantically rich temporal artifacts that outperform analytical noise-based methods, excelling in particular on datasets dominated by temporal clues such as DFD and DFo. Performance also improved with a larger detector backbone (ViT-L), demonstrating the robustness of the generated pseudo-fakes.
Approach
The proposed KiMoI framework uses a Landmark Perturbation Network (LPN), implemented as a transformer autoencoder, to decompose facial landmark configurations into motion bases. By injecting controlled Gaussian noise into the weights of these bases, the LPN generates sequences with subtle, localized kinematic inconsistencies. These manipulated landmark sequences are then applied to the original video frames through a face morphing pipeline to create pseudo-fake training samples, which are combined with spatial pseudo-fakes (SBI) to train the detector.
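The PyTorch sketch below illustrates the perturbation step only, under stated assumptions: the class name, layer sizes, number of motion bases, and noise scale are illustrative choices, not the paper's actual LPN implementation. It shows how a transformer autoencoder can map a landmark sequence to per-frame weights over learned motion bases, add masked Gaussian noise to a subset of those weights to break correlations between facial motions, and decode a kinematically inconsistent landmark track.

```python
# Minimal sketch of the LPN-style perturbation, assuming a PyTorch setup.
# Dimensions and names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LandmarkPerturbationNet(nn.Module):
    """Toy transformer autoencoder over facial landmark sequences.

    Encodes per-frame landmark configurations into weights over a small
    set of learned motion bases, then decodes landmarks from the weights.
    In practice the autoencoder would first be trained to reconstruct
    pristine landmark sequences so the bases capture natural motion.
    """

    def __init__(self, n_landmarks=68, n_bases=16, d_model=128):
        super().__init__()
        self.in_proj = nn.Linear(n_landmarks * 2, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Per-frame weights over the motion bases (the latent code).
        self.to_weights = nn.Linear(d_model, n_bases)
        # Learned bases: map weights back to flattened (x, y) landmarks.
        self.bases = nn.Linear(n_bases, n_landmarks * 2)

    def forward(self, landmarks, noise_std=0.0, mask=None):
        # landmarks: (batch, frames, n_landmarks * 2) flattened coordinates
        h = self.encoder(self.in_proj(landmarks))
        w = self.to_weights(h)                       # (B, T, n_bases)
        if noise_std > 0:
            noise = noise_std * torch.randn_like(w)
            if mask is not None:                     # perturb selected bases
                noise = noise * mask
            w = w + noise                            # break motion dependencies
        return self.bases(w)                         # perturbed landmark track


# Usage: perturb a pristine landmark track into a kinematically
# inconsistent one by touching only a few bases (hypothetical values).
lpn = LandmarkPerturbationNet()
clip = torch.randn(1, 32, 68 * 2)                    # dummy 32-frame track
mask = torch.zeros(1, 1, 16)
mask[..., :4] = 1.0                                  # noise on 4 of 16 bases
fake_track = lpn(clip, noise_std=0.1, mask=mask)
```

Masking the noise to a subset of bases is what makes the artifact localized: most of the face keeps moving naturally while a few motion components drift out of step. In the full pipeline, the perturbed landmark track would then drive the face morphing step (for example, a piecewise-affine warp from the original landmarks to the perturbed ones) that turns pristine frames into pseudo-fake training clips.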
Datasets
FaceForensics++ (FF++), CelebV-HQ, Celeb-DFv2 (CDF), DFD, DFDCP, WildDeepFake (WDF), DeeperForensics-1.0 (DFo), DF40
Model(s)
MARLIN encoder (ViT-B and ViT-L configurations), Transformer (as Landmark Perturbation Network, LPN)
Author countries
Spain