Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

Authors: Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng

Published: 2025-11-13 11:34:03+00:00

AI Summary

The paper proposes a single-stage multimodal deepfake detection and temporal localization framework utilizing next-frame feature prediction for enhanced generalization. The approach captures inconsistencies by measuring discrepancies between predicted and actual features across both uni-modal and cross-modal representations. A window-level attention mechanism focuses on local artifacts, enabling robust classification and precise temporal localization of partially spoofed videos.
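The window-level idea above can be illustrated with a minimal sketch: given per-frame discrepancies between predicted and actual features, averaging over a small sliding window makes local artifact clusters stand out over single noisy frames. This is not the paper's implementation; the window size and threshold here are illustrative assumptions.

```python
import numpy as np

def window_scores(residuals, win=4):
    # residuals: per-frame distance between predicted and actual features.
    # Average over a sliding window so local artifacts around each frame
    # dominate the score, rather than one noisy frame.
    T = len(residuals)
    pad = win // 2
    padded = np.pad(residuals, (pad, pad), mode="edge")
    return np.array([padded[t:t + win].mean() for t in range(T)])

# Toy example: frames 10-13 are "manipulated" (large prediction error).
res = np.full(20, 0.1)
res[10:14] = 1.0
scores = window_scores(res, win=4)
fake_segment = scores > 0.5  # thresholded window scores localize the spoofed span
```

Thresholding the smoothed scores yields a contiguous predicted segment, which is the behavior a temporal-localization head would refine.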

Abstract

Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches typically require pretraining on real samples. Moreover, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. We also introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.


Key findings
The single-stage model achieves competitive performance against state-of-the-art two-stage pretraining methods on intra-dataset and cross-manipulation generalization tasks, demonstrating strong robustness. It establishes a new state of the art for temporal deepfake localization on the LAV-DF dataset, with a 19.82% AP gain over UMMAFormer at 95% IoU. The masked-prediction approach also enhances interpretability by highlighting manipulated modalities and temporal segments through feature-difference heatmaps.
Approach
The method employs three masked-prediction modules (two uni-modal, one cross-modal), each using a causal transformer encoder/decoder to predict next-frame features. Local convolution-based cross-attention measures differences between predicted and actual frame features to detect inconsistencies. These discrepancy features are then integrated via alternating cross-attention and fed into separate prediction heads for classification or UMMAFormer-based temporal localization, guided by a frame-level contrastive loss.
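The causal constraint in those prediction modules can be sketched in a few lines: frame t may only attend to frames at or before t, so the output at step t is a prediction that can be compared against the actual features of frame t+1. This is a minimal single-head attention sketch in numpy, not the paper's transformer; feature dimensions and the L2 residual are illustrative assumptions.

```python
import numpy as np

def causal_attention(x):
    # x: (T, d) sequence of frame features.
    T, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)                  # (T, T) pairwise similarity
    mask = np.triu(np.ones((T, T), dtype=bool), 1)   # True strictly above the diagonal
    scores[mask] = -np.inf                           # block attention to future frames
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # row-wise softmax
    return w @ x                                     # causally mixed features

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
out = causal_attention(feats)

# Next-frame setup: the output at step t is trained to match the actual
# features of frame t+1; the residual is the inconsistency signal.
predicted, target = out[:-1], feats[1:]
residual = np.linalg.norm(predicted - target, axis=1)  # (T-1,) per-frame discrepancy
```

Because of the mask, perturbing a future frame leaves all earlier outputs unchanged, which is exactly the property that makes next-frame prediction a valid test for temporal inconsistencies.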
Datasets
FakeAVCeleb, VoxCeleb2, KoDF, LAV-DF, CREMA
Model(s)
ResNet-18 (Visual Encoder), ViT (Audio Encoder), Causal Transformer (Encoder/Decoder), UMMAFormer (Localization Head)
Author countries
Singapore