Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Authors: Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo, Rui Yang

Published: 2026-01-29 09:35:27+00:00

AI Summary

This paper introduces RT-DeepLoc, a weakly supervised framework for multimodal deepfake temporal localization. It identifies forgeries by detecting significant reconstruction discrepancies generated by a Masked Autoencoder (MAE) trained exclusively on authentic data. The framework employs a novel Asymmetric Intra-video Contrastive Loss (AICL) and Multi-task Learning Reinforcement (MTLR) to robustly leverage these cues for precise localization, achieving state-of-the-art performance.

Abstract

Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods a practical necessity, which rely only on video-level labels. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly-supervised temporal forgery localization.
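The core localization idea above, turning per-frame reconstruction discrepancies from the MAE into temporal forgery segments, can be sketched as follows. This is an illustrative simplification: the function name, the hard threshold, and the run-length grouping are assumptions, since the paper's localization head is learned rather than thresholded.

```python
def localize_forgeries(recon_errors, threshold):
    """Convert a per-frame MAE reconstruction-error curve into temporal
    forgery segments. Contiguous runs of frames whose error exceeds
    `threshold` are reported as (start, end) intervals, end-exclusive.
    Illustrative sketch only; RT-DeepLoc learns its localization instead
    of applying a fixed threshold.
    """
    segments, start = [], None
    for i, err in enumerate(recon_errors):
        if err > threshold and start is None:
            start = i                       # entering a suspected forged run
        elif err <= threshold and start is not None:
            segments.append((start, i))     # leaving the run
            start = None
    if start is not None:                   # run extends to the final frame
        segments.append((start, len(recon_errors)))
    return segments
```

Because the MAE is trained only on authentic data, forged frames reconstruct poorly, so their errors rise above the baseline and form exactly the contiguous runs this sketch extracts.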


Key findings
RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization on the LAV-DF and AV-Deepfake1M datasets, significantly outperforming existing methods (e.g., 72.87% Avg. AP on LAV-DF versus LOCO's 44.28%). It also demonstrates superior generalization in cross-dataset evaluations, even surpassing fully supervised methods on unseen domains, confirming the robustness of modeling the consistency of authentic data.
Approach
RT-DeepLoc addresses weakly-supervised multimodal deepfake temporal localization by training a Masked Autoencoder (MAE) solely on authentic data. This MAE generates significant reconstruction errors for forged segments, serving as fine-grained forgery cues. These cues are then leveraged by a novel Asymmetric Intra-video Contrastive Loss (AICL) for robust feature separation and a Multi-task Learning Reinforcement (MTLR) strategy for stable, consistent predictions.
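The asymmetry in AICL, enforcing compactness only on the authentic side while merely pushing suspected forgeries past a margin, can be sketched as below. The function name, the pseudo-labeling threshold `tau`, and the hinge-style margin term are illustrative assumptions; the paper's exact loss formulation is not reproduced here.

```python
import math

def aicl_loss(features, recon_errors, tau=0.5, margin=1.0):
    """Illustrative asymmetric intra-video contrastive loss (not the paper's
    exact formulation). Segments with MAE reconstruction error below `tau`
    are treated as pseudo-authentic: their features are pulled toward the
    authentic centroid (compactness). High-error segments are only pushed
    beyond `margin`; no cluster structure is imposed on the forged side,
    which is what makes the loss asymmetric.
    """
    real = [f for f, e in zip(features, recon_errors) if e < tau]
    fake = [f for f, e in zip(features, recon_errors) if e >= tau]
    if not real:
        return 0.0
    dim = len(real[0])
    centroid = [sum(f[d] for f in real) / len(real) for d in range(dim)]

    def dist(f):
        return math.sqrt(sum((f[d] - centroid[d]) ** 2 for d in range(dim)))

    # Compactness: authentic features cluster tightly around their centroid.
    pull = sum(dist(f) ** 2 for f in real) / len(real)
    # Asymmetry: forged features only need to stay beyond the margin.
    push = sum(max(0.0, margin - dist(f)) ** 2 for f in fake) / max(len(fake), 1)
    return pull + push
```

Anchoring only the authentic cluster is what preserves generalization: unseen forgery types need not resemble training-time fakes, they only need to fall outside the compact authentic region.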
Datasets
LAV-DF, AV-Deepfake1M
Model(s)
The core model is RT-DeepLoc, which utilizes a Forgery Discovery Network based on a Masked Autoencoder (MAE) with Transformer encoders and decoders. It employs pre-trained TSN for visual feature extraction and Wav2Vec for audio feature extraction.
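Since TSN visual features and Wav2Vec audio features are extracted at different temporal rates, they must be brought onto a common timeline before the MAE can consume them jointly. A minimal nearest-neighbor alignment with per-frame concatenation is sketched below; the function name and the fusion-by-concatenation scheme are assumptions, as the paper's exact alignment strategy is not detailed here.

```python
def align_streams(visual_feats, audio_feats):
    """Nearest-neighbor temporal alignment of an audio feature sequence to
    the visual frame rate, followed by per-frame concatenation. A simplifying
    sketch; RT-DeepLoc's actual fusion scheme may differ.
    """
    n_v, n_a = len(visual_feats), len(audio_feats)
    fused = []
    for i, v in enumerate(visual_feats):
        # Map visual frame index i onto the audio timeline.
        j = min(n_a - 1, round(i * n_a / n_v))
        fused.append(v + audio_feats[j])  # concatenate the feature vectors
    return fused
```

Each fused vector then carries both modalities at a single video frame, so cross-modal inconsistencies (e.g., lip motion mismatching speech) surface as reconstruction error at that frame.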
Author countries
China