AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

Authors: Christos Koutlis, Symeon Papadopoulos

Published: 2025-11-24 11:19:21+00:00

AI Summary

This work introduces AuViRe (Audio-Visual Speech Representation Reconstruction), a novel approach for temporal deepfake localization. AuViRe leverages amplified discrepancies generated when reconstructing speech representations from one modality based on the other, providing robust cues for forgery detection. The method achieves state-of-the-art performance on major benchmarks, including +8.9 AP@0.95 on LAV-DF and +9.6 AP@0.5 on AV-Deepfake1M.

Abstract

With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
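The core cue described in the abstract is a per-frame cross-modal reconstruction discrepancy. Below is a minimal sketch of that idea only, not the authors' code: `audio_feats` and `visual_feats` are assumed to be time-aligned speech representations of shape (T, D), e.g. from a frozen AV-HuBERT encoder, and `audio_to_visual` is a hypothetical learned reconstruction network.

```python
import torch
import torch.nn.functional as F

def reconstruction_discrepancy(audio_feats: torch.Tensor,
                               visual_feats: torch.Tensor,
                               audio_to_visual: torch.nn.Module) -> torch.Tensor:
    """Return a (T,) discrepancy curve; manipulated segments are expected to peak."""
    recon_visual = audio_to_visual(audio_feats)  # (T, D) visual features predicted from audio
    # Per-frame cosine distance between reconstructed and observed features.
    return 1.0 - F.cosine_similarity(recon_visual, visual_feats, dim=-1)
```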


Key findings
AuViRe achieved state-of-the-art results for Temporal Forgery Localization (TFL) on both LAV-DF and AV-Deepfake1M datasets, showing significant gains over previous methods. Although trained for localization, the model also achieved near-perfect video-level deepfake detection (e.g., 99.94 AUC on LAV-DF). Additionally, AuViRe demonstrated superior robustness and generalization compared to competitors in challenging real-world video scenarios.
Approach
AuViRe first extracts robust audio and visual speech representations using a frozen pre-trained model such as AV-HuBERT. A reconstruction module then performs cross-modal and unimodal reconstruction of these features; because reconstruction is substantially harder on manipulated segments, the reconstruction errors (discrepancies) are amplified there. These discrepancies are processed by a Reconstruction-Discrepancy Encoder and fed to classification and regression heads for fine-grained temporal localization, as sketched below.
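A minimal PyTorch sketch of the described pipeline. Module names, layer sizes, and the exact way discrepancies are formed and pooled are illustrative assumptions, not the authors' implementation; frozen AV-HuBERT features of shape (B, T, D) are assumed as input for both modalities.

```python
import torch
import torch.nn as nn

class ReconModule(nn.Module):
    """1D conv encoder/decoder mapping one modality's features to the other's (illustrative)."""
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, dim, kernel_size=3, padding=1),
        )

    def forward(self, x):                                   # x: (B, T, D)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # back to (B, T, D)

class AuViReSketch(nn.Module):
    def __init__(self, dim: int = 768, enc_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.a2v = ReconModule(dim)   # audio -> visual reconstruction
        self.v2a = ReconModule(dim)   # visual -> audio reconstruction
        # Discrepancy encoder over per-frame reconstruction errors.
        self.encoder = nn.Sequential(
            nn.Conv1d(2 * dim, enc_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv1d(enc_dim, num_classes, kernel_size=1)  # frame-level real/fake
        self.reg_head = nn.Conv1d(enc_dim, 2, kernel_size=1)            # segment boundary offsets

    def forward(self, audio_feats, visual_feats):           # both (B, T, D)
        d_v = (self.a2v(audio_feats) - visual_feats).abs()  # visual-side discrepancy
        d_a = (self.v2a(visual_feats) - audio_feats).abs()  # audio-side discrepancy
        h = self.encoder(torch.cat([d_v, d_a], dim=-1).transpose(1, 2))  # (B, enc_dim, T)
        return self.cls_head(h), self.reg_head(h)
```

The design choice this sketch mirrors is that localization is driven by the discrepancy signal rather than the raw features, so manipulated segments stand out as frames where cross-modal reconstruction fails.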
Datasets
LAV-DF, AV-Deepfake1M, in-the-wild curated dataset
Model(s)
AV-HuBERT (as backbone feature extractor), CNN/DeCNN (1D convolutional architectures for the reconstruction and encoding modules)
Author countries
Greece