Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

Authors: Zihe Wei, Yuezun Li

Published: 2026-04-10 08:42:31+00:00

AI Summary

This paper introduces AVPF (Audio-Visual Pseudo-Fakes), a novel method to significantly improve the generalizability of video deepfake detection. AVPF generates diverse pseudo-fake training samples solely from authentic videos by simulating common audio-visual correspondence patterns found in real deepfakes. This approach, which requires no actual deepfake samples for training, leads to an average performance improvement of up to 7.4% across multiple standard datasets.

Abstract

Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.


Key findings
The proposed AVPF method significantly enhances deepfake detection generalizability, achieving an average performance improvement of up to 7.4% across multiple standard datasets. It demonstrates performance gains of 6.7% in AUC and 8.0% in AP compared to the AVH-Align method. AVPF also exhibits superior robustness against various post-processing operations like JPEG compression, Gaussian blur, Gaussian noise, pixelation, and color inversion, capturing more robust forgery cues.
Approach
The method addresses detection generalizability by generating Audio-Visual Pseudo-Fakes (AVPF) from authentic videos, which mimic diverse audio-visual inconsistencies present in real deepfakes. It uses two strategies: Audio-Visual Self-Blending (AVSB) to introduce inter-modality inconsistencies via temporal shifts in either audio or visual streams, and Audio-Visual Self-Splicing (AVSS) to create intra-modality temporal inconsistencies within each modality. These pseudo-fakes serve as negative training data alongside authentic videos.
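The two strategies can be illustrated with a minimal sketch. The function names, array conventions, and shift/splice parameters below are assumptions for illustration, not the paper's actual implementation: AVSB breaks cross-modal synchronization by temporally shifting one stream relative to the other, while AVSS breaks within-stream temporal continuity by rearranging segments of a single modality.

```python
import numpy as np

def avsb_shift(frames: np.ndarray, audio: np.ndarray, shift: int):
    """Audio-Visual Self-Blending sketch (hypothetical helper): roll the
    audio stream in time so it no longer aligns with the visual frames,
    creating an inter-modality inconsistency. Shifting the frames instead
    of the audio works analogously."""
    shifted_audio = np.roll(audio, shift, axis=0)
    return frames, shifted_audio

def avss_splice(stream: np.ndarray, cut: int) -> np.ndarray:
    """Audio-Visual Self-Splicing sketch (hypothetical helper): swap two
    temporal segments of a single stream (audio or visual), creating an
    intra-modality temporal inconsistency while keeping length unchanged."""
    head, tail = stream[:cut], stream[cut:]
    return np.concatenate([tail, head], axis=0)
```

Either transformation, applied to an authentic clip, yields a pseudo-fake negative sample; the clip's content is untouched, only its temporal audio-visual correspondence is perturbed.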
Datasets
VoxCeleb2, FakeAVCeleb, AV-Deepfake1M, AVLips, TalkingHeadBench
Model(s)
AV-HuBERT
Author countries
China