Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach

Authors: Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang, Bin Li

Published: 2025-11-24 13:20:03+00:00

AI Summary

This paper introduces FoVB (Forgery-aware Audio-Visual Adaptation with Variational Bayes), a framework for generalizable multi-modal deepfake detection. The method reformulates audio-visual correlation learning as variational Bayesian estimation, approximating the correlation as a Gaussian-distributed latent variable. FoVB captures and disentangles intra-modal and cross-modal forgery traces efficiently, outperforming existing state-of-the-art detectors.
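To make the central idea concrete, the following is a minimal PyTorch sketch of modelling a cross-modal correlation as a Gaussian latent variable trained with the reparameterization trick. The module name, layer sizes, and fusion-by-concatenation are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GaussianCorrelation(nn.Module):
    """Sketch: approximate the audio-visual correlation as a Gaussian
    latent variable z ~ N(mu, sigma^2), sampled via reparameterization.
    All layer shapes here are assumptions for illustration."""
    def __init__(self, dim_a, dim_v, dim_z):
        super().__init__()
        self.mu = nn.Linear(dim_a + dim_v, dim_z)      # posterior mean
        self.logvar = nn.Linear(dim_a + dim_v, dim_z)  # posterior log-variance

    def forward(self, feat_a, feat_v):
        h = torch.cat([feat_a, feat_v], dim=-1)        # naive fusion (assumed)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to a standard-normal prior, as in standard VAEs
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```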

Abstract

The widespread application of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. It is therefore of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues for deepfake detection. In this paper, we reformulate correlation learning as variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces in both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. We then factorize the variable into modality-specific and correlation-specific components with an orthogonality constraint, allowing them to learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that FoVB outperforms other state-of-the-art methods on various benchmarks.
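The first design relies on difference convolutions and a high-pass filter. Below is a hedged sketch of a central difference convolution and a fixed Laplacian-style high-pass kernel of the kind the abstract alludes to; the specific convolution variants, kernel values, and the blending parameter theta are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Sketch of a central difference convolution (CDC): blends a vanilla
    convolution with a central-difference term that responds to local
    intensity changes, emphasising fine-grained forgery traces.
    theta (assumed value) balances the two terms."""
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        # Central-difference term: a 1x1 convolution with the spatial sum
        # of each kernel, i.e. the centre pixel weighted by the kernel mass.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_diff = F.conv2d(x, kernel_sum)
        return out - self.theta * out_diff

# A simple fixed 3x3 high-pass filter for global high-frequency residue
# (an assumed design; apply to a single-channel map via F.conv2d):
highpass = torch.tensor([[[[-1., -1., -1.],
                           [-1.,  8., -1.],
                           [-1., -1., -1.]]]]) / 8.0
```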


Key findings
The proposed FoVB framework achieves state-of-the-art performance and strong generalization on cross-manipulation and cross-dataset benchmarks (e.g., +5.5% AP on RVFA in cross-manipulation tests). The disentanglement provided by the factorized latent variables significantly enhances the model's ability to identify intrinsic forgery artifacts, yielding high robustness against various unseen perturbations.
Approach
FoVB adapts pre-trained backbones using two main modules. Global-Local Forgery-aware Adaptation (GLFA) applies difference convolutions and a high-pass filter to extract intra-modal forgery traces. Variational Bayesian Forgery Estimation (VBFE) estimates the audio-visual correlation as a Gaussian latent variable, which is then factorized into modality-specific and correlation-specific components under an orthogonality constraint to reduce entanglement (a sketch of such a constraint follows below).
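The orthogonality constraint can be illustrated as a penalty on the overlap between the two latent factors. The cosine-based form below is an assumed instantiation, since the paper's exact loss is not given in this summary; the weighting coefficients in the usage comment are likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(z_modal, z_corr):
    """Sketch: penalise the squared cosine similarity between the
    modality-specific and correlation-specific latents so the two
    factors carry less overlapping information. The paper's exact
    formulation may differ."""
    z_m = F.normalize(z_modal, dim=-1)
    z_c = F.normalize(z_corr, dim=-1)
    return (z_m * z_c).sum(dim=-1).pow(2).mean()

# Usage sketch: combine with the detection and KL terms
# (lambda_kl and lambda_orth are hypothetical weights):
# loss = bce_loss + lambda_kl * kl + lambda_orth * orthogonality_loss(z_m, z_c)
```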
Datasets
FakeAVCeleb, KoDF, DeAVMiT, DFDC, LAV-DF, IDForge
Model(s)
ViT (Vision Transformer), SwinT-large (tested in ablations)
Author countries
China