SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Authors: Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang

Published: 2026-03-26 08:01:35+00:00

AI Summary

This paper introduces SAVe, a self-supervised audio-visual deepfake detection framework that learns solely from authentic videos. SAVe generates identity-preserving, region-aware pseudo-manipulations to emulate visual tampering artifacts and models lip-speech synchronization to detect temporal misalignment. This approach enables robust detection without reliance on curated synthetic forgeries, mitigating dataset and generator bias.

Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.


Key findings

SAVe demonstrates competitive in-domain performance and strong cross-dataset generalization, outperforming visual-only and audio-visual baselines, especially under varying compression. The combination of region-specific visual artifact learning and audio-visual synchronization cues proves more stable and generalizable than individual components. The self-supervised approach effectively narrows the performance gap with fully supervised methods without using any synthetic deepfake training data.

Approach

SAVe learns authenticity cues solely from real videos through two complementary mechanisms. A Self-Supervised Visual Pseudo-Forgery Generator (SS-VPFG) creates region-wise pseudo-manipulations (FaceBlend, LipBlend, LowerFaceBlend) on the fly to train the visual artifact detection branches. In parallel, an AVSync module built on AV-HuBERT detects temporal lip-speech misalignment, and the predictions of all branches are fused for the final decision.
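The two self-supervised mechanisms can be sketched as follows. This is a minimal illustration, not the paper's implementation: the brightness jitter, soft alpha blend, the rectangular "lower-face" mask, and the circular audio shift are all assumptions standing in for the actual SS-VPFG blending operations and AVSync negative-pair construction.

```python
import numpy as np

def region_blend(frame, region_mask, rng):
    """Blend a mildly perturbed copy of the frame back into itself
    inside a facial region mask, yielding an identity-preserving
    pseudo-forgery (illustrative stand-in for FaceBlend / LipBlend /
    LowerFaceBlend)."""
    # Hypothetical perturbation: per-channel brightness jitter.
    jitter = rng.uniform(0.9, 1.1, size=(1, 1, frame.shape[2]))
    perturbed = np.clip(frame * jitter, 0.0, 255.0)
    # Soft alpha inside the region keeps the blending boundary subtle.
    alpha = region_mask[..., None] * rng.uniform(0.5, 1.0)
    return (1.0 - alpha) * frame + alpha * perturbed

def misaligned_pair(video_feats, audio_feats, shift):
    """Build a negative (out-of-sync) pair for the AVSync branch by
    circularly shifting audio features relative to the video frames."""
    return video_feats, np.roll(audio_feats, shift, axis=0)

rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 255.0, size=(96, 96, 3))
mask = np.zeros((96, 96))
mask[48:80, 24:72] = 1.0            # e.g., a lower-face region box
fake = region_blend(frame, mask, rng)

v, a = misaligned_pair(np.arange(10), np.arange(10), shift=3)
# Pixels outside the mask are untouched; inside they are perturbed,
# so the visual branch must localize subtle region-specific artifacts.
```

Because both the pseudo-forgery and the misaligned pair are derived from the same authentic clip, labels come for free, which is what lets training proceed without any curated synthetic deepfakes.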

Datasets

FakeAVCeleb, AV-LipSync-TIMIT

Model(s)

AV-HuBERT, Multilayer Perceptron (MLP) for AVSync, generic feature extractor (following [14]) for visual branches

Author countries

Taiwan, Japan