ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Authors: Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik

Published: 2026-01-05 18:59:54+00:00

AI Summary

This paper introduces ExposeAnyone, a fully self-supervised framework for robust zero-shot face forgery detection built on personalized audio-to-expression diffusion models. The underlying model, EXAM, is pre-trained on real videos and then personalized to specific subjects using subject-specific adapters. Detection quantifies the identity discrepancy between a suspected video and the personalized subject through content-agnostic diffusion reconstruction errors.

Abstract

Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is that, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.


Key findings
ExposeAnyone achieved a state-of-the-art average AUC of 95.22% across traditional deepfake benchmarks, surpassing the previous best method by 4.22 percentage points. The method demonstrated high robustness to corruptions like blur and compression, and proved capable of detecting highly challenging Sora2-generated videos (94.44% AUC on S2CFP), where prior methods failed to generalize effectively.
Approach
The core approach pre-trains an Audio-to-Expression Diffusion Model (EXAM) to map Wav2Vec 2.0 audio features to 3DMM facial expression coefficients extracted from video frames. The model is then personalized to specific subjects via subject-specific adapter tokens. Deepfakes are exposed by computing a content-agnostic authentication score, defined as the ratio of diffusion reconstruction distances obtained with and without the subject's identity adapter.
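A minimal, hypothetical sketch of this scoring rule is shown below. It assumes an audio-conditioned expression denoiser with a toggleable identity adapter and a DDPM noise schedule; the class, method, and argument names (ToyExpressionDenoiser, alpha_bar, use_adapter, and the fixed noise level) are illustrative assumptions, not the authors' actual interface.

```python
import torch


class ToyExpressionDenoiser(torch.nn.Module):
    """Minimal stand-in for an audio-conditioned expression denoiser (assumed interface)."""

    def __init__(self, expr_dim=50, audio_dim=768, n_steps=1000):
        super().__init__()
        self.net = torch.nn.Linear(expr_dim + audio_dim + 1, expr_dim)
        # Stand-in for the subject-specific adapter tokens.
        self.adapter = torch.nn.Parameter(torch.zeros(expr_dim))
        betas = torch.linspace(1e-4, 0.02, n_steps)
        self.register_buffer("alpha_bars", torch.cumprod(1.0 - betas, dim=0))

    def alpha_bar(self, t):
        return self.alpha_bars[t]

    def forward(self, noisy_expr, t, audio_feat, use_adapter):
        t_feat = torch.full((noisy_expr.shape[0], 1), float(t) / len(self.alpha_bars))
        eps = self.net(torch.cat([noisy_expr, audio_feat, t_feat], dim=-1))
        return eps + self.adapter if use_adapter else eps


@torch.no_grad()
def reconstruction_error(denoiser, expr, audio_feat, t, use_adapter):
    """DDPM noise-prediction error for one clip at a fixed noise level t."""
    noise = torch.randn_like(expr)
    alpha_bar = denoiser.alpha_bar(t)
    noisy = alpha_bar.sqrt() * expr + (1.0 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, audio_feat, use_adapter=use_adapter)
    return torch.mean((pred - noise) ** 2)


@torch.no_grad()
def authentication_score(denoiser, expr, audio_feat, t=torch.tensor(250)):
    """Ratio of reconstruction errors with vs. without the identity adapter.

    A genuine clip of the personalized subject should be reconstructed better
    (lower error) when the adapter is active, pushing the ratio down; a deepfake
    of a different identity gains little from the adapter, pushing it up.
    """
    err_personal = reconstruction_error(denoiser, expr, audio_feat, t, use_adapter=True)
    err_generic = reconstruction_error(denoiser, expr, audio_feat, t, use_adapter=False)
    return (err_personal / err_generic).item()


if __name__ == "__main__":
    denoiser = ToyExpressionDenoiser()
    expr = torch.randn(64, 50)    # 64 frames of 3DMM expression coefficients
    audio = torch.randn(64, 768)  # matching Wav2Vec 2.0 frame features
    print(authentication_score(denoiser, expr, audio))
```

In practice such a score would likely be averaged over several noise levels and clips before thresholding; the paper's exact scoring procedure may differ in these details.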
Datasets
VoxCeleb2, AVSpeech, Acappella (for training); DF-TIMIT, DFDCP, KoDF, IDForge, Sora2 Cameo Forensics Preview (S2CFP) (for evaluation).
Model(s)
EXAM (ExposeAnyone Model), a Diffusion Transformer (DiT) trained under the Denoising Diffusion Probabilistic Models (DDPM) framework, using Time- and feature-wise Linear Modulation (TiLM) for conditioning and Wav2Vec 2.0 for audio encoding. Uses FLAME/SPECTRE for the 3D face representation.
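As a rough illustration only, the sketch below shows what a TiLM-style conditioning layer could look like inside a DiT block, in the spirit of FiLM/adaLN-style modulation: a combined timestep-and-audio conditioning vector is projected to a per-feature scale and shift applied to normalized token features. The layer and argument names are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn


class TiLMLayer(nn.Module):
    """Sketch of time- and feature-wise linear modulation (names assumed, not the paper's code)."""

    def __init__(self, hidden_dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # A single projection yields a per-feature scale and shift
        # from the concatenated timestep + audio conditioning vector.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x, cond):
        """x: (B, T, hidden_dim) expression tokens; cond: (B, cond_dim) or (B, T, cond_dim)."""
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        if cond.dim() == 2:
            # Broadcast a clip-level condition over all tokens.
            scale, shift = scale.unsqueeze(1), shift.unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + shift


if __name__ == "__main__":
    layer = TiLMLayer(hidden_dim=256, cond_dim=896)  # e.g. 128-d timestep emb + 768-d audio feature
    tokens = torch.randn(2, 64, 256)
    cond = torch.randn(2, 64, 896)
    print(layer(tokens, cond).shape)  # torch.Size([2, 64, 256])
```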
Author countries
Japan, Germany