Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

Authors: Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Xinze Li, Bingyu Zhu, Wuhui Duan, Congang Chen, Zeyu Fu, Yi Dong, Baoyuan Wu, Jason Li, Guangliang Cheng

Published: 2026-05-02 22:56:17+00:00

Comment: Accepted to CVPR 2026

AI Summary

This paper introduces Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings, comprising Omni-Fake-Set (1M+ samples) and Omni-Fake-OOD (200k+ samples). On top of this benchmark, the authors propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues for joint detection, localization, and natural-language explanations. Extensive experiments demonstrate significant gains in detection accuracy, cross-modal generalization, and explainability compared to state-of-the-art baselines.

Abstract

Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/

Key findings

Omni-Fake-R1 consistently outperforms state-of-the-art baselines on Omni-Fake-Set across all four modalities (image, audio, video, AV talking head) in the detection, localization, and explanation tasks. It shows particularly strong cross-modal generalization and robustness on the Omni-Fake-OOD benchmark and under common social-media corruptions. The unified training strategy, supervised fine-tuning (SFT) followed by Group Sequence Policy Optimization (GSPO), proves crucial both for these gains and for producing coherent, interpretable outputs.

Approach

The authors propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector built upon Qwen2.5-Omni-7B. It is trained using a two-stage pipeline: curriculum supervised fine-tuning (SFT) with modal replay to incrementally introduce modalities and prevent catastrophic forgetting, followed by unified Group Sequence Policy Optimization (GSPO) reinforcement learning to optimize task-level rewards for detection, localization, and explanation across four modalities.
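
The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the modality order, the `replay_fraction` value, the clipping range `eps`, and the scalar rewards are all hypothetical placeholders, and the paper's actual reward design for detection, localization, and explanation is not reproduced here. The first function builds per-stage sampling weights in which earlier modalities are replayed at a small rate while a new one is introduced; the remaining functions compute a GSPO-style objective, which clips a length-normalized sequence-level importance ratio and weights it by group-normalized advantages.

```python
import math
from statistics import mean, pstdev


def curriculum_stages(modalities, replay_fraction=0.2):
    """Return one dict of sampling weights per curriculum stage.

    Stage i introduces modalities[i]; previously seen modalities share
    `replay_fraction` of the batch so they are not forgotten.
    Both the order and the fraction are illustrative assumptions.
    """
    stages = []
    for i, new_modality in enumerate(modalities):
        seen = modalities[:i]
        if not seen:
            weights = {new_modality: 1.0}
        else:
            weights = {new_modality: 1.0 - replay_fraction}
            for old_modality in seen:
                weights[old_modality] = replay_fraction / len(seen)
        stages.append(weights)
    return stages


def group_advantages(rewards):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    Equals (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y)), computed from
    per-token log-probabilities of one response.
    """
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))


def gspo_objective(new_logps, old_logps, rewards, eps=0.2):
    """Clipped sequence-level surrogate, averaged over the group.

    new_logps / old_logps: one list of token log-probs per response.
    rewards: one scalar task reward per response (placeholder values).
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for nl, ol, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(nl, ol)
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)
```

For example, `curriculum_stages(["image", "audio", "video", "av"])` yields four stages, the last of which gives the new modality most of the batch and splits the remainder evenly over the three earlier ones. In a real training loop the objective would be maximized with an autodiff framework; plain floats are used here only to keep the arithmetic visible.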

Datasets

Omni-Fake (comprising Omni-Fake-Set and Omni-Fake-OOD), So-Fake-Set, So-Fake-OOD, GenBuster-200K, SENORITA-2M, VideoPainter/VPBench, Multilingual LibriSpeech, PartialEdit, Common Voice, LlamaPartialSpoof, CelebV-HQ, Hallo3, HDTF, MAVOS, TalkVid/TalkVid-bench, FakeAVCeleb, TalkingHead-1KH.

Model(s)

Omni-Fake-R1 (custom reinforcement-learning-driven detector built on Qwen2.5-Omni-7B)

Author countries

UK, China, Singapore
