OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Authors: Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Published: 2026-02-11 09:41:36+00:00

Comment: 38 pages, DeepFake Detection

AI Summary

This paper introduces OmniVL-Guard, a unified framework for omnibus vision-language forgery detection and grounding, which addresses the challenges of interleaved text, images, and videos in real-world misinformation. It tackles the "difficulty bias" problem in multi-task optimization, where veracity classification overshadows fine-grained grounding, by proposing a balanced reinforcement learning approach. OmniVL-Guard achieves this through Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO) for balanced joint optimization and robust generalization.

Abstract

Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance on fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization in out-of-domain scenarios.


Key findings
OmniVL-Guard significantly outperforms state-of-the-art methods on in-domain benchmarks across diverse modalities, with substantial improvements on challenging fine-grained localization tasks (e.g., +37.79% in video temporal localization). The framework also exhibits robust zero-shot generalization in out-of-domain scenarios, suggesting it learns intrinsic forgery features rather than dataset-specific artifacts. Ablation studies confirm the necessity and effectiveness of both Self-Evolving CoT Generation and the ARSPO components in achieving balanced and enhanced performance.
Approach
The authors propose OmniVL-Guard, a balanced reinforcement learning framework that leverages Self-Evolving CoT Generation to synthesize high-quality reasoning paths, overcoming the cold-start challenge. It then employs Adaptive Reward Scaling Policy Optimization (ARSPO), which dynamically modulates reward scales and task weights to ensure balanced joint optimization, particularly for fine-grained detection and grounding tasks.
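The summary above does not reproduce ARSPO's exact formulation, but the core idea of reward-scale balancing can be sketched with a toy example. The class name `AdaptiveRewardScaler` and the use of per-task running statistics (Welford's online algorithm) are illustrative assumptions, not the authors' method: the point is simply that normalizing each task's reward by its own running scale keeps an easy task (veracity classification) from dominating the joint policy gradient over a hard one (fine-grained grounding).

```python
# Illustrative sketch only: per-task reward normalization so that tasks with
# larger reward magnitude or variance do not dominate multi-task RL updates.
# This is NOT the paper's ARSPO formula, which is not given in this summary.
from collections import defaultdict


class AdaptiveRewardScaler:
    """Track running mean/variance of rewards per task (Welford's algorithm)
    and emit standardized rewards of comparable scale across tasks."""

    def __init__(self, eps=1e-8):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations
        self.eps = eps

    def update(self, task, reward):
        # Welford's numerically stable online update of mean and variance.
        self.count[task] += 1
        delta = reward - self.mean[task]
        self.mean[task] += delta / self.count[task]
        self.m2[task] += delta * (reward - self.mean[task])

    def scale(self, task, reward):
        # Standardize by the task's running std; until we have enough samples,
        # pass the raw reward through unchanged.
        if self.count[task] < 2:
            return reward
        std = (self.m2[task] / (self.count[task] - 1)) ** 0.5
        return (reward - self.mean[task]) / (std + self.eps)


# Example: a high-scale classification reward and a low-scale grounding reward
# end up with comparable standardized magnitudes after scaling.
scaler = AdaptiveRewardScaler()
for r in [1.0, 0.9, 1.1]:
    scaler.update("classification", r)
for r in [0.1, 0.0, 0.2]:
    scaler.update("grounding", r)
print(scaler.scale("classification", 1.1))  # roughly 1.0
print(scaler.scale("grounding", 0.2))       # roughly 1.0
```

Any real ARSPO-style scheme would additionally adapt the task weights themselves during training; this sketch covers only the reward-scale half of that idea.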
Datasets
FSFR (Full-Spectrum Forensic Reasoning), FakeNewsCorpus, MCFEND, FakeClue, LOKI, ForgeryNet, GenVideo, DVF, SAMM, MDSM, DGM4, NewsCLIPpings. For OOD evaluation: ISOT, CASIA2.0, MMFakeBench, FakeSV.
Model(s)
Qwen3VL-8B (backbone), InternVL3.5-8B (ablation), Seed1.6-VL, Gemini3, ChatGPT5 (for CoT generation/verification). The RL method is based on SAPO and GRPO.
Author countries
China