PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Authors: Tuan Nguyen, Naseem Khan, Khang Tran, NhatHai Phan, Issa Khalil

Published: 2025-09-30 13:56:05+00:00

AI Summary

The paper improves multimodal LLMs (MLLMs) for deepfake detection by addressing hallucination and visual misalignment in their explanations. The authors construct DF-R5, a reasoning-annotated dataset, and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with visual evidence at the paragraph level, yielding more reliable and interpretable deepfake detection.

Abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
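
The abstract does not restate the PRPO objective. As a hedged sketch (the notation below is assumed, not taken from the paper), "paragraph-level relative" optimization can be read as a GRPO-style normalized advantage computed over per-paragraph rewards rather than over a single whole-response reward:

\hat{A}_{i,k} \;=\; \frac{r_{i,k} - \mu_{\mathcal{G}}}{\sigma_{\mathcal{G}}},
\qquad
\mu_{\mathcal{G}} = \operatorname{mean}\{r_{j,l}\}, \quad
\sigma_{\mathcal{G}} = \operatorname{std}\{r_{j,l}\},

where r_{i,k} is the reward assigned to the k-th reasoning paragraph of the i-th of G sampled responses and the statistics are pooled over all paragraphs in the group. The advantage \hat{A}_{i,k} would then weight only that paragraph's tokens in a clipped policy-gradient update, in contrast to GRPO, which assigns one response-level advantage to the entire output.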


Key findings
PRPO achieved a state-of-the-art average F1 score of 89.91% across five generative domains in deepfake detection, significantly surpassing the DX-LLaVA baseline (78.08%) and strong MLLMs like Gemini-2.5 (80.31%). Furthermore, PRPO attained the highest reasoning quality score of 4.55/5.0, demonstrating superior faithfulness and visual grounding in its explanations.
Approach
The authors fine-tune the DX-LLaVA architecture (CLIP ConvNeXT vision encoder with a Vicuna LLM) on the new DF-R5 dataset, then apply PRPO, a test-time reinforcement learning algorithm with two novel reward functions: a Visual Consistency Reward (VCR) that grounds each reasoning paragraph in the image, and a Prediction Consistency Reward (PCR) that enforces agreement between the reasoning paragraphs and the final classification. A minimal sketch of how these rewards might be combined is given below.
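
The exact reward definitions are not reproduced in this summary, so the following Python sketch is illustrative only: the scoring functions vcr_judge and pcr_judge, the weights w_vcr and w_pcr, and the Rollout fields are assumptions, not the paper's implementation.

from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List

@dataclass
class Rollout:
    paragraphs: List[str]   # reasoning text split into paragraphs
    prediction: str         # final label emitted by the model, e.g. "real" / "fake"

def paragraph_rewards(
    rollout: Rollout,
    image: object,
    vcr_judge: Callable[[str, object], float],  # hypothetical visual-grounding scorer in [0, 1]
    pcr_judge: Callable[[str, str], float],     # hypothetical reasoning/prediction agreement scorer in [0, 1]
    w_vcr: float = 0.7,                         # assumed weights, not taken from the paper
    w_pcr: float = 0.3,
) -> List[float]:
    """Combine a Visual Consistency Reward and a Prediction Consistency Reward
    into one scalar reward per reasoning paragraph."""
    return [
        w_vcr * vcr_judge(p, image) + w_pcr * pcr_judge(p, rollout.prediction)
        for p in rollout.paragraphs
    ]

def paragraph_level_advantages(group_rewards: List[List[float]]) -> List[List[float]]:
    """GRPO-style normalization applied at paragraph granularity: every paragraph
    reward in a group of sampled rollouts is standardized against the pooled
    group statistics instead of a single whole-response reward."""
    pooled = [r for rollout in group_rewards for r in rollout]
    mu = mean(pooled)
    sigma = pstdev(pooled) or 1.0   # guard against a zero-variance group
    return [[(r - mu) / sigma for r in rollout] for rollout in group_rewards]

Under these assumptions, each paragraph's advantage would weight only its own tokens in a PPO-style clipped update, so a paragraph that contradicts the image or the final verdict is penalized individually rather than dragging down the entire response.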
Datasets
DF-R5 (newly introduced, 115k images), DF40 (base images, covering DDIM, PixArt-α, SD-2.1, SiT, StyleGAN3), FaceForensics++, CelebDF.
Model(s)
DX-LLaVA (based on LLaVA), CLIP ConvNeXT (vision encoder), Vicuna (language model), PRPO (reinforcement learning algorithm).
Author countries
Qatar, USA