Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Authors: Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall

Published: 2026-02-23 11:08:46+00:00

Comment: CVPR-2026, Code is available here: https://github.com/KjAeRsTuIsK/DeepfakeJudge

AI Summary

The paper introduces DeepfakeJudge, a framework for scalable reasoning supervision and evaluation in deepfake detection. It integrates an out-of-distribution benchmark, a human-annotated dataset with visual reasoning labels, and VLM-based evaluation models. DeepfakeJudge, optimized via a bootstrapped generator-evaluator process, achieves 96.2% accuracy and high human correlation in reasoning assessment, establishing reasoning fidelity as a quantifiable dimension for trustworthy deepfake detection.

Abstract

Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in assessing reasoning rationales without requiring explicit ground-truth rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming baselines 30x its size. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the rationales generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared with those produced by other models and datasets. All of our datasets, models, and codebase are open-sourced at https://github.com/KjAeRsTuIsK/DeepfakeJudge.


Key findings
The DeepfakeJudge framework, particularly the reasoning-bootstrapped model (DeepfakeJudge-7B), achieved 96.2% accuracy on the proposed meta-evaluation benchmark, significantly outperforming much larger baselines. The reasoning judge demonstrated very high correlation with human ratings (Pearson r = 0.95; 98.9% pairwise agreement), and its generated rationales were preferred in 70% of user-study cases. This establishes reasoning fidelity as a quantifiable and scalable dimension for evaluating and supervising interpretable deepfake detection.
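To make the two human-alignment metrics concrete, here is a minimal sketch of how Pearson correlation and pairwise agreement between judge scores and human ratings can be computed. The toy score lists are illustrative only and are not the paper's data; the function names are assumptions, not part of the released codebase.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_agreement(judge, human):
    """Fraction of sample pairs that judge and human rank in the same order
    (ties count as agreement only when both sides tie)."""
    pairs = list(combinations(range(len(judge)), 2))
    agree = sum(
        1 for i, j in pairs
        if (judge[i] - judge[j]) * (human[i] - human[j]) > 0
        or (judge[i] == judge[j] and human[i] == human[j])
    )
    return agree / len(pairs)

# Toy 1-5 quality scores for five rationales (illustrative only).
judge = [4.5, 3.0, 4.8, 2.1, 3.9]
human = [4.0, 3.2, 5.0, 2.0, 3.8]
print(round(pearson_r(judge, human), 3))   # -> 0.966
print(pairwise_agreement(judge, human))    # -> 1.0
```

A high Pearson r indicates the judge's absolute scores track human ratings, while pairwise agreement captures whether the judge prefers the same rationale a human would in a head-to-head comparison.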
Approach
The authors propose DeepfakeJudge, a framework that constructs a comprehensive image-based benchmark for out-of-distribution deepfake detection and reasoning. It scales human annotations of visual artifacts and their corresponding textual explanations into structured reasoning supervision via an iterative generator-evaluator bootstrapping process. This process then fine-tunes Vision-Language Models (VLMs) to serve as reasoning judges capable of both pointwise (scoring) and pairwise (preference) evaluation of deepfake explanations.
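The generator-evaluator bootstrapping loop described above could be sketched roughly as follows. This is a toy sketch under stated assumptions, not the paper's implementation: the generator and judge are stand-in stubs (a real system would call VLMs), and the acceptance threshold and "fine-tuning" update are invented for illustration.

```python
import random

random.seed(0)

def generate_rationale(image_id, model_state):
    """Stub generator: stands in for a VLM producing a textual rationale.
    Toy quality improves as the model state's 'skill' rises."""
    quality = min(5.0, random.gauss(model_state["skill"], 0.5))
    return {"image_id": image_id, "text": f"rationale-for-{image_id}", "quality": quality}

def judge_score(rationale):
    """Stub pointwise judge: stands in for the fine-tuned VLM evaluator."""
    return rationale["quality"]

def bootstrap(image_ids, rounds=3, threshold=3.5):
    """Iterative generator-evaluator bootstrapping (toy):
    keep judge-approved rationales as supervision, then nudge the
    generator toward them (a stand-in for fine-tuning)."""
    model_state = {"skill": 3.0}
    supervision = []
    for _ in range(rounds):
        kept = [r for r in (generate_rationale(img, model_state) for img in image_ids)
                if judge_score(r) >= threshold]       # pointwise filtering
        supervision.extend(kept)
        # Toy stand-in for fine-tuning on the filtered, judge-approved set.
        model_state["skill"] += 0.3 * len(kept) / max(1, len(image_ids))
    return supervision, model_state

data, state = bootstrap([f"img{i}" for i in range(20)])
print(len(data), round(state["skill"], 2))
```

The design point the sketch illustrates: the judge gates which generated rationales become training supervision, so each round the generator is tuned only on explanations the evaluator deems grounded, letting a small pool of human annotations scale into much larger structured supervision.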
Datasets
DeepfakeJudge (comprising DeepfakeJudge-Detect, DeepfakeJudge-Reason, DeepfakeJudge-Meta, DeepfakeJudge-Meta-Human), Open-Images V7, MultifakeVerse, SID-SET-Description, Community-Forensics, DD-VQA. Fake data is generated using Gemini, SeedDream, Gemini-Nano Banana, Flux-Kontext-Max, and Qwen-Edit-2509 models.
Model(s)
UNKNOWN
Author countries
UAE, Australia