Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Authors: Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch

Published: 2025-09-26 17:59:54+00:00

AI Summary

The paper introduces DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived deepfake traces in 3.3K AI-generated videos with bounding boxes, timestamps, and natural-language explanations. Using this benchmark, the authors train multimodal language models (LMs) to act as reward models that mimic human judgments and localizations. Their resulting 7B model substantially outperforms state-of-the-art baselines such as GPT-5 on the integrated task of identifying, grounding, and explaining fake clues.
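To make the annotation granularity concrete, below is a minimal sketch of what a single DeeptraceReward record could look like. The field names, class name, and example values are illustrative assumptions, not the paper's actual schema; only the types of information (category, explanation, bounding box, onset/offset timestamps) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class DeeptraceAnnotation:
    """One human-perceived deepfake-trace annotation (field names are hypothetical)."""
    video_id: str                     # generated video being annotated
    category: str                     # one of the 9 major deepfake-trace categories
    explanation: str                  # natural-language description of the perceived clue
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) region containing the trace
    onset_s: float                    # timestamp where the trace first appears (seconds)
    offset_s: float                   # timestamp where the trace disappears (seconds)

example = DeeptraceAnnotation(
    video_id="vid_00042",
    category="unnatural hand motion",
    explanation="Fingers merge and separate implausibly while waving.",
    bbox=(120, 80, 260, 210),
    onset_s=1.4,
    offset_s=2.9,
)
```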

Abstract

Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally-aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.


Key findings
Their best 7B reward model, fine-tuned on DeeptraceReward, achieved 70.2% overall performance, surpassing GPT-5 by 34.7% on average in deepfake trace detection. The authors observed a consistent difficulty gradient: binary fake vs. real classification is substantially easier (99.4% accuracy) than fine-grained trace detection (70.2% overall). Within detection, difficulty increases from natural-language explanation (easiest), to spatial grounding, to temporal localization (the hardest criterion for all models).
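The scoring code is not given in this summary; the sketch below only illustrates how the spatial and temporal sub-tasks could plausibly be scored, using box IoU and an analogous overlap measure for time spans. The function names, example values, and the 0.5 thresholds are assumptions, not the paper's evaluation protocol.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def temporal_iou(pred, gold):
    """Intersection-over-union of two (onset, offset) spans in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

# A predicted clue might count as correct only if both overlaps clear a threshold.
pred_box, gold_box = (118, 75, 255, 215), (120, 80, 260, 210)
pred_span, gold_span = (1.2, 3.0), (1.4, 2.9)
spatial_ok = box_iou(pred_box, gold_box) >= 0.5      # threshold is an assumption
temporal_ok = temporal_iou(pred_span, gold_span) >= 0.5
print(spatial_ok, temporal_ok)
```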
Approach
The researchers created the DeeptraceReward dataset by collecting high-quality generated videos and employing expert annotators to provide fine-grained spatiotemporal annotations of perceived deepfake traces, categorized into nine major types. They then applied supervised fine-tuning (SFT) to two base multimodal LLMs (VideoLLaMA 3 and Qwen 2.5 VL) to train a dedicated reward model that performs binary classification, spatial localization (bounding box), temporal localization (timestamps), and natural-language explanation.
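As a rough illustration of how such SFT data might be assembled, the sketch below turns one annotation into a prompt/target pair. The instruction wording, dictionary keys, and output serialization are hypothetical; the authors' actual templates and training pipeline are not described in this summary.

```python
def build_sft_example(video_path, ann):
    """Turn one annotation dict into a prompt/target pair for supervised fine-tuning.

    The prompt and target wording below is illustrative, not the authors' template.
    """
    prompt = (
        "Watch the video and decide whether it is AI-generated. "
        "If it is fake, report the clue category, a bounding box, the onset/offset "
        "timestamps, and a short explanation."
    )
    target = (
        "Verdict: fake\n"
        f"Category: {ann['category']}\n"
        f"Bounding box: {ann['bbox']}\n"
        f"Timestamps: {ann['onset_s']:.1f}s - {ann['offset_s']:.1f}s\n"
        f"Explanation: {ann['explanation']}"
    )
    return {"video": video_path, "prompt": prompt, "target": target}

annotation = {
    "category": "unnatural hand motion",
    "bbox": [120, 80, 260, 210],
    "onset_s": 1.4,
    "offset_s": 2.9,
    "explanation": "Fingers merge and separate implausibly while waving.",
}
print(build_sft_example("videos/vid_00042.mp4", annotation)["target"])
```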
Datasets
DeeptraceReward, LLaVA-Video-178K
Model(s)
VideoLLaMA 3 (7B), Qwen 2.5 VL (7B), GPT-5 (baseline)
Author countries
USA