AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Authors: Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang

Published: 2025-05-21 06:43:34+00:00

AI Summary

AvatarShield introduces a novel human-centric synthetic video detection framework that leverages Visual Reinforcement Learning (VRL) and Group Relative Policy Optimization (GRPO) to train Large Language Models (LLMs) for interpretable fake detection without dense textual supervision. The framework features a dual-encoder architecture, combining a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. It also presents FakeHumanVid, a new large-scale benchmark for human-centric synthetic video detection.

Abstract

Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.


Key findings
AvatarShield achieves state-of-the-art performance, significantly outperforming existing methods in both in-domain and cross-domain synthetic video detection. It generalizes strongly to unseen generative methods, which the authors attribute to GRPO optimization and the dual-encoder architecture with its temporal compensation reward. Ablation studies confirm that the residual extractor, the temporal compensation reward, and the GRPO strategy each contribute substantially to overall performance.
Approach
The AvatarShield framework employs a dual-encoder architecture: a semantic extractor (ViT) for high-level temporal dynamics and a residual extractor (ViT on VQ-VAE residuals) for low-level artifacts. These visual features, along with text prompts, are fed into an LLM which is optimized using Group Relative Policy Optimization (GRPO). GRPO uses a suite of reward functions, including detection accuracy, temporal compensation, length, and format rewards, to guide the LLM's reasoning without requiring extensive textual annotations.
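The reward design above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the reward weights, the length band, the `<answer>` tag format, and the frame-coverage proxy for the temporal compensation reward are all assumptions; only the four reward names and the group-relative normalization used by GRPO come from the summary.

```python
# Hypothetical sketch of GRPO-style rewards and group-relative advantages.
# Weights, thresholds, and the <answer>-tag format are assumptions.
from statistics import mean, pstdev

def total_reward(pred_label, true_label, response_text,
                 covered_frames, total_frames,
                 min_len=32, max_len=512):
    """Combine the four reward terms named in the paper: detection
    accuracy, temporal compensation, length, and format."""
    r_acc = 1.0 if pred_label == true_label else 0.0
    # Temporal compensation: reward reasoning that covers more of the video.
    r_temp = covered_frames / total_frames if total_frames else 0.0
    # Length reward: response within a target word-count band.
    r_len = 1.0 if min_len <= len(response_text.split()) <= max_len else 0.0
    # Format reward: final answer wrapped in the expected tags.
    r_fmt = 1.0 if "<answer>" in response_text and "</answer>" in response_text else 0.0
    return r_acc + 0.5 * r_temp + 0.25 * r_len + 0.25 * r_fmt

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO scores each sampled response relative to its group:
    A_i = (r_i - mean(r)) / (std(r) + eps), so no learned value
    model or dense textual supervision is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are normalized within each sampled group, the LLM can be trained from simple binary real/fake labels: responses that reach the correct label (and satisfy the auxiliary rewards) are pushed up relative to their group, without any per-example reasoning annotations.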
Datasets
FakeHumanVid (newly introduced, containing 15K real and synthetic videos generated by Kling, Hailuo, Wanx, CogVideo, StableAnimator, MimicMotion, ControlNeXt, Hallo3, HelloMeme), constructed using videos from TikTok and HDTF datasets.
Model(s)
Qwen2.5-VL-7B (multimodal large language model backbone), Vision Transformer (ViT) encoders for semantic and residual feature extraction, and a VQ-VAE for computing frame residuals.
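The residual branch presumably operates on the difference between each frame and its VQ-VAE reconstruction, since reconstruction error concentrates the fine-grained artifacts that a semantic encoder smooths over. The sketch below uses a coarse-quantization stand-in for the VQ-VAE round trip (the real model uses a learned codebook); both the function names and the quantizer are illustrative assumptions.

```python
# Hypothetical sketch of the low-level residual input, assuming the
# residual is frame minus VQ-VAE reconstruction. The quantizer below
# is a stand-in for the learned VQ-VAE codebook.
import numpy as np

def quantize_reconstruct(frame, levels=16):
    """Stand-in for a VQ-VAE encode/decode round trip: uniform
    quantization discards fine detail, as a learned codebook would."""
    step = 1.0 / levels
    return np.round(frame / step) * step

def frame_residual(frame):
    """Residual fed to the artifact extractor: the original frame
    minus its (stand-in) reconstruction, isolating high-frequency
    detail where generation artifacts tend to live."""
    return frame - quantize_reconstruct(frame)
```

In the full framework, these per-frame residuals would then be encoded by the residual ViT, while the raw frames go through the semantic ViT, and both feature streams are passed to the LLM.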
Author countries
China