LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection

Authors: Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang

Published: 2025-02-20 19:34:58+00:00

AI Summary

This paper proposes LAVID, a novel agentic framework for detecting diffusion-generated videos using Large Vision Language Models (LVLMs). LAVID enhances LVLMs' capabilities through explicit knowledge extraction via external tools and improves reasoning by adaptively adjusting structured prompts with a self-rewriting mechanism. The method is fully training-free and significantly outperforms existing baselines.

Abstract

The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Detection of AI-generated content has been widely studied in the image domain (e.g., deepfakes), yet the video domain remains underexplored. Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection thanks to their strong reasoning and multimodal capabilities, overcoming limitations of traditional deep-learning-based methods such as lack of transparency and inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our insights are as follows: (1) leading LVLMs can call external tools to extract information useful for their own video detection task; (2) structuring the prompt affects an LVLM's ability to interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection and then adaptively adjusts the structured prompt via self-rewriting. Unlike prior SOTA methods that train additional detectors, our method is fully training-free and requires only LVLM inference for detection. To facilitate our research, we also create a new benchmark, VidForensic, with high-quality videos generated from multiple video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2% to 30.2% over the top baselines on our datasets across four SOTA LVLMs.


Key findings
LAVID consistently improves F1 scores by 6.2% to 30.2% over the top baselines across four state-of-the-art LVLMs on the high-quality VidForensic dataset. The framework demonstrates that structured prompts significantly enhance LVLMs' visual reasoning and mitigate hallucination. Moreover, LAVID outperforms traditional supervised learning methods (SVM, XGBoost) and achieves competitive results in deepfake detection, highlighting its generalizability and training-free advantage.
Approach
LAVID tackles diffusion-generated video detection by leveraging LVLMs, enabling them to call external tools that extract explicit knowledge (EK) such as optical flow and depth maps from videos. It automatically selects the most useful EK tools for each LVLM based on a defined metric and adaptively adjusts structured prompts through online adaptation and self-rewriting, thereby enhancing reasoning and mitigating hallucination.
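The pipeline above can be sketched as a simple agentic loop. This is a minimal illustration, not the authors' implementation: the tool functions, the utility scores, the selection threshold, and the `lvlm_detect` call are all hypothetical placeholders standing in for real EK extractors, the paper's tool-selection metric, and an actual LVLM inference API.

```python
# Hypothetical sketch of LAVID's agentic loop. All names and numbers below
# are illustrative placeholders, not the paper's actual components.

def extract_optical_flow(video):
    """Stand-in for a real optical-flow EK tool."""
    return f"optical-flow({video})"

def extract_depth_map(video):
    """Stand-in for a real depth-estimation EK tool."""
    return f"depth-map({video})"

EK_TOOLS = {"optical_flow": extract_optical_flow, "depth": extract_depth_map}

def tool_utility(tool_name, lvlm):
    """Placeholder for the paper's metric scoring how useful a tool is
    for a given LVLM (here: fixed dummy scores)."""
    return {"optical_flow": 0.8, "depth": 0.6}[tool_name]

def select_tools(lvlm, threshold=0.7):
    """Keep only EK tools whose utility for this LVLM clears a threshold."""
    return [name for name in EK_TOOLS if tool_utility(name, lvlm) >= threshold]

def lvlm_detect(prompt, evidence):
    """Stand-in for an LVLM inference call; returns (verdict, confidence)."""
    return ("fake" if evidence else "real", 0.9 if evidence else 0.4)

def lavid_detect(video, lvlm="gpt-4o", max_rewrites=3):
    # 1) Select EK tools for this LVLM and extract evidence from the video.
    tools = select_tools(lvlm)
    evidence = [EK_TOOLS[t](video) for t in tools]
    # 2) Query the LVLM; if it is not confident, self-rewrite the prompt
    #    and retry (the adaptive structured-prompt step).
    prompt = "Inspect the video and the evidence below. Answer real or fake."
    verdict, conf = lvlm_detect(prompt, evidence)
    for _ in range(max_rewrites):
        if conf >= 0.8:
            break
        prompt += " Reason step by step about each piece of evidence."
        verdict, conf = lvlm_detect(prompt, evidence)
    return verdict
```

Note that the whole loop is training-free: only the (mocked) LVLM inference and the EK extractors run at detection time, matching the paper's stated advantage over methods that train additional detectors.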
Datasets
VidForensic (a new benchmark created by the authors, comprising videos from PANDA-70M, VidProM, Text2Video-Zero, VideoCrafter2, ModelScope, Pika, self-collected YouTube videos, SORA, OpenSORA, Kling, and Runway-Gen3), Celeb-DF-v1.
Model(s)
LLaVA-OV-7B, Qwen-VL-Max, Gemini-1.5-Pro, GPT-4o (Large Vision Language Models); SVM and XGBoost (for comparison).
Author countries
United States