EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Authors: Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang

Published: 2025-10-18 10:34:05+00:00

AI Summary

This paper introduces EDVD-LLaMA, a novel multimodal large language model (MLLM) reasoning framework for Explainable Deepfake Video Detection (EDVD), which aims to provide accurate detection results alongside traceable reasoning explanations. The approach incorporates Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract rich spatio-temporal deepfake features and utilizes a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism to constrain reasoning using fine-grained facial metrics, thereby suppressing hallucinated outputs. EDVD-LLaMA demonstrates superior performance and robustness in detection accuracy and generalization capabilities compared to existing deepfake video detection methods and MLLMs.

Abstract

The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) module to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information as input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be made publicly available.
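For intuition only, the PyTorch sketch below shows one plausible way the fusion of global and local cross-frame features described above could be realized with cross-attention. It is a minimal illustration, not the paper's implementation; the module name, token counts, and dimensions are assumptions.

```python
# Illustrative sketch of ST-SIT-style fusion: local tokens (fine-grained,
# cross-frame forgery cues) attend to global tokens (scene-level context)
# via cross-attention, yielding fused visual tokens for the MLLM.
# Dimensions and names are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # map fused tokens to the LLM input space

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # local_tokens:  (B, N_local, dim)   global_tokens: (B, N_global, dim)
        fused, _ = self.attn(query=local_tokens, key=global_tokens, value=global_tokens)
        fused = self.norm(local_tokens + fused)  # residual connection
        return self.proj(fused)                  # tokens passed to the MLLM

# Toy usage: one clip produces 64 local and 32 global tokens per sample.
fusion = CrossAttentionFusion()
local = torch.randn(2, 64, 1024)
global_ = torch.randn(2, 32, 1024)
visual_tokens = fusion(local, global_)
print(visual_tokens.shape)  # torch.Size([2, 64, 1024])
```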


Key findings
EDVD-LLaMA achieved outstanding detection accuracy (84.75% Acc. and 87.64% AUC on ER-FF++set), significantly surpassing comparable MLLMs, and demonstrated superior generalization in cross-forgery and cross-dataset evaluations. Ablation studies confirmed that both the ST-SIT feature fusion module and the Fg-MCoT framework with structured facial constraints are critical to maintaining high performance and generating reliable, explainable rationales.
Approach
The method uses Spatio-Temporal Subtle Information Tokenization (ST-SIT), which combines a DSEncoder (Swin Transformer-based) for local features with a SigLip encoder and a Compact Visual Connector (CVC) for global features, fusing the two streams via cross-attention. The fused visual tokens are fed into the MLLM (Qwen2.5-7B) through a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces structured facial landmark data (Mc, MΔ) as hard constraints to guide metric-grounded rationale generation and the subsequent detection decision.
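As a rough illustration of how structured facial metrics could act as hard constraints on the chain of thought, the sketch below assembles a constraint-grounded prompt for the MLLM. The metric names, JSON schema, and step wording are hypothetical and not taken from the paper.

```python
# Hypothetical Fg-MCoT-style prompt construction: per-frame facial metrics are
# serialized and embedded as hard constraints that the model's rationale must
# cite before it reaches a REAL/FAKE decision. Field names are illustrative.
import json

def build_fg_mcot_prompt(frame_metrics: list[dict]) -> str:
    """frame_metrics: per-frame facial measurements (illustrative keys only)."""
    constraints = json.dumps(frame_metrics, indent=2)
    return (
        "You are a deepfake video analyst.\n"
        "Facial metrics extracted from the sampled frames (hard constraints):\n"
        f"{constraints}\n"
        "Step 1: Cite the specific metrics and frame indices that look anomalous.\n"
        "Step 2: Explain the spatio-temporal inconsistency each anomaly implies.\n"
        "Step 3: Conclude REAL or FAKE, referencing only the metrics above."
    )

# Usage: two sampled frames with toy measurements.
prompt = build_fg_mcot_prompt([
    {"frame": 12, "mouth_aspect_ratio": 0.41, "landmark_jitter": 2.7},
    {"frame": 13, "mouth_aspect_ratio": 0.12, "landmark_jitter": 9.8},
])
print(prompt)
```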
Datasets
ER-FF++set (Explainable Reasoning FF++ benchmark dataset, constructed from FaceForensics++), Celeb-DF, WildDeepfake.
Model(s)
Qwen2.5-7B (Large Language Model), Swin Transformer (in DSEncoder), SigLip (Encoder), Compact Visual Connector (CVC).
Author countries
Hong Kong SAR, Singapore, China