FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics

Authors: Chen Chen, Runze Li, Zejun Zhang, Pukun Zhao, Fanqing Zhou, Longxiang Wang, Haojian Huang

Published: 2025-08-20 10:03:31+00:00

AI Summary

FakeHunter is a multimodal deepfake detection framework that uses memory-guided retrieval, chain-of-thought reasoning, and tool-augmented verification for accurate and interpretable video forensics. On the new X-AVFake benchmark it achieves 34.75% accuracy, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points.

Abstract

FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system iteratively reasons over evidence to localize manipulations and explain them. When confidence is low, it automatically invokes specialized tools, such as zoom-in image forensics or mel-spectrogram inspection, for fine-grained verification. Built on Qwen2.5-Omni-7B, FakeHunter produces structured JSON verdicts that specify what was modified, where it occurs, and why it is judged fake. We also introduce X-AVFake, a benchmark comprising 5.7k+ manipulated and real videos (950+ min) annotated with manipulation type, region/entity, violated reasoning category, and free-form justification. On X-AVFake, FakeHunter achieves an accuracy of 34.75%, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Ablation studies reveal that memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection improves low-confidence cases to 46.50%. Despite its multi-stage design, the pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800 (0.8x real-time) or 2 minutes on four GPUs (0.2x), demonstrating practical deployability.
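The abstract states that verdicts are emitted as structured JSON specifying what was modified, where it occurs, and why it is judged fake. The snippet below is a minimal sketch of one plausible verdict shape; the field names and values are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative only: one plausible shape for a FakeHunter-style structured
# JSON verdict ("what was modified, where it occurs, and why"). The real
# field names are not specified in this summary and are assumed here.
import json

verdict = {
    "label": "fake",
    "confidence": 0.82,
    "manipulation_type": "audio replacement",          # what was modified
    "region": "speaker close-up, 00:12-00:27",         # where it occurs
    "justification": (                                  # why it is judged fake
        "Lip motion is inconsistent with the audio track; "
        "mel-spectrogram inspection shows splicing artifacts."
    ),
}
print(json.dumps(verdict, indent=2))
```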


Key findings
FakeHunter outperforms baselines such as the vanilla Qwen2.5-Omni-7B (by 16.87 percentage points) and MiniCPM-2.6 (by 25.56 percentage points) on the X-AVFake benchmark. Memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection raises accuracy on low-confidence cases to 46.50%. The pipeline is practically deployable, processing a 10-minute clip in 8 minutes on a single NVIDIA A800 GPU or 2 minutes on four GPUs.
Approach
FakeHunter encodes video frames with CLIP and audio with CLAP, fuses them into joint audio-visual embeddings, and retrieves semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by this context, it applies Observation-Thought-Action chain-of-thought reasoning to localize and explain manipulations, invoking specialized tools such as zoom-in image forensics or mel-spectrogram inspection when confidence is low (see the retrieval sketch below).
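The retrieval step can be sketched as follows. This is a minimal illustration assuming Hugging Face CLIP/CLAP checkpoints and a flat FAISS inner-product index; the paper's exact checkpoints, fusion scheme, and index configuration are not specified in this summary, so the concatenation-based fusion below is an assumption.

```python
# Sketch of memory-bank retrieval with CLIP (visual) + CLAP (audio) + FAISS.
# Assumes Hugging Face `transformers` checkpoints; FakeHunter's actual models
# and fusion scheme may differ.
import numpy as np
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ClapModel, ClapProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def embed_frames(frames: list[Image.Image]) -> np.ndarray:
    """Average CLIP image embeddings over sampled video frames."""
    inputs = clip_proc(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats.mean(dim=0).numpy()

def embed_audio(waveform: np.ndarray, sr: int = 48_000) -> np.ndarray:
    """CLAP audio embedding for a mono waveform."""
    inputs = clap_proc(audios=waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        feats = clap.get_audio_features(**inputs)
    return feats.squeeze(0).numpy()

def joint_embedding(frames, waveform) -> np.ndarray:
    """Concatenate L2-normalized visual and audio embeddings.
    (One plausible fusion; the exact scheme is not given in this summary.)"""
    v, a = embed_frames(frames), embed_audio(waveform)
    v /= np.linalg.norm(v)
    a /= np.linalg.norm(a)
    return np.concatenate([v, a]).astype("float32")

# Memory bank of real exemplars, indexed with FAISS for nearest-neighbour lookup.
dim = 512 + 512  # CLIP ViT-B/32 and CLAP htsat-unfused both project to 512-d
index = faiss.IndexFlatIP(dim)
# index.add(np.stack([joint_embedding(f, w) for f, w in real_exemplars]))

def retrieve_exemplars(query: np.ndarray, k: int = 5):
    """Return indices and scores of the k most similar real exemplars,
    which serve as contextual grounding for the reasoning stage."""
    scores, ids = index.search(query[None, :], k)
    return ids[0], scores[0]
```

The retrieved exemplars are then supplied as context to the Qwen2.5-Omni-7B backbone, which performs the Observation-Thought-Action reasoning loop described above.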
Datasets
X-AVFake (a new benchmark with 5.7k+ manipulated and real videos, annotated with manipulation type, region/entity, violated reasoning category, and free-form justification)
Model(s)
Qwen2.5-Omni-7B (primarily), CLIP (for visual encoding), CLAP (for audio encoding), Grounded SAM 2 and ProPainter (for video manipulation), Seeing-and-Hearing model (for audio manipulation)
Author countries
China, USA, Hong Kong