Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline

Authors: Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu

Published: 2025-11-17 14:49:57+00:00

AI Summary

Foresee is a novel, training-free pipeline that unlocks the inherent potential of vanilla Multimodal Large Language Models (MLLMs) for Image Forgery Detection and Localization (IFDL). It uses a type-prior-driven strategy and a Flexible Feature Detector (FFD) specifically for copy-move manipulations to enhance detection and localization accuracy. The approach delivers superior localization accuracy and richer textual explanations while eliminating the need for expensive, large-scale training.

Abstract

With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. MLLMs now demonstrate strong generalization across diverse vision-language tasks, and some studies bring this capability to IFDL via large-scale training. However, such approaches incur considerable computational cost while failing to reveal the inherent generalization potential of vanilla MLLMs for this problem. Motivated by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.


Key findings
The training-free Foresee pipeline achieves superior forgery localization accuracy, often surpassing the average performance of established trained methods across diverse manipulation types. The method generalizes well across editing, deepfake, and AIGC-editing datasets, demonstrating the intrinsic forensic potential of vanilla MLLMs. Furthermore, Foresee generates high-quality, comprehensive textual explanations, achieving the best scores on explanation metrics such as accuracy, detail, and hallucination compared to existing explainable models.
Approach
The training-free pipeline utilizes vanilla MLLMs guided by a chain-of-thought paradigm, starting with a type-prior-driven strategy to classify tampering (copy-move, deepfake, AIGC, others) and select category-specific prompts. For copy-move forgeries, a Flexible Feature Detector (FFD) generates a hint image to aid detection. The MLLM then generates a textual description of the tampered area, which guides GroundingDINO and SAM2 for precise segmentation and localization.
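The staged flow described above can be sketched as plain orchestration code. This is a minimal illustration only: every function below is a hypothetical placeholder (the paper does not publish an API, and the real stages call an MLLM, the FFD module, GroundingDINO, and SAM2), but the control flow mirrors the described pipeline: classify the tampering type, generate an FFD hint image only for copy-move, prompt the MLLM with a category-specific prompt, then use its textual description to drive grounding and segmentation.

```python
# Hedged sketch of a Foresee-style training-free IFDL pipeline.
# All function bodies are stubs standing in for model calls; names and
# prompts are illustrative assumptions, not the authors' actual code.

# Stage 1 output selects a category-specific prompt (type-prior-driven strategy).
TYPE_PROMPTS = {
    "copy-move": "Locate and describe the duplicated region in this image.",
    "deepfake": "Locate and describe manipulated facial regions.",
    "aigc": "Locate and describe regions showing AI-generation artifacts.",
    "others": "Locate and describe the tampered region.",
}


def classify_tamper_type(image):
    """Stage 1: MLLM classifies the tampering type (stubbed)."""
    return "copy-move"


def ffd_hint(image):
    """Stage 2: Flexible Feature Detector. For copy-move forgeries,
    produce a hint image highlighting matched duplicated regions (stubbed;
    in practice this would come from feature matching on the image)."""
    return {"hint": "highlighted duplicate-region pair"}


def mllm_describe(image, prompt, hint=None):
    """Stage 3: MLLM produces a textual description of the tampered area,
    optionally conditioned on the FFD hint image (stubbed)."""
    return "a duplicated patch of foliage in the lower-left corner"


def ground_and_segment(image, description):
    """Stage 4: the description guides open-vocabulary detection and
    segmentation (GroundingDINO + SAM2 in the paper; stubbed here)."""
    return {"box": (10, 20, 60, 80), "mask": "binary-mask-placeholder"}


def foresee_pipeline(image):
    tamper_type = classify_tamper_type(image)
    # The FFD hint is only generated for copy-move manipulations.
    hint = ffd_hint(image) if tamper_type == "copy-move" else None
    description = mllm_describe(image, TYPE_PROMPTS[tamper_type], hint)
    localization = ground_and_segment(image, description)
    return {
        "type": tamper_type,
        "description": description,
        "localization": localization,
    }


if __name__ == "__main__":
    print(foresee_pipeline(image=None))
```

Because every stage is a black-box model call, the pipeline needs no training: the only design decisions are the type prior, the per-category prompts, and the hand-off of the MLLM's text to the grounding/segmentation models.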
Datasets
CASIA1+, Columbia, Coverage, NIST16, IMD2020, FaceApp, OpenForensics
Model(s)
GPT-5, Gemini 2.5 Pro, GroundingDINO, SAM2, Qwen3-VL, Claude Sonnet 4
Author countries
China