Venus-DeFakerOne: Unified Fake Image Detection & Localization

Authors: GuangJian Team

Published: 2026-05-13 20:20:33+00:00

AI Summary

This paper introduces DeFakerOne, a data-centric, unified foundation model for Fake Image Detection and Localization (FIDL). It integrates InternVL2 and SAM2 to simultaneously perform image-level authenticity detection and pixel-level forgery localization across diverse forgery scenarios. DeFakerOne achieves state-of-the-art performance on numerous detection and localization benchmarks, demonstrating robust generalization against both real-world perturbations and advanced generative models like GPT-Image-2.

Abstract

In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding how artifacts transfer and interfere across domains, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifact transfer and interference patterns, the necessity of fine-grained supervision, and the preservation of original-resolution artifacts, highlighting the design principles for scalable, robust, and unified FIDL.


Key findings
DeFakerOne achieves state-of-the-art results on 39 forgery detection benchmarks and 9 localization benchmarks, including strong performance on GPT-Image-2-generated images and superior robustness to common image perturbations. The analysis indicates that effective unified FIDL requires deliberate balancing of data composition (not just scaling), awareness of operation-level artifacts, multi-granularity supervision (especially pixel-level supervision for local manipulations), and visual backbones that preserve original-resolution artifacts.
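To make "balancing data composition, not just scaling" concrete, the following is a minimal sketch of one common way to rebalance sampling across domains: exponent-scaled sampling weights. This is an illustrative assumption, not the balancing scheme used by DeFakerOne, which this summary does not specify. The function name `sampling_weights`, the exponent `alpha`, and the per-domain counts are all hypothetical.

```python
from typing import Dict


def sampling_weights(counts: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Return per-domain sampling probabilities proportional to count ** alpha.

    alpha = 1.0 samples in proportion to raw dataset size (pure scaling);
    alpha = 0.0 samples every domain equally often (full balancing);
    values in between trade off the two.
    """
    scaled = {domain: n ** alpha for domain, n in counts.items()}
    total = sum(scaled.values())
    return {domain: s / total for domain, s in scaled.items()}


# Hypothetical, made-up domain sizes used only to exercise the function.
# They are NOT the real composition of the 12.5M-sample training set.
toy_counts = {"DeepFake": 4000, "AIGC": 3000, "Document": 2000, "Nature": 1000}

proportional = sampling_weights(toy_counts, alpha=1.0)
uniform = sampling_weights(toy_counts, alpha=0.0)
tempered = sampling_weights(toy_counts, alpha=0.5)
```

With `alpha` below 1, smaller domains are drawn more often than their raw share of the data, which is one simple lever for controlling composition independently of total dataset size.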
Approach
DeFakerOne employs a cascaded architecture consisting of an InternVL2-based Multimodal Large Language Model (MLLM) for coarse-grained image-level detection and a SAM2-based module for fine-grained pixel-level localization. The MLLM, guided by dynamic VQA templates, provides authenticity judgments and generates segmentation tokens that serve as prompts for the SAM2 decoder to pinpoint manipulated regions.
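The control flow of the cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `MLLMOutput`, `FIDLResult`, `build_prompt`, `cascade`, and the toy stand-in models are hypothetical names, the prompt wording is invented, and returning an empty mask when no segmentation token is produced is an assumption. No real InternVL2 or SAM2 API is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[List[int]]  # toy grayscale image as nested lists
Mask = List[List[int]]   # 1 = manipulated pixel, 0 = untouched


@dataclass
class MLLMOutput:
    is_fake: bool                      # image-level authenticity judgment
    seg_token: Optional[List[float]]   # segmentation-token embedding, if any


@dataclass
class FIDLResult:
    is_fake: bool
    mask: Mask


def build_prompt(domain_hint: str) -> str:
    # Hypothetical stand-in for the dynamic VQA templates.
    return (f"Is this {domain_hint} image real or fake? "
            "If it is fake, segment the manipulated region.")


def cascade(image: Image,
            mllm: Callable[[Image, str], MLLMOutput],
            mask_decoder: Callable[[Image, List[float]], Mask],
            domain_hint: str = "natural") -> FIDLResult:
    """Stage 1: the MLLM judges authenticity and may emit a segmentation token.
    Stage 2: the token prompts the mask decoder to localize the forgery."""
    out = mllm(image, build_prompt(domain_hint))
    if out.seg_token is None:
        # Assumption: no token means nothing to localize, so return an empty mask.
        mask = [[0] * len(row) for row in image]
    else:
        mask = mask_decoder(image, out.seg_token)
    return FIDLResult(out.is_fake, mask)


# Toy stand-ins so the sketch runs end to end; they are not real models.
def toy_mllm(image: Image, prompt: str) -> MLLMOutput:
    peak = max(max(row) for row in image)
    if peak > 5:
        return MLLMOutput(True, [float(peak)])
    return MLLMOutput(False, None)


def toy_decoder(image: Image, seg_token: List[float]) -> Mask:
    threshold = seg_token[0]
    return [[1 if px >= threshold else 0 for px in row] for row in image]


tampered = cascade([[0, 0], [0, 9]], toy_mllm, toy_decoder)
clean = cascade([[1, 2], [3, 4]], toy_mllm, toy_decoder)
```

The point of the sketch is the hand-off: the detector's output is both the final image-level verdict and the prompt for the localizer, so the two stages share one forward pass through the MLLM rather than running as independent models.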
Datasets
The model is trained on a curated dataset of 12.5M samples covering DeepFake, AIGC, Document, and Nature domains, including public datasets (e.g., FF++, CelebDF-v2, DiffusionForensics, GenImage, DocTamper, CASIA-v2, COCO_2017) and private real-world data. Evaluation is conducted on 39 forgery detection benchmarks, 9 localization benchmarks, OpenMMsec, and a custom GPT-Image-2-Bench.
Model(s)
DeFakerOne (built on InternVL2 and SAM2)
Author countries
China