Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Authors: Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia

Published: 2026-03-18 14:22:45+00:00

AI Summary

This paper introduces Semantic Consistent Evidence Pack (SCEP), a training-free framework for cross-domain image deepfake detection using Large Vision-Language Models (LVLMs). SCEP replaces whole-image inference with evidence-driven reasoning by mining a compact set of suspicious patch tokens. It achieves this by combining CLS-guided semantic mismatch with frequency and noise anomalies, conditioning a frozen LVLM for robust prediction without fine-tuning.

Abstract

Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
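The fused patch-scoring metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights, the min-max normalisation, and the assumption that frequency and noise cues arrive as precomputed per-patch scores are all illustrative choices.

```python
import numpy as np

def fused_suspiciousness(patch_feats, cls_feat, freq_scores, noise_scores,
                         w_sem=0.5, w_freq=0.3, w_noise=0.2):
    """Fuse CLS-guided semantic mismatch with frequency/noise anomaly cues.

    patch_feats: (N, D) patch embeddings from the frozen vision encoder.
    cls_feat:    (D,) CLS token used as the global semantic reference.
    freq_scores, noise_scores: (N,) per-patch anomaly scores.
    The fusion weights are hypothetical, not the paper's settings.
    """
    # Semantic mismatch: 1 - cosine similarity to the CLS reference.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    c = cls_feat / np.linalg.norm(cls_feat)
    mismatch = 1.0 - p @ c  # larger = less consistent with global semantics

    def minmax(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    # Normalise each cue so the weights are comparable, then combine.
    return (w_sem * minmax(mismatch)
            + w_freq * minmax(freq_scores)
            + w_noise * minmax(noise_scores))
```

A patch whose embedding disagrees with the CLS token (e.g. a tampered region inside an otherwise coherent scene) receives a higher semantic-mismatch term, which the frequency and noise cues can then reinforce.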


Key findings
SCEP consistently outperforms strong baselines across diverse image deepfake detection benchmarks without requiring LVLM fine-tuning. The framework shows substantial gains on AI-generated and AI-edited images, attributed to its ability to provide more discriminative evidence via fused semantic, frequency, and noise cues. Furthermore, SCEP reduces inference latency across various frozen LVLMs, improving efficiency alongside detection accuracy.
Approach
The SCEP framework processes an input image by feeding it into a frozen vision encoder to obtain a CLS token and patch embeddings. It then clusters these patch features into coherent semantic groups and assigns suspiciousness scores to patches within each cluster, fusing CLS-guided semantic discrepancy with frequency and noise anomalies. A compact 'evidence pack' is subsequently formed by selecting top-ranked patches per cluster and applying grid-based Non-Maximum Suppression (NMS), which then conditions a frozen LVLM for the final deepfake prediction.
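The selection stage described above (top-ranked patches per cluster, then grid-based NMS) can be sketched as below. Function name, the per-cluster budget, and the Chebyshev-distance suppression rule are illustrative assumptions; the paper's exact settings are not reproduced here.

```python
import numpy as np

def build_evidence_pack(scores, cluster_ids, grid_rc,
                        k_per_cluster=3, min_cell_dist=2):
    """Form a compact evidence pack from scored patch tokens.

    scores:      (N,) fused suspiciousness scores.
    cluster_ids: (N,) semantic cluster assignment of each patch.
    grid_rc:     (N, 2) (row, col) position of each patch on the token grid.
    min_cell_dist: Chebyshev grid distance below which a lower-scoring
                   candidate is suppressed (hypothetical threshold).
    Returns indices of the kept patches, highest score first.
    """
    # Step 1: take the top-k most suspicious patches in every cluster,
    # so dispersed manipulation traces in different regions are covered.
    candidates = []
    for cid in np.unique(cluster_ids):
        idx = np.where(cluster_ids == cid)[0]
        top = idx[np.argsort(scores[idx])[::-1][:k_per_cluster]]
        candidates.extend(top.tolist())

    # Step 2: grid-based NMS -- visit candidates by descending score and
    # drop any patch whose grid cell is too close to one already kept.
    candidates.sort(key=lambda i: -scores[i])
    kept = []
    for i in candidates:
        if all(np.max(np.abs(grid_rc[i] - grid_rc[j])) >= min_cell_dist
               for j in kept):
            kept.append(i)
    return kept
```

The per-cluster budget guarantees coverage of every semantic region, while the NMS pass removes near-duplicate neighbours, keeping the evidence pack small enough to condition the frozen LVLM cheaply.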
Datasets
DFBench, Playground, SD3.5 Large, PixArt-Sigma, Infinity, Kandinsky-3, Flux Schnell, Kolors, SD3 Medium, Flux Dev, NOVA, LaVi-Bridge, Janus, LIVE, CSIQ, TID2013, KADID, KonIQ-10k
Model(s)
UNKNOWN
Author countries
China, United Arab Emirates