INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

Authors: Anshul Bagaria

Published: 2025-11-27 11:43:50+00:00

AI Summary

INSIGHT is a unified vision-language framework designed for robust detection and transparent explanation of AI-generated images, specifically addressing performance degradation under extreme low-resolution conditions (e.g., 32×32). It integrates forensic super-resolution, multi-scale artifact localization, and CLIP-guided semantic alignment to identify subtle generative cues. A structured VLM reasoning pipeline (ReAct + Chain-of-Thought), verified by a multimodal judge, generates high-quality, trustworthy explanations for forensic decisions.
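The CLIP-guided semantic alignment step maps localized anomalies onto textual artifact descriptors by similarity scoring. A minimal sketch follows, assuming Hugging Face's CLIP implementation; the descriptor list is a hypothetical stand-in, since the paper's actual prompt set is not reproduced here.

```python
# Sketch of CLIP-based semantic alignment: score an image region against
# candidate forensic descriptors. The descriptors below are illustrative
# assumptions, not the paper's actual prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

DESCRIPTORS = [
    "unnatural texture repetition",
    "inconsistent lighting and shadows",
    "warped or melted object boundaries",
    "natural photographic detail",
]

def score_region(region: Image.Image) -> dict[str, float]:
    """Return a descriptor -> probability mapping for one suspicious region."""
    inputs = processor(text=DESCRIPTORS, images=region,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
    return dict(zip(DESCRIPTORS, probs.tolist()))
```

Regions whose strongest match is a generative-artifact descriptor rather than a natural-image descriptor would then feed the downstream VLM reasoning stage.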

Abstract

The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, which undermines trust and hinders adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16×16 to 64×64). INSIGHT combines hierarchical super-resolution to amplify subtle forensic cues without inducing misleading artifacts, Grad-CAM-driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted with a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.
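To make the localization step concrete, here is a hand-rolled Grad-CAM sketch in PyTorch; the ResNet-18 backbone and the layer4 target are stand-ins for whatever forensic classifier the paper actually uses.

```python
# Grad-CAM sketch: weight the target layer's feature maps by the spatially
# averaged gradients of the chosen logit, then upsample to image size.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # stand-in real/fake classifier
cache = {}

def fwd_hook(_module, _inputs, output):
    cache["act"] = output                                 # feature maps
    output.register_hook(lambda g: cache.update(grad=g))  # their gradients

model.layer4.register_forward_hook(fwd_hook)

def grad_cam(x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Return an (H, W) heatmap of regions driving the class_idx logit."""
    logits = model(x)
    model.zero_grad()
    logits[0, class_idx].backward()
    weights = cache["grad"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = F.relu((weights * cache["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()

# e.g. heat = grad_cam(torch.randn(1, 3, 224, 224), class_idx=1)
```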


Key findings
INSIGHT achieved competitive detection robustness across highly degraded datasets (e.g., 78.0% AUROC on CIFAKE) while delivering statistically significant improvements in explanation quality: it attained the highest mean score (3.75) across G-Eval metrics, with the largest gains in specificity and groundedness. The multimodal verification layer kept the false-support rate below 12%, ensuring factual consistency and substantially reducing hallucinated forensic claims in the final reports.
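For concreteness, the false-support rate can be read as the fraction of per-claim judge verdicts marked unsupported; the helper below is an illustrative reconstruction, not the paper's evaluation code.

```python
# Hypothetical reconstruction of the false-support-rate computation:
# the judge labels each forensic claim as supported (True) or not (False).
def false_support_rate(verdicts: list[bool]) -> float:
    """Fraction of claims the multimodal judge marks as unsupported."""
    return sum(not v for v in verdicts) / len(verdicts)

# e.g. 3 unsupported claims out of 30 -> 10%, under the reported 12% bound
assert false_support_rate([True] * 27 + [False] * 3) == 0.10
```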
Approach
The system first applies Hierarchical Forensic Super-Resolution (DRCT) to amplify subtle artifacts in low-resolution images. Grad-CAM-driven attention and SLIC superpixel anchoring then localize suspicious regions, which are semantically scored with CLIP against forensic artifact descriptors. Finally, a vision-language model is prompted with a ReAct + Chain-of-Thought structure to generate structured explanations, which an LLM-as-a-judge validates to minimize hallucinations; a minimal sketch of the anchoring step follows.
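The sketch below assumes scikit-image's slic and a Grad-CAM heatmap normalized to [0, 1]; the segment count and threshold are illustrative defaults, not the paper's settings.

```python
# Anchor a Grad-CAM heatmap to SLIC superpixels: superpixels whose mean
# activation clears a threshold become candidate regions for CLIP scoring.
import numpy as np
from skimage.segmentation import slic

def suspicious_regions(image: np.ndarray, heatmap: np.ndarray,
                       n_segments: int = 100, thresh: float = 0.6):
    """image: (H, W, 3) floats in [0, 1]; heatmap: (H, W) in [0, 1].
    Returns a list of boolean masks, one per suspicious superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=10.0)
    return [segments == lbl
            for lbl in np.unique(segments)
            if heatmap[segments == lbl].mean() > thresh]
```

Each surviving mask can be cropped out and passed to the CLIP scoring step sketched earlier.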
Datasets
ProGAN–StyleGAN, DFDC, CASIA v2, SRA Synthetic Image Benchmark (Stable Diffusion v1–v3, SDXL, DALL·E 3), CIFAKE (32×32).
Model(s)
DRCT, CLIP, CNNs, ResNets, ViTs, MOLMO, InternVL2 8B, BLIP-2, LLaVA-7B (LLM-as-a-Judge), SLIC.
Author countries
India