Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images

Authors: Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li, Yunchao Wei, Yao Zhao, Yan Lu

Published: 2025-10-11 14:09:24+00:00

AI Summary

This paper introduces the task of semantic anomaly detection and reasoning in AI-generated images, focusing on high-level inconsistencies such as physical impossibilities or commonsense violations. The authors propose AnomReason, a large-scale benchmark with structured quadruple annotations (Name, Phenomenon, Reasoning, Severity), constructed efficiently by AnomAgent, a modular multi-agent pipeline. Fine-tuning models on AnomReason yields significant gains over strong vision-language baselines in both detection and reasoning.

Abstract

The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large-scale benchmark with structured annotations as quadruples (Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi-agent pipeline (AnomAgent) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17B GPT-4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
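For concreteness, below is a minimal sketch of how one quadruple annotation might be represented as a simple record type. The field comments follow the paper's quadruple, but the numeric severity scale and the example values are illustrative assumptions, not the released AnomReason schema.

```python
# Hypothetical record for one AnomReason-style annotation.
# The four fields follow the paper's quadruple; the 0-1 severity scale
# and the example content are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class AnomalyAnnotation:
    name: str        # Name: the anomalous object or region
    phenomenon: str  # Phenomenon: the observable inconsistency
    reasoning: str   # Reasoning: why it violates physics or commonsense
    severity: float  # Severity: graded score (scale assumed here)

example = AnomalyAnnotation(
    name="left hand",
    phenomenon="the hand has six fingers",
    reasoning="human hands have five fingers, so an extra digit is anatomically implausible",
    severity=0.8,
)
```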


Key findings
Off-the-shelf VLMs struggle severely with structured semantic anomaly reasoning, often scoring below 0.42 in SemAP-Full. The fine-tuned model, AnomReasonor-7B, achieved state-of-the-art performance on the proposed semantic matching metrics (SemAP-Full = 0.5162), surpassing both open-source baselines and proprietary models such as GPT-4o in reasoning quality. The framework enables high-accuracy, interpretable deepfake detection and effective semantic reasonableness assessment of image generators.
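The SemAP/SemF1 scores rest on semantically matching predicted anomalies against ground-truth annotations. The exact matching criterion is not described here, so the following is only a rough sketch of the general recipe, using a string-similarity stand-in (difflib) where the authors presumably use a stronger semantic similarity model:

```python
# Rough sketch of a SemF1-style score: greedily match each predicted anomaly
# description to an unused ground-truth one by text similarity, then compute
# F1 over the matches. The similarity function and threshold are stand-ins;
# the actual AnomReason matching and the SemAP averaging are not shown.
from difflib import SequenceMatcher

def _sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def sem_f1(preds: list[str], gts: list[str], thresh: float = 0.6) -> float:
    matched, tp = set(), 0
    for p in preds:
        best_i, best_s = None, 0.0
        for i, g in enumerate(gts):
            if i in matched:
                continue  # enforce one-to-one matching
            s = _sim(p, g)
            if s > best_s:
                best_i, best_s = i, s
        if best_i is not None and best_s >= thresh:
            matched.add(best_i)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```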
Approach
The authors developed AnomAgent, a modular multi-agent framework powered by GPT-4o, to generate large-scale structured semantic anomaly annotations (Name, Phenomenon, Reasoning, Severity) for AIGC images, verified by human-in-the-loop screening. They then fine-tuned a vision-language model (Qwen2.5-VL-7B) on this structured data to create AnomReasonor-7B, which jointly identifies anomalies and provides grounded explanations.
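As an illustration of the fine-tuning setup, here is one way an AnomReason sample might be serialized into chat format for supervised fine-tuning of Qwen2.5-VL-7B. The prompt wording, output schema, and file path are assumptions, not the authors' released format:

```python
# Hypothetical chat-format training sample for VLM supervised fine-tuning.
# The user turn pairs an image with an instruction; the assistant turn is the
# structured quadruple target serialized as JSON. All concrete strings below
# are illustrative assumptions.
import json

sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "path/to/aigc_image.png"},
            {"type": "text", "text": (
                "List every semantic anomaly in this image as "
                "(Name, Phenomenon, Reasoning, Severity) quadruples."
            )},
        ]},
        {"role": "assistant", "content": json.dumps([{
            "name": "wall clock",
            "phenomenon": "the clock face shows thirteen hour marks",
            "reasoning": "analog clocks have exactly twelve hour marks",
            "severity": 0.7,
        }])},
    ],
}
```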
Datasets
AnomReason, AnomReason-Deepfake (constructed from Midjourney, SD3.5, Flux, and LAION/reLAION-HR)
Model(s)
Qwen2.5-VL-7B (fine-tuned into AnomReasonor-7B), Vision-Language Models (VLM), GPT-4o (for annotation pipeline)
Author countries
China