AI-Generated Image Detection: An Empirical Study and Future Research Directions

Authors: Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik

Published: 2025-11-04 18:13:48+00:00

AI Summary

This paper introduces a unified benchmarking framework for systematically evaluating AI-generated image detection methods under controlled and reproducible conditions. The study benchmarks ten state-of-the-art forensic methods across seven public datasets (GAN- and diffusion-generated) using multiple metrics and interpretability techniques such as Grad-CAM. The findings reveal substantial variability in generalization capability, underscoring the limitations of current forensic approaches and guiding future research toward more robust solutions.
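As a rough illustration of the Grad-CAM style interpretability analysis mentioned above, the sketch below computes a class-activation heatmap for a binary real/fake classifier. The ResNet-50 backbone, the choice of `layer4` as the target layer, and the "fake = class 1" convention are assumptions for illustration; they do not reflect the authors' actual models or implementation.

```python
# Minimal Grad-CAM sketch for a binary real/fake classifier.
# Assumptions (not from the paper): torchvision ResNet-50 backbone,
# layer4 as the target convolutional block, class index 1 = "fake".
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # real vs. fake head
model.eval()

activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["feat"] = output

def bwd_hook(_, __, grad_output):
    gradients["feat"] = grad_output[0]

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

def grad_cam(image, target_class=1):
    """Return an HxW heatmap for `target_class` from a (3, H, W) image tensor."""
    logits = model(image.unsqueeze(0))                 # (1, 2)
    model.zero_grad()
    logits[0, target_class].backward()
    acts = activations["feat"]                         # (1, C, h, w)
    grads = gradients["feat"]                          # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)     # channel importance
    cam = F.relu((weights * acts).sum(dim=1))          # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach()
```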

Abstract

The threats posed by AI-generated media, particularly deepfakes, raise significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in an erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods under three training protocols (scratch, frozen, and fine-tuned) across seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
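For concreteness, the sketch below shows one way the metrics named in the abstract (accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity) could be computed from per-image detection scores. The scikit-learn helpers, the 0.5 decision threshold, and the "1 = fake" label convention are assumptions for illustration, not the authors' evaluation code.

```python
# Hedged sketch: computing accuracy, AP, ROC-AUC, error rate, and
# class-wise sensitivity from per-image scores (higher score = more
# likely fake). Threshold and label convention are assumptions.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             roc_auc_score, confusion_matrix)

def evaluate(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "ACC": acc,
        "AP": average_precision_score(y_true, y_score),
        "ROC-AUC": roc_auc_score(y_true, y_score),
        "error_rate": 1.0 - acc,
        "sensitivity_fake": tp / (tp + fn),   # recall on the deepfake class
        "sensitivity_real": tn / (tn + fp),   # recall on the real class
    }

# Example usage with dummy labels/scores:
# evaluate(y_true=[0, 0, 1, 1], y_score=[0.2, 0.6, 0.4, 0.9])
```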


Key findings
Evaluations showed substantial variability in performance, with certain methods demonstrating strong in-distribution results but degraded cross-model transferability. C2PClip consistently achieved high accuracy, while generalization to diverse, unseen generative models, such as those in the MNW dataset, proved highly challenging for most methods. The study also highlighted decision-making biases among models, which often skewed predictions toward either the real or the deepfake class.
Approach
The authors established a unified benchmarking framework to systematically evaluate ten SoTA AI-generated image detection methods across seven diverse GAN and diffusion datasets. They tested each method under three training protocols (scratch, frozen, and fine-tuned) and used multiple metrics (ACC, AP, ROC-AUC) and explainability tools (Grad-CAM, confidence curves) to analyze generalization and robustness.
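A minimal sketch of how the three training protocols could be set up is shown below. The ResNet-50 backbone, ImageNet weights, learning rates, and binary head are placeholder assumptions, not the paper's configuration; the point is only to contrast random initialization (scratch), a frozen pretrained backbone with a trainable head (frozen), and full fine-tuning of a pretrained model (fine-tuned).

```python
# Hedged sketch of the scratch / frozen / fine-tuned training protocols
# on a generic backbone. Backbone choice, weights, and learning rates
# are illustrative assumptions only.
import torch
from torchvision import models

def build_detector(protocol: str):
    if protocol == "scratch":
        model = models.resnet50(weights=None)             # random init, train all layers
    else:
        model = models.resnet50(weights="IMAGENET1K_V2")   # pretrained backbone
    model.fc = torch.nn.Linear(model.fc.in_features, 2)   # real vs. fake head

    if protocol == "frozen":
        for name, p in model.named_parameters():
            if not name.startswith("fc."):
                p.requires_grad = False                    # only the head is trained
    # "fine-tuned": all parameters stay trainable, starting from pretrained weights

    params = [p for p in model.parameters() if p.requires_grad]
    lr = 1e-4 if protocol == "fine-tuned" else 1e-3        # smaller LR when fine-tuning
    optimizer = torch.optim.Adam(params, lr=lr)
    return model, optimizer

model, opt = build_detector("frozen")
```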
Datasets
ForenSyn, ForenSynthsCh, Diffusion1KStep, DIRE, GAN, UClipiffusion, MNW (Microsoft Northwestern Witness)
Model(s)
CNND, LGrad, NPR, UClip (based on CLIP), RClip (based on CLIP), FatF (FatFormer), RINE (based on CLIP), UpConv, FreqNet, C2PClip (based on CLIP)
Author countries
South Korea, USA