Deepfake Geography: Detecting AI-Generated Satellite Images

Authors: Mansur Yerzhanuly

Published: 2025-11-21 20:30:10+00:00

AI Summary

This study evaluates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated (deepfake) satellite imagery, addressing challenges unique to geospatial data. The research demonstrates that ViTs achieve significantly higher accuracy (95.11%) and robustness than CNNs, owing to their ability to model global, long-range dependencies. The authors use architecture-specific interpretability methods (Grad-CAM and attention attribution) to validate detection behaviors related to structural inconsistencies and textural patterns.

Abstract

The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer's attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT's superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.
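For the CNN side of the interpretability analysis mentioned above, the following is a minimal Grad-CAM sketch, not the authors' implementation: it assumes a torchvision ResNet-50 with a two-class head (real vs. AI-generated), hooks the last convolutional stage (layer4), and uses a random tensor as a stand-in for a preprocessed satellite image.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed setup: pretrained ResNet-50 with an (untrained, illustrative) 2-class head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)   # class 0 = real, class 1 = AI-generated
model.eval()

store = {}
model.layer4.register_forward_hook(lambda m, i, o: store.update(act=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed satellite image
fake_logit = model(x)[0, 1]                     # score for the "AI-generated" class
fake_logit.backward()

weights = store["grad"].mean(dim=(2, 3), keepdim=True)  # global-average-pool the gradients per channel
cam = torch.relu((weights * store["act"]).sum(dim=1))    # weighted activation map, shape (1, 7, 7)
cam = cam / (cam.max() + 1e-8)                           # normalize to [0, 1] before upsampling/overlay
```

The resulting 7x7 map is typically upsampled to the input resolution and overlaid on the image to show which regions drove the "AI-generated" decision.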


Key findings
The Vision Transformer (ViT-B/16) achieved a significantly higher test accuracy (95.11%) than the ResNet-50 CNN (87.02%), supporting the hypothesis that global context modeling is crucial for detecting deepfake geography. ViTs excelled at spotting macro-level structural inconsistencies and the repetitive textural patterns characteristic of synthetic imagery. Interpretability analysis confirmed that the ViT reasons over context across the entire image, whereas the CNN relied on localized texture cues.
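One simple way to probe the "global attention" behavior described above is attention rollout, a head-averaged baseline for ViT attention visualization (the abstract names Chefer's attribution method, which is a more refined relevance propagation; rollout is shown here only as an illustrative sketch). The function is self-contained and operates on per-layer attention matrices however they are extracted; the random tensors at the end stand in for ViT-B/16's 12 layers of 197x197 attention maps.

```python
import torch

def attention_rollout(attentions, residual_weight=0.5):
    """Attention rollout: compose per-layer attention maps by matrix multiplication,
    mixing in an identity term to account for the residual connections.

    attentions: list of tensors of shape (num_heads, tokens, tokens), one per layer.
    Returns a (tokens, tokens) map; row 0 is the CLS token's attention over all
    tokens, whose patch part can be reshaped to a 14x14 heatmap for ViT-B/16.
    """
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                                            # average over heads
        attn = residual_weight * attn + (1 - residual_weight) * torch.eye(tokens)
        attn = attn / attn.sum(dim=-1, keepdim=True)                       # re-normalize rows
        rollout = attn @ rollout                                           # compose with earlier layers
    return rollout

# Toy usage: random row-stochastic maps standing in for 12 layers x 12 heads,
# 1 CLS token + 14*14 patches = 197 tokens (ViT-B/16 at 224x224 input).
fake_attn = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
cls_to_patches = attention_rollout(fake_attn)[0, 1:].reshape(14, 14)
```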
Approach
The authors compare a pretrained ResNet-50 CNN and a ViT-B/16 Vision Transformer for binary deepfake classification, freezing the backbones and fine-tuning only the classification heads. The models are trained on a large curated dataset of RGB satellite images to detect artifacts such as texture repetitions and structural inconsistencies. Interpretability methods (Grad-CAM for the CNN and Chefer's attention attribution for the ViT) are applied post hoc to analyze each architecture's decision-making and validate model trustworthiness.
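A minimal sketch of this transfer-learning setup, assuming torchvision's pretrained ResNet-50 and ViT-B/16 checkpoints (the exact weights and hyperparameters used by the authors are not specified here): each backbone is frozen and only a new two-class head receives gradients.

```python
import torch
import torch.nn as nn
from torchvision import models

def frozen_backbone_classifier(name: str) -> nn.Module:
    """Load a pretrained backbone, freeze it, and attach a trainable 2-class head."""
    if name == "resnet50":
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        for p in model.parameters():
            p.requires_grad = False
        model.fc = nn.Linear(model.fc.in_features, 2)            # real vs. AI-generated
    elif name == "vit_b_16":
        model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        for p in model.parameters():
            p.requires_grad = False
        model.heads.head = nn.Linear(model.heads.head.in_features, 2)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return model

cnn = frozen_backbone_classifier("resnet50")
vit = frozen_backbone_classifier("vit_b_16")

# Only the newly added heads are trainable; optimizer and learning rate are placeholders.
opt_vit = torch.optim.Adam((p for p in vit.parameters() if p.requires_grad), lr=1e-4)
```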
Datasets
DM-AER, FSI
Model(s)
ResNet-50 (CNN), ViT-B/16 (Vision Transformer)
Author countries
UNKNOWN