DinoLizer: Learning from the Best for Generative Inpainting Localization

Authors: Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas

Published: 2025-11-25 08:37:24+00:00

AI Summary

We introduce DinoLizer, a novel approach that combines a frozen DINOv2 Vision Transformer backbone with a trained linear classification head to localize regions manipulated by generative inpainting. The method operates on patch embeddings and uses a sliding-window strategy to produce fine-grained, robust manipulation maps for images of arbitrary size. Empirical results show that DinoLizer significantly outperforms state-of-the-art localization detectors across multiple datasets and remains robust to typical post-processing operations.

Abstract

We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.


Key findings
DinoLizer achieved state-of-the-art performance in forgery localization, attaining the highest F1 score across almost all evaluated inpainting datasets. On average, the model demonstrated a 12% higher Intersection-over-Union (IoU) than the next best competing detector. Furthermore, DinoLizer exhibited superior resilience against common post-processing operations, including resizing, noise addition, and JPEG compression.
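For readers unfamiliar with the IoU metric used above, here is a minimal, self-contained sketch of how IoU is computed between a predicted binary mask and a ground-truth mask (flattened to 0/1 lists for brevity; the paper itself does not prescribe this implementation):

```python
def iou(pred, gt):
    """Intersection-over-Union between two binary masks given as flat 0/1 lists.

    IoU = |pred AND gt| / |pred OR gt|. By convention we return 1.0 when
    both masks are empty (union is zero), since the prediction is then exact.
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

# Toy example: 2 overlapping positives out of 4 positions in the union.
pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
print(iou(pred, gt))  # 2 / 4 = 0.5
```

A "12% higher IoU" therefore means the predicted manipulation masks overlap the true inpainted regions substantially more tightly than those of the next best detector.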
Approach
DinoLizer uses a frozen DINOv2-B backbone, pretrained for synthetic image detection, and adds a lightweight linear head trained on patch embeddings to predict manipulation probabilities at a $14\times 14$ resolution. It employs a sliding-window inference strategy to aggregate predictions for large images while avoiding down-sampling. The head is trained with a Dice loss that labels auto-encoded regions as pristine, so the model focuses only on semantic alterations.
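The sliding-window aggregation described above can be sketched in a few lines of pure Python. This is an illustrative approximation, not the authors' code: the hypothetical `predict_window` stands in for the frozen DINOv2 backbone plus linear head, which would return per-patch manipulation probabilities for each fixed-size crop; overlapping window predictions are averaged into a full-size heatmap.

```python
def predict_window(image, y, x, win):
    # Placeholder predictor: here it just copies the crop's values. In
    # DinoLizer this would be the linear head applied to the DINOv2 patch
    # embeddings of the crop at (y, x), yielding a win x win probability grid.
    return [[image[y + i][x + j] for j in range(win)] for i in range(win)]

def sliding_window_heatmap(image, win, stride):
    """Aggregate fixed-size window predictions into one full-size heatmap.

    Each position accumulates the predictions of every window covering it;
    overlaps are resolved by averaging. Positions never covered (possible at
    the right/bottom edges when (dim - win) is not a multiple of stride)
    are left at 0.0 in this simplified sketch.
    """
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            probs = predict_window(image, y, x, win)
            for i in range(win):
                for j in range(win):
                    acc[y + i][x + j] += probs[i][j]
                    cnt[y + i][x + j] += 1
    return [[acc[i][j] / cnt[i][j] if cnt[i][j] else 0.0 for j in range(w)]
            for i in range(h)]

# Toy run: a 4x4 "probability map" processed with 2x2 windows, stride 1.
# With the copy predictor, the averaged heatmap reproduces the input.
heatmap = sliding_window_heatmap([[0.5] * 4 for _ in range(4)], win=2, stride=1)
```

Thresholding the resulting heatmap (followed by the paper's morphological post-processing) then yields the final binary manipulation mask. Note that in the actual model the grid unit is a ViT patch, not a pixel.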
Datasets
B-Free (training and testing), Beyond the Brush (BtB), CocoGlide, TGIF, SAGI-SP, SAGI-FR.
Model(s)
DINOv2-B (Vision Transformer), Linear Classification Head, DINOv2, DINOv3.
Author countries
France