DeCLIP: Decoding CLIP representations for deepfake localization

Authors: Stefan Smeu, Elisabeta Oneata, Dan Oneata

Published: 2024-09-12 17:59:08+00:00

Comment: Accepted at Winter Conference on Applications of Computer Vision (WACV) 2025

AI Summary

DeCLIP localizes deepfake manipulations by decoding representations from large pretrained self-supervised models such as CLIP. It feeds these frozen features to a convolutional decoder, which improves generalization when detecting local manipulations, including those produced by latent diffusion models (LDMs), where the generator's fingerprint affects the entire image. The authors also report that training on LDM data yields more stable generalization than training on other categories of generative methods.

Abstract

Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.


Key findings
DeCLIP improves generalization for deepfake localization over existing methods, particularly on out-of-domain data and on the challenging case of latent diffusion models (LDMs). Training on LDM-inpainted data improved robustness and generalization to other manipulation types. The best results came from larger convolutional decoders and from combining representations from the ViT-L/14 and ResNet-50 backbones.
Approach
DeCLIP feeds frozen features from a large self-supervised image encoder (CLIP's ViT-L/14, ResNet-50, or both combined) into a learned convolutional decoder. The decoder upsamples the low-resolution feature map into a high-resolution mask that localizes the manipulated regions of the input image.
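
The data flow described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the real decoder is a learned, multi-layer convolutional network, whereas this example substitutes a single 1x1 projection followed by nearest-neighbour upsampling. All function names, the channel width, the random "features", and the number of upsampling steps are assumptions chosen only to show how a coarse patch grid becomes a full-resolution mask.

```python
import math
import random


def patch_grid_size(image_size, patch_size):
    """Patches per side for a ViT-style encoder (e.g. 224 // 14 = 16)."""
    return image_size // patch_size


def project_1x1(features, weights, bias):
    """Map each C-dim feature vector to one logit (equivalent to a 1x1 conv)."""
    return [
        [sum(w * x for w, x in zip(weights, cell)) + bias for cell in row]
        for row in features
    ]


def upsample_nearest_2x(grid):
    """Double height and width by repeating every value in a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_mask(features, weights, bias, num_upsamples, threshold=0.5):
    """Turn a low-resolution feature grid into a binary high-resolution mask."""
    logits = project_1x1(features, weights, bias)
    for _ in range(num_upsamples):
        logits = upsample_nearest_2x(logits)
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


# Toy stand-in for frozen encoder output: random vectors on a 16x16 patch grid.
rng = random.Random(0)
grid = patch_grid_size(224, 14)  # 16 patches per side
channels = 8                     # hypothetical toy width, far smaller than real CLIP features
features = [
    [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(grid)]
    for _ in range(grid)
]
weights = [rng.uniform(-1, 1) for _ in range(channels)]  # would be learned in practice

mask = decode_mask(features, weights, bias=0.0, num_upsamples=3)
print(len(mask), len(mask[0]))  # 128 128
```

In a real setup only the decoder weights would be trained, with the encoder kept frozen, and the predicted mask would be resized to the input resolution before comparison with the ground-truth manipulation mask.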
Datasets
Dolos dataset, MS COCO (COCO-SD variant), AutoSplice dataset
Model(s)
CLIP (ViT-L/14 and ResNet-50 as image encoders), Convolutional decoder
Author countries
Romania