Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

Authors: Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

Published: 2023-04-02 10:25:09+00:00

AI Summary

This paper presents a systematic study of detecting deepfakes generated by text-to-image diffusion models. It analyzes the performance of contrastive and classification-based visual features for deepfake detection and proposes a contrastive-based disentanglement method that separates semantic and perceptual cues for improved accuracy. A new dataset, COCOFake, containing about 1.2M images, is also released.

Abstract

Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and ResNet or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized by different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.


Key findings
- Contrastive-based features from models trained on image-text pairs effectively discriminate real and fake images.
- A contrastive-based disentanglement method allows for deepfake detection even when low-level perceptual cues are removed.
- The COCOFake dataset provides a benchmark for evaluating deepfake detection methods against state-of-the-art diffusion models.
Approach
The authors leverage a multimodal setting where different textual captions are used to generate multiple fake images from a single real image, forming semantic clusters. They evaluate pre-trained visual models (CLIP, ResNet, ViT) for deepfake detection and introduce a contrastive-based disentanglement method to separate low-level perceptual cues from semantic information for improved detection, especially against future, higher-quality generators.
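The evaluation idea above (a lightweight probe on frozen features, then checking what survives once low-level cues are removed) can be illustrated with a toy NumPy sketch. Everything here is a synthetic stand-in, not the authors' pipeline: the shared "fingerprint" direction is a hypothetical analogue of the common low-level cues the paper reports in generated images, and the features play the role of frozen CLIP/ResNet/ViT embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_images = 64, 200

# Hypothetical stand-ins for frozen visual embeddings: each row is the
# "semantic content" of one image-caption pair.
semantic = rng.normal(size=(n_images, d))

# A single shared direction standing in for the common low-level cue
# that generated images exhibit (exaggerated offset for illustration).
fingerprint = rng.normal(size=d)
fingerprint /= np.linalg.norm(fingerprint)

real = semantic + 0.1 * rng.normal(size=(n_images, d))
fake = semantic + 0.1 * rng.normal(size=(n_images, d)) + 4.0 * fingerprint

def nearest_centroid_accuracy(real_feats, fake_feats):
    """Fit real/fake centroids on one half, score accuracy on the other."""
    tr = n_images // 2
    c_real, c_fake = real_feats[:tr].mean(0), fake_feats[:tr].mean(0)
    test = np.vstack([real_feats[tr:], fake_feats[tr:]])
    labels = np.array([0] * (n_images - tr) + [1] * (n_images - tr))
    pred = (np.linalg.norm(test - c_fake, axis=1)
            < np.linalg.norm(test - c_real, axis=1)).astype(int)
    return (pred == labels).mean()

# Full features: the shared low-level cue makes fakes easy to separate.
acc_full = nearest_centroid_accuracy(real, fake)

# Project out the fingerprint direction, a crude analogue of removing
# low-level perceptual cues; detection must then rely on what remains.
P = np.eye(d) - np.outer(fingerprint, fingerprint)
acc_proj = nearest_centroid_accuracy(real @ P, fake @ P)

print(f"accuracy with low-level cue:   {acc_full:.2f}")
print(f"accuracy after projecting out: {acc_proj:.2f}")
```

In this toy setup all of the discriminative signal sits in the fingerprint direction, so the probe drops toward chance once it is projected away; the paper's finding is that, with their disentanglement method, detection remains possible even after such cues are removed.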
Datasets
COCO, COCOFake (a new dataset created by the authors containing ~1.2M images generated from COCO using Stable Diffusion v1.4 and v2.0)
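A back-of-the-envelope check shows the reported size is consistent with generating one fake per caption per generator. The split sizes below are assumptions (the paper only states "about 1.2M"), using the standard COCO 2017 partition and its five captions per image:

```python
# Rough size check for COCOFake under assumed COCO 2017 split sizes.
coco_train, coco_val = 118_287, 5_000  # assumed train/val image counts
captions_per_image = 5                 # standard COCO captioning annotation
generators = 2                         # Stable Diffusion v1.4 and v2.0

fake_images = (coco_train + coco_val) * captions_per_image * generators
print(fake_images)  # on the order of the ~1.2M images reported
```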
Model(s)
ResNet, ViT, CLIP, OpenCLIP (various configurations)
Author countries
Italy