AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection
Authors: Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray
Published: 2025-12-19 16:06:03+00:00
AI Summary
The paper introduces AdaptPrompt, a parameter-efficient transfer learning framework that leverages CLIP for generalizable deepfake detection across diverse generative models. The method jointly optimizes task-specific textual prompts and lightweight visual adapters while keeping the VLM backbone frozen. The authors also propose Diff-Gen, a large-scale benchmark of diffusion-generated fakes, and demonstrate state-of-the-art performance in both standard and challenging cross-domain scenarios.
Abstract
Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, making reliable deepfake detection increasingly difficult. A key challenge is generalization: detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that exhibit broader spectral artifacts than traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder improves the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state of the art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling precise identification of generator architectures in closed-set settings.
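The abstract names two concrete mechanisms: (1) learning textual prompts and a lightweight visual adapter on top of a frozen CLIP backbone, and (2) pruning the final transformer block of the vision encoder. The sketch below illustrates both, but it is not the authors' implementation: the class name, the bottleneck adapter design, the prompt initialization strings, and the simplification of token-level prompt vectors to learnable class embeddings in CLIP's joint space (the paper learns CoOp-style context tokens instead) are all illustrative assumptions.

```python
# Minimal sketch of a frozen-CLIP deepfake detector with a pruned vision
# encoder, a bottleneck visual adapter, and learnable class embeddings.
# Assumptions are marked in comments; this is not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

class VisualAdapter(nn.Module):
    """Residual bottleneck adapter (assumed design; width is illustrative)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))

class AdaptPromptSketch(nn.Module):
    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(model_name)
        self.clip.requires_grad_(False)  # freeze the entire VLM backbone

        # Layer ablation: drop the final vision transformer block so that
        # high-frequency generative artifacts survive into the pooled output.
        layers = self.clip.vision_model.encoder.layers
        self.clip.vision_model.encoder.layers = layers[:-1]

        # Simplified "textual prompts": learnable real/fake class embeddings
        # in the joint space, initialized from hand-written prompts. The
        # paper instead learns token-level context vectors (CoOp-style).
        proc = CLIPProcessor.from_pretrained(model_name)
        with torch.no_grad():
            tok = proc(text=["a real photograph", "an AI-generated image"],
                       return_tensors="pt", padding=True)
            init = self.clip.get_text_features(**tok)
        self.class_embeds = nn.Parameter(init.clone())

        self.adapter = VisualAdapter(self.clip.config.projection_dim)

    def forward(self, pixel_values):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        img = self.adapter(img)                 # only trainable vision path
        img = F.normalize(img, dim=-1)
        txt = F.normalize(self.class_embeds, dim=-1)
        # Scaled cosine similarities as logits over [real, fake].
        return self.clip.logit_scale.exp() * img @ txt.t()
```

Only the adapter and the class embeddings receive gradients, e.g. `torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)`, so the trainable parameter count stays tiny relative to the backbone, which is the parameter-efficiency property the abstract emphasizes.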