Generalizable Face Forgery Detection via Separable Prompt Learning

Authors: Enrui Yang, Yuezun Li

Published: 2026-04-19 07:51:34+00:00

AI Summary

This paper proposes Separable Prompt Learning (SePL), a novel strategy that enhances CLIP's capabilities for generalizable face forgery detection by shifting focus to the text modality. SePL disentangles forgery-specific and forgery-irrelevant information in images through two types of learnable prompts, guided by a cross-modality alignment strategy and dedicated objectives. This approach enables CLIP to serve as an effective face forgery detector, demonstrating strong generalizability across diverse evaluation settings.

Abstract

Detecting face forgeries with CLIP has recently emerged as a promising and increasingly popular research direction. Owing to the rich visual knowledge CLIP acquires through large-scale pretraining, most existing methods rely on its visual encoder while paying limited attention to the text modality. Given the instructive nature of the text modality, we posit that, with careful design, it can be leveraged to guide Deepfake detection. Accordingly, we shift the focus from the visual modality to the text modality and propose a new Separable Prompt Learning (SePL) strategy that enables CLIP to serve as an effective face forgery detector. The core idea of SePL is to disentangle forgery-specific and forgery-irrelevant information in images via two types of prompt learning, with the former enhancing detection. To achieve this disentanglement, we introduce a cross-modality alignment strategy and a set of dedicated objectives. Extensive experiments demonstrate that, with this simple adaptation, our method achieves competitive and even superior performance compared to other methods under both cross-dataset and cross-method evaluation, highlighting its strong generalizability. The code has been released at https://github.com/OUC-YER/SePL-DeepfakeDetection


Key findings
SePL achieves competitive, and often superior, performance compared to 17 state-of-the-art methods across both cross-dataset and cross-method evaluations, highlighting its strong generalizability. It shows a marked improvement in cross-method evaluation (e.g., roughly 8% on BleFace) and remains robust under various image perturbations. Ablation studies confirm the incremental contribution of each proposed loss term and the effectiveness of the text modality and cross-modality alignment in learning disentangled, discriminative forgery features.
Approach
The method introduces a Separable Prompt Learning (SePL) strategy that designs two types of learnable prompts to capture forgery-specific and forgery-irrelevant information. These prompts are encoded into text embeddings that, through a cross-attention-based cross-modality alignment strategy, guide the CLIP visual encoder to disentangle its visual features. A two-stage training process with multiple loss functions then enforces the separation of these two representations, yielding robust detection.
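The alignment step described above can be sketched in miniature: two prompt sets act as queries in a cross-attention over visual patch features, so each set pools the visual information it aligns with. This is a minimal NumPy sketch, not the paper's implementation; the prompt counts, dimensions, and the mean-pooling at the end are illustrative assumptions, and the random arrays stand in for the frozen CLIP text and visual encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Prompt embeddings (queries) attend over visual patch features."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (P, N) attention logits
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ keys_values                    # (P, d) pooled features

rng = np.random.default_rng(0)
d = 64            # embedding dim (illustrative; CLIP ViT-L/14 uses 768)
num_patches = 16  # visual tokens from the image encoder

# Two learnable prompt sets (stand-ins for SePL's text-encoded prompts).
forgery_prompts    = rng.standard_normal((4, d))  # forgery-specific
irrelevant_prompts = rng.standard_normal((4, d))  # forgery-irrelevant

visual_patches = rng.standard_normal((num_patches, d))  # visual features

# Each prompt set pools the visual content it aligns with, producing two
# disentangled visual representations; only the first feeds the detector.
forgery_feat    = cross_attention(forgery_prompts, visual_patches).mean(axis=0)
irrelevant_feat = cross_attention(irrelevant_prompts, visual_patches).mean(axis=0)
```

In the paper, dedicated objectives push these two pooled representations apart while keeping the forgery-specific one discriminative; the sketch only covers the attention-based pooling itself.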
Datasets
FaceForensics++ (c23), Celeb-DF-v2, DFD, DFDC, DFDCP, WDF, UniFace, BleFace, MobSwap, e4s, FaceDan, FSGAN, InSwap, SimSwap, ProGAN, UniversalFakeDetect
Model(s)
CLIP (ViT-Large-Patch14 backbone), LoRA (Low-rank adaptation of large language models), SVD-based LoRA
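Since the models listed include LoRA and an SVD-based LoRA, the underlying idea is worth making concrete: a frozen weight W is adapted through a trainable low-rank update W + (alpha/r) * B @ A, and the SVD-based variant seeds the factors from W's top singular components. This is a hedged NumPy sketch under those assumptions (the dimensions, alpha, and the PiSSA-style SVD seeding are illustrative, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 32, 32, 4, 8

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

# Standard LoRA: trainable low-rank factors; B starts at zero so the
# adapted layer initially equals the pretrained one.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))
W_adapted = W + (alpha / r) * (B @ A)    # identical to W before training

# SVD-based initialization (assumption: seed the factors with the top-r
# singular components of W, as in PiSSA-style LoRA variants).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
B_svd = U[:, :r] * np.sqrt(S[:r])             # (d_out, r)
A_svd = np.sqrt(S[:r])[:, None] * Vt[:r, :]   # (r, d_in)
residual = W - B_svd @ A_svd                  # frozen residual weight
W_reparam = residual + B_svd @ A_svd          # equals W exactly at init
```

Only A and B (or B_svd and A_svd) are trained, so the number of tunable parameters is r * (d_in + d_out) per layer rather than d_in * d_out, which is what makes adapting a large frozen backbone like CLIP practical.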
Author countries
China