GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Authors: Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao

Published: 2026-03-31 05:59:59+00:00

AI Summary

This paper introduces GazeCLIP, a novel gaze-guided CLIP model with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). GazeCLIP leverages observed distribution differences between pristine and forged gaze vectors to enhance generalization to unseen face forgery attacks. It achieves state-of-the-art performance on a new benchmark encompassing advanced generative models like diffusion and flow models.

Abstract

Current deepfake attribution and deepfake detection works tend to exhibit poor generalization to novel generative methods because they explore the visual modality alone. They also tend to assess attribution or detection performance on unseen advanced generators only coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we construct a novel, fine-grained benchmark to evaluate the DFAD performance of networks on novel generators such as diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, devised to improve generalization to unseen face forgery attacks. Built upon the novel observations that the distributions of pristine and forged gaze vectors differ significantly, and that GAN- and diffusion-generated facial images preserve the target gaze to markedly different degrees, we design a visual perception encoder that exploits these inherent gaze differences to mine global forgery embeddings across the appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted by a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) that generates dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state of the art by 6.56% ACC and 5.32% AUC on average under the attribution and detection settings, respectively. Code will be available on GitHub.
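The motivating observation, that pristine and forged faces yield measurably different gaze-vector distributions, can be probed with a short analysis script. The sketch below is not the paper's code: `estimate_gaze` is a hypothetical wrapper around a pretrained gaze estimator (e.g., one trained on ETH-XGaze), and a per-axis Kolmogorov-Smirnov statistic is used only as a simple stand-in for whatever distribution comparison the authors perform.

```python
# Hedged sketch: probing the gaze-distribution gap between real and fake faces.
# estimate_gaze is a hypothetical helper, not part of the paper's released code.
import numpy as np
from scipy.stats import ks_2samp


def estimate_gaze(image: np.ndarray) -> np.ndarray:
    """Return a (pitch, yaw) gaze vector for a face crop. Placeholder:
    plug in a pretrained gaze estimator (e.g., ETH-XGaze) here."""
    raise NotImplementedError


def gaze_distribution_gap(real_faces, fake_faces):
    """Compare gaze-vector distributions of real vs. generated face crops."""
    real = np.stack([estimate_gaze(img) for img in real_faces])  # (N, 2)
    fake = np.stack([estimate_gaze(img) for img in fake_faces])  # (M, 2)
    # Per-axis two-sample KS statistic as a crude distribution-difference proxy.
    return [ks_2samp(real[:, i], fake[:, i]).statistic for i in range(2)]
```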


Key findings
GazeCLIP significantly outperforms state-of-the-art methods, achieving an average performance improvement of 6.56% ACC and 5.32% AUC for attribution and detection, respectively, on novel generators. The integration of gaze features and adaptive-enhanced language prompts notably boosts generalization to unseen deepfake attacks and in-the-wild datasets. Ablation studies confirm the critical role of each proposed module and the effectiveness of the adaptive-enhanced word selector across various transformer-based vision-language models.
Approach
The approach is a gaze-aware model built on CLIP. A visual perception encoder (VPE) mines global forgery embeddings across the appearance and gaze domains, a gaze-aware image encoder (GIE) fuses gaze prompts with forged image embeddings, and a language refinement encoder (LRE) generates dynamically enhanced language embeddings via an adaptive-enhanced word selector (AWS) for precise vision-language matching. This fusion improves generalization to unseen deepfake generators.
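The following PyTorch sketch illustrates how these modules could compose, under stated assumptions: the placeholder backbones, the cross-attention fusion inside the GIE, the fixed class-embedding stand-in for the LRE/AWS output, and all dimensions are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of a GazeCLIP-style forward pass (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeCLIPSketch(nn.Module):
    """Illustrative composition: VPE (image features), gaze encoder (gaze prompts),
    GIE (gaze-image fusion), and LRE output (class-wise language embeddings)."""

    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        # Placeholder backbones; a real system would use a CLIP image tower and
        # a frozen gaze estimator (e.g., ETH-XGaze-pretrained) instead.
        self.vpe = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.gaze_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        # GIE fusion modeled here as cross-attention (an assumption).
        self.gie = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Stand-in for LRE/AWS-refined text embeddings, one per attribution class.
        self.text_embeddings = nn.Parameter(torch.randn(num_classes, dim))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # CLIP-style temperature

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        img = self.vpe(images).unsqueeze(1)            # (B, 1, dim) appearance features
        gaze = self.gaze_encoder(images).unsqueeze(1)  # (B, 1, dim) gaze prompts
        fused, _ = self.gie(img, gaze, gaze)           # image tokens attend to gaze prompts
        vis = F.normalize(fused.squeeze(1), dim=-1)
        txt = F.normalize(self.text_embeddings, dim=-1)
        # Vision-language matching: cosine similarity over attribution classes.
        return self.logit_scale.exp() * vis @ txt.t()  # (B, num_classes) logits


if __name__ == "__main__":
    model = GazeCLIPSketch()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 5])
```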
Datasets
GenFace, DF40, CelebA-HQ, Celeb-DF++, FFHQ, WildDeepfake, Celeb-DF, DFDC
Model(s)
GazeCLIP (based on CLIP), pre-trained gaze estimator (ETH-Xgaze), ResNet, Xception, ViT, CViT, CAEL, MFCLIP, DNA-Det, DE-FAKE, ForensicsAdapter (FA), OmniDFA, CDAL
Author countries
China