Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection

Authors: Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam

Published: 2026-03-02 04:58:00+00:00

Comment: Accepted at ICDF2C 2025

AI Summary

This paper introduces Deepfake Forensics Adapter (DFA), a novel dual-stream framework designed for generalizable deepfake detection. DFA synergizes a pre-trained CLIP model with targeted forensics analysis through a Global Feature Adapter, a Local Anomaly Stream, and an Interactive Fusion Classifier. The framework achieves state-of-the-art performance and strong generalization against evolving deepfake threats, most notably on the challenging DFDC dataset.

Abstract

The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. Existing detection methods struggle to generalize to emerging forgery patterns. This paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components, leveraging CLIP's powerful general capabilities for specialized deepfake detection while keeping its parameters frozen: 1) a Global Feature Adapter identifies global inconsistencies in image content that may indicate forgery; 2) a Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors; and 3) an Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations on frame-level and video-level benchmarks demonstrate the superior generalization of DFA, which achieves state-of-the-art performance on the challenging DFDC dataset with a frame-level AUC/EER of 0.816/0.256 and a video-level AUC/EER of 0.836/0.251, a 4.8% video AUC improvement over previous methods. Beyond state-of-the-art performance, our framework points to a feasible and effective direction for building robust deepfake detection systems with enhanced generalization against evolving deepfake threats. Our code is available at https://github.com/Liao330/DFA.git
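
As a concrete illustration of the frozen-CLIP-plus-adapter design described above, the following is a minimal PyTorch sketch. The residual bottleneck structure, the 768-dimensional feature size (CLIP ViT-L/14's projected image embedding), and the hidden width are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Trainable residual bottleneck on top of frozen CLIP features
    # (structure and widths are assumptions for illustration).
    def __init__(self, dim: int = 768, hidden: int = 192):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps CLIP's general-purpose representation intact
        # while the adapter learns a forgery-oriented correction.
        return x + self.up(self.act(self.down(x)))

class GlobalStream(nn.Module):
    # Frozen CLIP visual encoder followed by the trainable adapter.
    def __init__(self, clip_visual: nn.Module, dim: int = 768):
        super().__init__()
        self.clip_visual = clip_visual
        for p in self.clip_visual.parameters():
            p.requires_grad = False  # CLIP parameters are never updated
        self.adapter = BottleneckAdapter(dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.clip_visual(images)  # (B, dim) global embedding
        return self.adapter(feats)

Only the adapter (and any downstream heads) receives gradients, which is what allows specializing CLIP for forensics without changing its parameters.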


Key findings
DFA achieved state-of-the-art performance on the DFDC dataset, with a frame-level AUC/EER of 0.816/0.256 and a video-level AUC/EER of 0.836/0.251, a 4.8% video AUC improvement over prior methods. An ablation study confirmed that each module (Global Feature Adapter, Local Anomaly Stream, Interactive Fusion Classifier) is critical to the model's robust performance and enhanced generalization.
Approach
DFA employs a dual-stream network built on a frozen CLIP ViT-L/14 visual encoder. A Global Feature Adapter steers CLIP's attention toward global forgery cues, while a Local Anomaly Stream, which leverages facial landmarks and a ResNeXt-50 backbone, perceives local facial anomalies. An Interactive Fusion Classifier then deeply fuses the global and local features through a transformer encoder for the final classification.
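
Below is a minimal sketch of the fusion step, assuming the two streams have already been projected to a common dimension and that a learnable classification token attends over them; the token layout and sizes are illustrative assumptions rather than the authors' exact configuration.

import torch
import torch.nn as nn

class InteractiveFusionClassifier(nn.Module):
    # Transformer encoder over [CLS, global, local] tokens so the two
    # streams can exchange information via self-attention.
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 2)  # real-vs-fake logits

    def forward(self, global_feat: torch.Tensor,
                local_feat: torch.Tensor) -> torch.Tensor:
        # global_feat, local_feat: (B, dim) outputs of the two streams.
        b = global_feat.size(0)
        streams = torch.stack([global_feat, local_feat], dim=1)  # (B, 2, dim)
        cls = self.cls_token.expand(b, -1, -1)                   # (B, 1, dim)
        fused = self.encoder(torch.cat([cls, streams], dim=1))   # (B, 3, dim)
        return self.head(fused[:, 0])  # classify from the CLS position

For example, with 512-dimensional projections g and l from the two streams, InteractiveFusionClassifier()(g, l) yields a (B, 2) tensor of real/fake logits.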
Datasets
Celeb-DF-v1, Celeb-DF-v2, Deepfake Detection Challenge (DFDC), DFDCP, FaceForensics++ (FF++)
Model(s)
Deepfake Forensics Adapter (DFA), CLIP ViT-L/14, ResNeXt-50, Transformer Encoder
Author countries
China, Singapore, Hong Kong