Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning

Authors: Stelios Mylonas, Symeon Papadopoulos

Published: 2025-08-27 09:46:45+00:00

AI Summary

This research proposes a robust video deepfake detection framework that leverages face foundation models for enhanced generalization. The method utilizes a self-supervised model (FSFM) fine-tuned with an ensemble of deepfake datasets and incorporates triplet loss variants for improved embedding separability.

Abstract

The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that takes advantage of the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned using an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings between real and fake samples. Additionally, we explore attribution-based supervision schemes, where deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.


Key findings
FSFM initialization significantly improves out-of-distribution and in-the-wild generalization. The 'Batch All' triplet loss variant consistently outperforms the other variants and the baseline. Attribution-based training strategies yielded limited improvement and struggled to generalize to in-the-wild scenarios.
Approach
The approach uses a face foundation model (FSFM) initialized with pre-trained weights and further fine-tuned on a diverse set of deepfake datasets. Triplet loss variants are incorporated during training to enhance the separability of embeddings between real and fake videos.
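The 'Batch All' strategy highlighted in the findings forms every valid (anchor, positive, negative) triplet within a mini-batch and averages the loss over the triplets that still violate the margin. A minimal PyTorch sketch of that variant (the function name, margin value, and binary real/fake labels are illustrative assumptions, not the authors' code):

```python
import torch


def batch_all_triplet_loss(embeddings, labels, margin=0.2):
    """'Batch All' triplet loss: build all valid (anchor, positive,
    negative) triplets in the batch and average over the active ones."""
    # Pairwise Euclidean distances between all embeddings in the batch
    dist = torch.cdist(embeddings, embeddings, p=2)

    # Masks: positives share the anchor's label (excluding the anchor
    # itself); negatives have a different label
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    pos_mask = same & ~eye
    neg_mask = ~same

    # loss[a, p, n] = d(a, p) - d(a, n) + margin for every triplet
    triplet_loss = dist.unsqueeze(2) - dist.unsqueeze(1) + margin
    valid = pos_mask.unsqueeze(2) & neg_mask.unsqueeze(1)
    triplet_loss = torch.relu(triplet_loss[valid])

    # Average only over triplets that still violate the margin
    num_active = (triplet_loss > 0).sum().clamp(min=1)
    return triplet_loss.sum() / num_active
```

Averaging only over margin-violating ("active") triplets keeps easy, already-separated triplets from diluting the gradient, which is the usual rationale for the Batch All formulation.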
Datasets
FaceForensics++ (FF++), Celeb-DF, DFDC, FakeAVCeleb, ForgeryNet, DF40, WDF (WildDeepfake), ITW (internal in-the-wild dataset), Deepfake-Eval-2024
Model(s)
ViT-Base-16 architecture initialized with FSFM pre-trained weights.
Author countries
Greece