Scaling Laws for Deepfake Detection

Authors: Wenhao Wang, Longqi Cai, Taihong Xiao, Yuxiao Wang, Ming-Hsuan Yang

Published: 2025-10-18 03:08:10+00:00

AI Summary

This paper analyzes scaling laws for deepfake detection using ScaleDF, a new large-scale dataset of over 14 million real and fake images spanning 51 real-image domains and 102 deepfake generation methods. The study demonstrates that detection error follows a predictable power-law decay as the diversity of real domains or deepfake generation methods increases, similar to the scaling laws observed in LLMs. This suggests a data-centric approach to building robust deepfake detectors: focus on diversifying the training data.

Abstract

This paper presents a systematic study of scaling laws for the deepfake detection task. Specifically, we analyze the model performance against the number of real image domains, deepfake generation methods, and training images. Since no existing dataset meets the scale requirements for this research, we construct ScaleDF, the largest dataset to date in this field, which contains over 5.8 million real images from 51 different datasets (domains) and more than 8.8 million fake images generated by 102 deepfake methods. Using ScaleDF, we observe power-law scaling similar to that shown in large language models (LLMs). Specifically, the average detection error follows a predictable power-law decay as either the number of real domains or the number of deepfake methods increases. This key observation not only allows us to forecast the number of additional real domains or deepfake methods required to reach a target performance, but also inspires us to counter the evolving deepfake technology in a data-centric manner. Beyond this, we examine the role of pre-training and data augmentations in deepfake detection under scaling, as well as the limitations of scaling itself.
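The abstract's forecasting claim can be made concrete with a small curve-fitting sketch. The snippet below is a minimal illustration, not the authors' code: it assumes error decays as a·n^(−b) in the number of real domains n, fits the two parameters to hypothetical measurements, and inverts the fit to forecast how many domains a target error would require.

```python
# Minimal sketch of fitting a power-law decay error(n) = a * n**(-b) and
# inverting it to forecast the diversity needed for a target error.
# The (n, error) values below are hypothetical, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    return a * np.power(n, -b)

n_domains = np.array([1, 2, 4, 8, 16, 32, 51], dtype=float)
error     = np.array([0.30, 0.24, 0.19, 0.15, 0.12, 0.10, 0.09])

(a, b), _ = curve_fit(power_law, n_domains, error, p0=(0.3, 0.3))

target = 0.05
n_needed = (a / target) ** (1.0 / b)  # solve a * n**(-b) = target for n
print(f"error(n) ≈ {a:.3f} * n^(-{b:.3f}); "
      f"≈{int(np.ceil(n_needed))} domains for error ≤ {target}")
```

The same fit-and-invert procedure applies when n counts deepfake generation methods instead of real domains.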


Key findings
Detection error follows a power-law decay when scaling the number of real domains or deepfake methods, with no sign of saturation along either dimension. In contrast, merely increasing the number of training images beyond roughly 10 million without increasing diversity yields diminishing returns, which the authors model with a double-saturating power law. Finally, training on ScaleDF achieves the best cross-benchmark generalization compared to existing datasets.
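The saturation finding contrasts with the diversity dimensions. As a hedged illustration (the paper's exact double-saturating functional form is not reproduced here), a single saturation term a·m^(−b) + c already captures the qualitative behavior: the floor c bounds how low the error can go from quantity alone.

```python
# Hedged sketch of a saturating power law error(m) = a * m**(-b) + c,
# where m is the training-image count (in millions) and c is the error
# floor reached without added diversity. Values are illustrative only;
# the paper fits a double-saturating power law, whose exact form differs.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(m, a, b, c):
    return a * np.power(m, -b) + c

m_images = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 14.0])   # millions of images
error    = np.array([0.20, 0.16, 0.135, 0.118, 0.108, 0.104])

(a, b, c), _ = curve_fit(saturating_power_law, m_images, error,
                         p0=(0.1, 0.5, 0.1), maxfev=10000)
print(f"fit floor c ≈ {c:.3f}: beyond ~10M images, error approaches c "
      "unless domain/method diversity also grows")
```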
Approach
The methodology treats deepfake detection as a binary real-vs-fake classification problem built on the Vision Transformer (ViT) architecture. The core contribution is empirically measuring detection performance as the scale of data diversity (real domains, deepfake methods) and data quantity (images) grows, and fitting power-law and double-saturating power-law curves to the results; a minimal training sketch follows below.
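Concretely, such a setup can be sketched as a standard binary classifier on a ViT backbone. The snippet below is an assumed minimal setup using the timm library; the model name, loss, optimizer, and hyperparameters are illustrative, not the authors' configuration.

```python
# Minimal sketch (assumed setup, not the authors' released code) of
# deepfake detection as binary real-vs-fake classification with a ViT
# backbone via timm. ScaleDF would supply the real/fake images; the
# dataloader is left abstract here.
import torch
import timm

# ViT-Base with a single logit head: 0 = real, 1 = fake.
# pretrained=True loads timm's default ImageNet weights (the paper
# separately studies the role of pre-training under scaling).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=1)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (B, 3, 224, 224) images."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)          # (B,)
    loss = criterion(logits, labels.float())   # labels in {0, 1}
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with random data.
loss = train_step(torch.randn(4, 3, 224, 224), torch.randint(0, 2, (4,)))
print(f"loss: {loss:.4f}")
```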
Datasets
ScaleDF (newly introduced), ImageNet-21K (pre-training); testing benchmarks: DeepFakeDetection, Celeb-DF V2, WildDeepFake, ForgeryNet, DeepFakeFace, DF40.
Model(s)
Vision Transformer (ViT-S, ViT-B, ViT-M, ViT-L, ViT-H); pre-trained CLIP and SigLIP 2 variants.
Author countries
Australia, UNKNOWN