FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection

Authors: Minji Heo, Simon S. Woo

Published: 2025-09-20 09:53:50+00:00

AI Summary

The paper introduces FakeChain, a large-scale benchmark dataset of multi-step deepfakes generated using various methods. Analysis reveals that deepfake detectors heavily rely on artifacts from the final manipulation step, leading to significant performance drops when the final generator differs from the training distribution, highlighting the need for more robust detection models.

Abstract

Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as face-swapping, GAN-based generation, and diffusion methods, pose an emerging and unforeseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single-step manipulations, little is known about detection model behavior under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce FakeChain, a large-scale benchmark comprising 1-, 2-, and 3-step forgeries synthesized using five state-of-the-art representative generators. Using this benchmark, we analyze detection performance and spectral properties across hybrid manipulations at different steps, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance depends heavily on the final manipulation type, with the F1-score dropping by up to 58.83% when it differs from the training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. These findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results underscore the importance of benchmarks such as FakeChain, which reflect the growing synthesis complexity and diversity of real-world scenarios. Our sample code is available at https://github.com/minjihh/FakeChain.


Key findings
Deepfake detectors rely primarily on artifacts from the final manipulation step, resulting in a significant F1-score drop (up to 58.83%) when the final generator type differs from the training distribution. The optimal training depth (the number of manipulation steps represented in the training data) varies depending on the generation method used. GAN and diffusion steps overwrite the frequency patterns left by earlier steps, while FaceFusion preserves them.
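The frequency-domain observation above is the kind of finding typically obtained by inspecting the azimuthally averaged log-magnitude Fourier spectrum of forgeries produced with different last steps. The sketch below is not the authors' pipeline; it is a minimal NumPy/PIL illustration, and the file names in the example call are hypothetical.

```python
import numpy as np
from PIL import Image

def radial_log_spectrum(img_path: str, size: int = 256) -> np.ndarray:
    """Azimuthally averaged log-magnitude spectrum of a grayscale image.

    Generator fingerprints (e.g. high-frequency tails or periodic peaks) show
    up in this profile; a final GAN/diffusion step can overwrite the earlier
    step's fingerprint, while a blending step like FaceFusion tends to keep it.
    """
    img = Image.open(img_path).convert("L").resize((size, size))
    x = np.asarray(img, dtype=np.float64)

    # 2-D FFT, shift DC to the center, take log magnitude.
    spec = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(x))))

    # Average the spectrum over rings of equal radius around the center.
    cy, cx = size // 2, size // 2
    yy, xx = np.indices(spec.shape)
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2).astype(int)
    profile = np.bincount(r.ravel(), weights=spec.ravel()) / np.bincount(r.ravel())
    return profile[: size // 2]

# Hypothetical comparison of a 1-step and a 2-step forgery of the same face:
# p1 = radial_log_spectrum("facefusion_only.png")
# p2 = radial_log_spectrum("facefusion_then_sdxl.png")
```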
Approach
FakeChain creates a benchmark dataset of multi-step deepfakes by sequentially applying five state-of-the-art deepfake generation methods (FaceFusion, StyleGAN3, StyleSwin, Stable Diffusion 3, Stable Diffusion XL). The authors analyze detection performance and spectral properties of these deepfakes to understand the limitations of current detection models.
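Conceptually, a multi-step forgery is a composition of generator calls in which each step consumes the previous step's output face. The following sketch only illustrates that chaining idea under stated assumptions; the wrapper functions are hypothetical stand-ins for the actual tools and are not the released FakeChain code.

```python
from typing import Callable, Dict, Iterable
from PIL import Image

# Hypothetical wrappers around the real tools; each takes and returns a face
# image. Replace the bodies with actual calls to FaceFusion, StyleGAN3, SD3, etc.
def generate_facefusion(img: Image.Image) -> Image.Image:
    raise NotImplementedError("call the FaceFusion face-swapping tool here")

def generate_stylegan3(img: Image.Image) -> Image.Image:
    raise NotImplementedError("regenerate the face with StyleGAN3 here")

def generate_sd3(img: Image.Image) -> Image.Image:
    raise NotImplementedError("re-render the face with Stable Diffusion 3 here")

GENERATORS: Dict[str, Callable[[Image.Image], Image.Image]] = {
    "FaceFusion": generate_facefusion,
    "StyleGAN3": generate_stylegan3,
    "SD3": generate_sd3,
}

def apply_chain(source: Image.Image, steps: Iterable[str]) -> Image.Image:
    """Sequentially apply the named generators; each step sees only the
    previous step's output, so a later generator can overwrite earlier traces."""
    img = source
    for name in steps:
        img = GENERATORS[name](img)
    return img

# Example 2-step chain: face swap followed by a diffusion re-rendering
# (the input file name is illustrative).
# fake = apply_chain(Image.open("ffhq_00001.png"), ["FaceFusion", "SD3"])
```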
Datasets
FakeChain (a new dataset created for this research), FFHQ-1024
Model(s)
Xception, F3Net, MAT
Author countries
South Korea