SEED: A Large-Scale Benchmark for Provenance Tracing in Sequential Deepfake Facial Edits

Authors: Mengieong Hoi, Zhedong Zheng, Ping Liu, Wei Liu

Published: 2026-04-12 08:27:17+00:00

AI Summary

This paper introduces SEED (Sequential Editing in Diffusion), a large-scale benchmark dataset comprising over 90,000 images for provenance tracing in sequential deepfake facial edits generated by diffusion models, complete with fine-grained annotations of edit order, attributes, and masks. The authors also propose FAITH (Frequency-Aware Identification Transformer), a baseline model that effectively aggregates spatial and frequency-domain cues to identify and order latent editing events, demonstrating superior performance in tracing complex editing histories.

Abstract

Deepfake content on social networks is increasingly produced through multiple sequential edits to biometric data such as facial imagery. Consequently, the final appearance of an image often reflects a latent chain of operations rather than a single manipulation. Recovering these editing histories is essential for visual provenance analysis, misinformation auditing, and forensic or platform moderation workflows that must trace the origin and evolution of AI-generated media. However, existing datasets predominantly focus on single-step editing and overlook the cumulative artifacts introduced by realistic multi-step pipelines. To address this gap, we introduce Sequential Editing in Diffusion (SEED), a large-scale benchmark for sequential provenance tracing in facial imagery. SEED contains over 90K images constructed via one to four sequential attribute edits using diffusion-based editing pipelines, with fine-grained annotations including edit order, textual instructions, manipulation masks, and generation models. These metadata enable step-wise evidence analysis and support both forgery detection and sequence prediction. To benchmark the challenges posed by SEED, we evaluate representative analysis strategies and observe that spatial-only approaches struggle under subtle and distributed diffusion artifacts, especially when such artifacts accumulate across multiple edits. Motivated by this observation, we further establish FAITH, a frequency-aware Transformer baseline that aggregates spatial and frequency-domain cues to identify and order latent editing events. Results show that high-frequency signals, particularly wavelet components, provide effective cues even under image degradation. Overall, SEED facilitates systematic study of sequential provenance tracing and evidence aggregation for trustworthy analysis of AI-generated visual content.
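The fine-grained annotations described in the abstract (edit order, textual instructions, manipulation masks, and generation model) suggest a per-image record along the following lines. This is an illustrative sketch only; the names (EditStep, SeedRecord, mask_path, and so on) are assumptions, not the released SEED schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EditStep:
    """One step in the latent editing chain (field names are hypothetical)."""
    order: int          # position within the 1-to-4-step sequence
    attribute: str      # edited facial attribute, e.g. "hair color"
    instruction: str    # textual instruction given to the diffusion editor
    mask_path: str      # manipulation mask localizing this edit

@dataclass
class SeedRecord:
    """A single SEED sample: the final image plus its editing history."""
    image_path: str
    generator: str      # diffusion-based editing model used
    edits: List[EditStep] = field(default_factory=list)

    @property
    def num_edits(self) -> int:
        return len(self.edits)
```

Such a structure makes the two tasks named in the abstract concrete: forgery detection reads only the final image, while sequence prediction must recover the ordered `edits` list.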


Key findings
The study found that existing spatial-only deepfake detectors struggle with sequential provenance tracing, especially as the number of edits increases and artifacts accumulate. Their proposed FAITH model, which incorporates frequency-domain signals, achieves stable improvements across various sequence lengths and exhibits enhanced robustness against post-processing degradations like JPEG compression and Gaussian noise, with wavelet components proving most effective for tracing subtle edit cues.
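The post-processing degradations mentioned above are standard perturbations for stress-testing detectors. A minimal sketch of how JPEG compression and Gaussian noise are typically applied is shown below; the quality and sigma values are placeholder assumptions, not the paper's evaluation settings.

```python
import io

import numpy as np
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip the image through JPEG at the given quality factor."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def add_gaussian_noise(img: Image.Image, sigma: float = 5.0) -> Image.Image:
    """Add zero-mean Gaussian noise with standard deviation sigma (0-255 scale)."""
    arr = np.asarray(img, dtype=np.float32)
    noisy = arr + np.random.normal(0.0, sigma, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0.0, 255.0).astype(np.uint8))
```

The reported finding is that frequency cues, wavelet components in particular, survive such degradations better than purely spatial features do.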
Approach
The authors address the challenge of provenance tracing in multi-step deepfake facial edits by first creating SEED, a large-scale benchmark of diffusion-based sequential facial edits. To evaluate this benchmark, they propose FAITH, a frequency-aware Transformer architecture that combines spatial features extracted by a ResNet-50 backbone with high-frequency cues from Discrete Wavelet Transform (DWT), injecting these frequency cues as an additive bias into the decoder's cross-attention for sequential attribute prediction.
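The architecture described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, with a one-level Haar DWT standing in for the paper's DWT branch, a single cross-attention layer, and invented dimensions (num_attrs, dim, max_steps); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def haar_highpass(x: torch.Tensor) -> torch.Tensor:
    """One-level 2D Haar DWT; returns the three high-frequency subbands
    (LH, HL, HH) concatenated along channels. Assumes even H and W."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return torch.cat([lh, hl, hh], dim=1)

class FaithSketch(nn.Module):
    """Toy FAITH-style model: ResNet-50 spatial tokens are queried by learnable
    per-step embeddings through cross-attention whose logits receive an additive
    bias derived from Haar high-frequency cues. All sizes are illustrative."""

    def __init__(self, num_attrs: int = 26, max_steps: int = 4,
                 dim: int = 256, heads: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        self.spatial = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, h, w)
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)
        self.freq_proj = nn.Conv2d(9, dim, kernel_size=1)   # 3 subbands x 3 channels
        self.queries = nn.Parameter(torch.randn(max_steps, dim))  # one query per edit step
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_attrs + 1)           # +1 for a "no edit" class
        self.heads = heads

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        B = img.size(0)
        feat = self.proj(self.spatial(img))                 # (B, dim, h, w)
        h, w = feat.shape[-2:]
        tokens = feat.flatten(2).transpose(1, 2)            # (B, h*w, dim)

        # Frequency branch: pool Haar high-frequency subbands onto the token
        # grid, then reduce them to a per-token scalar bias on attention logits.
        hf = self.freq_proj(haar_highpass(img))             # (B, dim, H/2, W/2)
        hf = nn.functional.adaptive_avg_pool2d(hf, (h, w))  # align with spatial grid
        bias = hf.mean(dim=1).flatten(1)                    # (B, h*w)

        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, max_steps, dim)
        # The additive bias enters cross-attention via attn_mask,
        # which MultiheadAttention expects as (B * heads, L, S) for float masks.
        mask = bias[:, None, :].expand(B, q.size(1), -1)
        mask = mask.repeat_interleave(self.heads, dim=0)
        out, _ = self.attn(q, tokens, tokens, attn_mask=mask)
        return self.head(out)                               # (B, max_steps, num_attrs + 1)
```

For a 256x256 input, `FaithSketch()(torch.randn(2, 3, 256, 256))` yields per-step logits of shape (2, 4, 27). The design point mirrored from the description above is that frequency cues enter as an additive bias on the cross-attention logits rather than being concatenated to the spatial tokens.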
Datasets
SEED (Sequential Editing in Diffusion), FFHQ, CelebAMask-HQ
Model(s)
FAITH (frequency-aware Transformer; ResNet-50 spatial backbone with DWT-based frequency cues injected into decoder cross-attention)
Author countries
China, USA