DREAM: A Benchmark Study for Deepfake REalism AssessMent

Authors: Bo Peng, Zichuan Wang, Sheng Yu, Xiaochuan Jin, Wei Wang, Jing Dong

Published: 2025-10-11 06:41:49+00:00

AI Summary

This paper introduces DREAM, a comprehensive benchmark for Deepfake Visual Realism Assessment (VRA), which aims to model subjective human perception of deepfake quality. The benchmark comprises a deepfake video dataset of diverse quality and a large-scale annotation set of 140,000 realism scores with accompanying textual descriptions collected from 3,500 human annotators. The study also evaluates 16 representative VRA methods, including a newly proposed description-aligned CLIP (DA-CLIP) model.

Abstract

Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on deepfake detection, which aims to reliably tell deepfakes apart from real videos in an objective way. The subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the visual realism assessment of deepfakes, defined as the automatic assessment of deepfake visual realism in a way that approximates human perception. This is important for evaluating the quality and deceptiveness of deepfakes, which can in turn be used to predict their influence on the Internet, and it also has potential to improve the deepfake generation process by serving as a critic. This paper advances this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It comprises a deepfake video dataset of diverse quality, a large-scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent large vision-language model based methods and a newly proposed description-aligned CLIP method. The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.
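
The Mean Opinion Score (MOS) targets used in this kind of benchmark are, by definition, per-video averages of individual annotator ratings. The minimal sketch below illustrates that aggregation; the column names and the 1-5 rating scale are illustrative assumptions, not details of the DREAM release.

```python
# Hypothetical sketch: aggregating raw annotator realism scores into a
# per-video Mean Opinion Score (MOS). Column names and the 1-5 scale are
# illustrative assumptions, not taken from the DREAM annotation format.
import pandas as pd

def compute_mos(annotations: pd.DataFrame) -> pd.Series:
    """Average the realism ratings of all annotators for each video."""
    return annotations.groupby("video_id")["realism_score"].mean()

# Example: three annotators rating two clips on a 1-5 realism scale.
raw = pd.DataFrame({
    "video_id": ["fake_001", "fake_001", "fake_001",
                 "fake_002", "fake_002", "fake_002"],
    "realism_score": [4, 5, 4, 2, 1, 2],
})
print(compute_mos(raw))  # fake_001 -> 4.33, fake_002 -> 1.67
```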


Key findings
The proposed DA-CLIP model achieved the best overall performance (average score of 0.827), confirming the benefit of incorporating description alignment through multi-modal training. Pretraining models on deepfake detection datasets significantly improves realism assessment accuracy, suggesting a close relationship between the two tasks. Furthermore, the cross-modal approach enables the generation of accurate, fine-grained textual explanations of detected artifacts.
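For reference, realism assessment methods are typically scored by correlating their predicted scores with the human MOS labels; Spearman (SRCC) and Pearson (PLCC) correlations are the usual choices in quality-assessment work, though whether the reported 0.827 averages exactly these metrics is an assumption here. A minimal sketch of such an evaluation:

```python
# Minimal sketch of scoring VRA predictions against human MOS labels.
# Assumes SRCC/PLCC as the correlation metrics, which is common practice
# in quality assessment but not confirmed as DREAM's exact protocol.
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate_vra(pred_scores: np.ndarray, mos_labels: np.ndarray) -> dict:
    srcc, _ = spearmanr(pred_scores, mos_labels)   # rank-order correlation
    plcc, _ = pearsonr(pred_scores, mos_labels)    # linear correlation
    return {"SRCC": srcc, "PLCC": plcc, "avg": (srcc + plcc) / 2}

# Toy usage: a predictor that tracks the MOS with some noise.
rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, size=100)
pred = mos + rng.normal(0, 0.5, size=100)
print(evaluate_vra(pred, mos))
```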
Approach
The researchers benchmark 16 diverse methods, ranging from hand-crafted features to deep fine-tuning and vision-language models (VLMs). They propose DA-CLIP, which adapts CLIP by using a Swin Transformer visual backbone and aligning visual features with human textual descriptions of artifacts via a cross-modal similarity loss. This setup allows the model to predict Mean Opinion Scores (MOS) while providing text-based explanations.
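As a rough illustration of the description-alignment idea, the sketch below pairs an MOS regression head with a cosine-similarity term that pulls the visual features toward the text embedding of the corresponding human artifact description. The backbone choice, projection sizes, and loss weighting are assumptions for illustration, not the authors' exact configuration.

```python
# PyTorch sketch of description-aligned realism assessment: a visual
# backbone predicts a realism score (MOS) while its features are pulled
# toward the text embedding of the human artifact description.
# Layer sizes and the loss weight are assumptions, not DA-CLIP's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptionAlignedVRA(nn.Module):
    def __init__(self, visual_backbone: nn.Module, feat_dim: int, text_dim: int = 512):
        super().__init__()
        self.backbone = visual_backbone              # e.g. a Swin-style encoder
        self.proj = nn.Linear(feat_dim, text_dim)    # map visual features to text space
        self.mos_head = nn.Linear(feat_dim, 1)       # realism score regressor

    def forward(self, frames: torch.Tensor):
        feat = self.backbone(frames)                 # (B, feat_dim)
        mos = self.mos_head(feat).squeeze(-1)        # (B,) predicted realism
        emb = F.normalize(self.proj(feat), dim=-1)   # (B, text_dim) aligned features
        return mos, emb

def da_loss(pred_mos, visual_emb, mos_gt, text_emb, align_weight: float = 0.5):
    """MOS regression loss plus a cosine-similarity alignment term."""
    reg = F.mse_loss(pred_mos, mos_gt)
    align = 1.0 - F.cosine_similarity(visual_emb,
                                      F.normalize(text_emb, dim=-1), dim=-1).mean()
    return reg + align_weight * align
```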
Datasets
DREAM (Benchmark dataset constructed using videos from DFGC-2022 and newly annotated real videos), DFGC-2022, ImageNet, VGG-Face, DFDC.
Model(s)
Swin-transformer v2, ConvNeXt, CLIP, mPLUG-Owl2, InternVL2.5-8B, ResNet50, VGG-Face, DA-CLIP (proposed).
Author countries
China