Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos

Authors: Marco Postiglione, Isabel Gortner, V. S. Subrahmanian

Published: 2026-03-15 23:25:34+00:00

AI Summary

This paper evaluates human and AI deepfake detection capabilities across varying video qualities, finding that humans significantly outperform state-of-the-art AI detectors, particularly on low-to-medium quality videos. It demonstrates that human and AI errors are complementary, allowing hybrid human-AI ensembles to achieve higher accuracy and reduce high-confidence errors. The findings suggest that effective real-world deepfake detection, especially for non-professionally produced videos, requires human-AI collaboration.

Abstract

Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones, yielding low-to-moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening on CharadesDF, where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially for non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.


Key findings
Humans substantially outperform AI detectors, particularly on challenging low-to-medium quality videos where AI performance approaches chance levels. Human and AI errors are complementary (humans miss high-quality fakes, AI misclassifies real videos as fake), enabling hybrid human-AI ensembles to reduce high-confidence errors and significantly boost overall accuracy. While face size is a strong predictor for both humans and AI, AI detectors are more sensitive to low-level visual properties, and demographic factors have limited predictive power for human detection ability.
Approach
The authors conducted experiments comparing 200 human participants and 95 state-of-the-art AI deepfake detectors on two datasets: DF40 (a standard benchmark) and CharadesDF (a novel dataset of low-to-medium quality, user-generated-like videos). They analyzed individual and ensemble performance, error patterns, confidence calibration, and the influence of visual quality factors and demographics. Hybrid human-AI ensembles were formed by aggregating quality-weighted predictions from both groups.
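The quality-weighted aggregation step can be sketched as a weighted soft vote over per-rater fake-probabilities. This is a minimal illustration, not the paper's exact rule: the function name, the choice of a weighted mean, and the 0.5 decision threshold are assumptions; the weights stand in for whatever quality scores the authors derive for each human and AI rater.

```python
def hybrid_ensemble(human_probs, ai_probs, human_weights, ai_weights):
    """Combine human and AI fake-probabilities into one ensemble score.

    human_probs / ai_probs: each rater's estimated probability the video is fake.
    human_weights / ai_weights: non-negative quality weights per rater
    (hypothetical stand-ins for the paper's quality weighting).
    Returns the weighted-average probability; >= 0.5 would label the video fake.
    """
    scores = list(human_probs) + list(ai_probs)
    weights = list(human_weights) + list(ai_weights)
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one rater must have a positive weight")
    # Weighted mean of all raters' fake-probabilities.
    return sum(s * w for s, w in zip(scores, weights)) / total


# Example: two humans (one weighted 2x) and one AI detector.
p_fake = hybrid_ensemble([0.8, 0.7], [0.4], [2.0, 1.0], [1.0])
# p_fake = (0.8*2 + 0.7 + 0.4) / 4 = 0.675 -> labeled "fake"
```

A soft (probability-averaging) vote like this lets confident raters pull the ensemble score away from 0.5, which is one way complementary error patterns can cancel: an AI false alarm on a real video is outvoted by humans who correctly score it low.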
Datasets
DF40, CharadesDF, FaceForensics++, CelebDF-v2
Model(s)
32 state-of-the-art deepfake detection architectures, including F3Net, SPSL, SRM, Multi-Attention, FFD, RECCE, UCF, PCL-I2G, UIA-ViT, TimeSformer, I3D, STIL, VideoMAE, CLIP, X-CLIP. Backbone architectures included EfficientNet-B4, Xception, and various ResNet variants.
Author countries
United States