This is the largest listening study on audio deepfake perception to date, collected through our public Spot the Audio Deepfake game in 2025–2026. The released dataset contains 35,532 deepfake-detection judgments from 1,768 anonymous participants across 138 TTS and voice-conversion systems, spanning 10 architecture families: classical seq2seq, VITS, XTTS, flow-matching, diffusion, autoregressive LMs over codec tokens (VALL-E, Bark, Llasa, …), voice conversion, commercial APIs (ElevenLabs, Resemble AI, Cartesia Sonic), and the ASVspoof 5 challenge.
This study is the direct successor to the 2021 ASVspoof-2019 perception study (Müller, Pizzi & Williams, 2022), replicating the same game interface and active-learning sampling so the two periods can be compared head-to-head across a four-year gap in TTS development.
Each row in the released CSV is one judgment: hashed participant id, round number, audio filename, attack id, ground truth, the participant’s response, and the ML detector’s prediction. Demographics (age bracket, IT skill, native language) are excluded from the public release to prevent re-identification.
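As a rough illustration of working with this one-judgment-per-row layout, the sketch below tallies human and detector accuracy per attack id using only the Python standard library. The column names (`attack_id`, `ground_truth`, `participant_response`, `detector_prediction`) and the file path are assumptions, not the confirmed schema; check the header row of the released CSV and adjust the constants. It also assumes the ground truth, response, and prediction columns encode their labels identically.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

# Hypothetical column names: replace with the actual CSV header names.
COL_ATTACK = "attack_id"
COL_TRUTH = "ground_truth"
COL_HUMAN = "participant_response"
COL_DETECTOR = "detector_prediction"


def load_judgments(path: str) -> List[Dict[str, str]]:
    """Read the judgment CSV into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def accuracy_by_attack(
    rows: Iterable[Mapping[str, str]],
) -> Dict[str, Tuple[int, float, float]]:
    """Return {attack_id: (n_judgments, human_accuracy, detector_accuracy)}.

    A judgment counts as correct when the label string equals the
    ground-truth string (assumes identical label encoding across columns).
    """
    # Per attack: [number of judgments, human correct, detector correct]
    counts = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        tally = counts[row[COL_ATTACK]]
        tally[0] += 1
        tally[1] += int(row[COL_HUMAN] == row[COL_TRUTH])
        tally[2] += int(row[COL_DETECTOR] == row[COL_TRUTH])
    return {attack: (n, h / n, d / n) for attack, (n, h, d) in counts.items()}
```

With a local copy of the CSV, `accuracy_by_attack(load_judgments("judgments.csv"))` would give one entry per attack id; the filename here is a placeholder.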
Audio lives in four public corpora that the CSV references by filename: the English subset of MLAAD (most fakes), ASVspoof 5, In-The-Wild, and LJSpeech.
Download the dataset on Hugging Face.
The dataset is released under the CC BY-NC 4.0 license: free for research and non-commercial use with attribution.
@misc{mueller2026erodingtrust,
  title  = {Eroding Trust in Real Speech:
            A Large-Scale Study of Human Audio Deepfake Perception},
  author = {M{\"u}ller, Nicolas M. and Choong, Wei Herng},
  year   = {2026},
  note   = {Preprint forthcoming}
}