Human Perception of Audio Deepfakes (2026)

This is the largest listening study on audio deepfake perception to date, collected via our public Spot the Audio Deepfake game in 2025–2026. The released dataset contains 35,532 deepfake-detection judgments from 1,768 anonymous participants across 138 TTS and voice-conversion systems. The systems span 10 architecture families, including classical seq2seq, VITS, XTTS, flow-matching, diffusion, autoregressive LMs over codec tokens (VALL-E, Bark, Llasa, …), voice conversion, commercial APIs (ElevenLabs, Resemble AI, Cartesia Sonic), and the ASVspoof 5 challenge.

This study is the direct successor to the 2021 ASVspoof-2019 perception study (Müller, Pizzi & Williams, 2022), replicating the same game interface and active-learning sampling so the two periods can be compared head-to-head across a four-year gap in TTS development.

Headline findings
  • Skepticism shift. Human accuracy on fake samples is essentially unchanged from 2021 (72.9% → 71.2%), but accuracy on real audio dropped sharply (72.7% → 64.1%). Listeners increasingly misclassify authentic speech as fake.
  • Hardest architectures. Commercial APIs (61.3% human accuracy) and autoregressive LM (AR-LM) systems (65.9%) produce the hardest-to-detect samples; classical seq2seq (75.4%) and flow-matching models (76.8%) remain easier.
  • ML reference. A Wav2Vec 2.0 + AASIST detector maintains 94.5% overall accuracy across all categories.
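
The skepticism shift can be restated in percentage points. The short sketch below only redoes the arithmetic on the four accuracies quoted in the findings; the dictionary keys and the helper name are illustrative and not part of the release.

```python
# Human accuracies in percent, copied from the headline findings.
# The key names ("earlier", "current") are illustrative labels only.
ACCURACY = {
    "fake": {"earlier": 72.9, "current": 71.2},
    "real": {"earlier": 72.7, "current": 64.1},
}

def change_in_points(before, after):
    """Percentage-point change, rounded to one decimal to hide float noise."""
    return round(after - before, 1)

changes = {
    label: change_in_points(values["earlier"], values["current"])
    for label, values in ACCURACY.items()
}
print(changes)  # {'fake': -1.7, 'real': -8.6}
```

Accuracy on real audio fell by 8.6 points, roughly five times the 1.7-point change on fake audio.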

Each row in the released CSV is one judgment: hashed participant id, round number, audio filename, attack id, ground truth, the participant’s response, and the ML detector’s prediction. Demographics (age bracket, IT skill, native language) are excluded from the public release to prevent re-identification.
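
As a minimal sketch of working with this per-judgment layout, the snippet below computes accuracy separately for real and fake audio, for both the participants and the ML detector. The column names (`participant_id`, `round`, `filename`, `attack_id`, `ground_truth`, `response`, `ml_prediction`) and the `real`/`fake` label values are assumptions made for illustration, and the four rows are made-up toy data rather than records from the dataset. Check the header of the released CSV before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only. Column names and label values are
# assumptions, not the confirmed schema of the released CSV.
SAMPLE_CSV = """participant_id,round,filename,attack_id,ground_truth,response,ml_prediction
a1f3,1,clip_001.wav,A01,fake,fake,fake
a1f3,2,clip_002.wav,,real,fake,real
b7c9,1,clip_003.wav,A02,fake,real,fake
b7c9,2,clip_004.wav,,real,real,real
"""

def accuracy_by_truth(rows, prediction_column):
    """Return {ground-truth label: fraction of rows whose prediction matches it}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        truth = row["ground_truth"]
        total[truth] += 1
        if row[prediction_column] == truth:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}

# To use a real file, replace io.StringIO(SAMPLE_CSV) with an open file handle.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
human_accuracy = accuracy_by_truth(rows, "response")
detector_accuracy = accuracy_by_truth(rows, "ml_prediction")
print(human_accuracy)     # {'fake': 0.5, 'real': 0.5}
print(detector_accuracy)  # {'fake': 1.0, 'real': 1.0}
```

Splitting by ground truth rather than reporting one overall figure matters here, because the main finding is an asymmetry between accuracy on real and on fake audio.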

Audio lives in four public corpora that the CSV references by filename: the English subset of MLAAD (most fakes), ASVspoof 5, In-The-Wild, and LJSpeech.

Download the dataset on Hugging Face.

The dataset is released under the CC BY-NC 4.0 license: free for research and non-commercial use, with attribution.

How to cite:
@misc{mueller2026erodingtrust,
  title  = {Eroding Trust in Real Speech:
            A Large-Scale Study of Human Audio Deepfake Perception},
  author = {M{\"u}ller, Nicolas M. and Choong, Wei Herng},
  year   = {2026},
  note   = {Preprint forthcoming}
}