Human perception of audio deepfakes: the role of language and speaking style

Authors: Eugenia San Segundo, Aurora López-Jareño, Xin Wang, Junichi Yamagishi

Published: 2025-12-10 01:04:59+00:00

AI Summary

This study investigates human perceptual ability to identify audio deepfakes across languages (Spanish/Japanese) and speaking styles (audiobooks/interviews). In an experiment with 54 native listeners, the authors found an overall accuracy of 59.11%, only modestly above chance, indicating high vulnerability to synthetic voices. Detection improved significantly for native-language stimuli, and qualitative analysis showed that listeners rely heavily on suprasegmental features and non-linguistic cues such as breathing.

Abstract

Audio deepfakes have reached a level of realism that makes it increasingly difficult to distinguish between human and artificial voices, which poses risks such as identity theft or the spread of disinformation. Despite these concerns, research on humans' ability to identify deepfakes is limited, with most studies focusing on English and very few exploring the reasons behind listeners' perceptual decisions. This study addresses this gap through a perceptual experiment in which 54 listeners (28 native Spanish speakers and 26 native Japanese speakers) classified voices as natural or synthetic and justified their choices. The experiment included 80 stimuli (50% artificial), organized according to three variables: language (Spanish/Japanese), speaking style (audiobooks/interviews), and familiarity with the voice (familiar/unfamiliar). The goal was to examine how these variables influence detection and to analyze qualitatively the reasoning behind listeners' perceptual decisions. Results indicate an average accuracy of 59.11%, with higher performance on authentic samples. Judgments of vocal naturalness rely on a combination of linguistic and non-linguistic cues. Comparing Japanese and Spanish listeners, our qualitative analysis further reveals both shared cues and notable cross-linguistic differences in how listeners conceptualize the humanness of speech. Overall, participants relied primarily on suprasegmental and higher-level or extralinguistic characteristics (such as intonation, rhythm, fluency, pauses, speed, breathing, and laughter) rather than on segmental features. These findings underscore the complexity of human perceptual strategies in distinguishing natural from artificial speech and align partly with prior research emphasizing the importance of prosody and of phenomena typical of spontaneous speech, such as disfluencies.
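
The abstract describes a fully crossed stimulus design (language x speaking style x familiarity, half of the stimuli artificial) totalling 80 items. As a rough illustration only, the sketch below enumerates such a grid; equal per-cell counts (80 / 16 = 5 stimuli per cell) are an assumption of this sketch, not a detail stated in the summary.

```python
from itertools import product

# Enumerate the stimulus conditions described in the abstract.
# Assumption: a fully crossed, balanced design with equal stimuli per cell.
languages = ["Spanish", "Japanese"]
styles = ["audiobook", "interview"]
familiarity = ["familiar", "unfamiliar"]
authenticity = ["natural", "artificial"]

cells = list(product(languages, styles, familiarity, authenticity))
per_cell = 80 // len(cells)  # 80 stimuli over 16 cells -> 5 per cell (assumed)
for cell in cells:
    print(" / ".join(cell), f"-> {per_cell} stimuli")
```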


Key findings
Overall human accuracy was low (59.11%), only modestly above chance, and listeners showed a bias toward classifying audio as natural: performance was higher on authentic samples than on synthetic ones. Detection accuracy improved when stimuli were presented in the listener's native language. To distinguish real from fake speech, listeners relied primarily on suprasegmental features (e.g., intonation, rhythm) and non-linguistic characteristics (e.g., breathing, laughter) rather than on individual segmental features.
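
The reported pattern (overall accuracy near 59%, higher on authentic than on synthetic samples) can be made concrete by splitting accuracy by ground truth. A minimal sketch, using hypothetical (truth, response) pairs rather than the paper's data:

```python
from collections import Counter

def accuracy_breakdown(trials):
    """Compute overall and per-class accuracy from (truth, response) pairs.

    truth/response take the values "natural" or "synthetic". Higher accuracy
    on "natural" trials than on "synthetic" trials indicates a response bias
    toward judging voices as human.
    """
    correct, total = Counter(), Counter()
    for truth, response in trials:
        total[truth] += 1
        correct[truth] += int(truth == response)
    overall = sum(correct.values()) / sum(total.values())
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class

# Hypothetical trials for illustration only (not the paper's data).
trials = [
    ("natural", "natural"), ("natural", "natural"), ("natural", "synthetic"),
    ("synthetic", "natural"), ("synthetic", "synthetic"), ("synthetic", "natural"),
]
overall, per_class = accuracy_breakdown(trials)
print(overall, per_class)
```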
Approach
The researchers conducted a perceptual experiment with 54 native Spanish and Japanese listeners, asking them to classify 80 audio stimuli (balanced across language, speaking style, speaker familiarity, and authenticity) as natural or artificial. The quantitative results were analyzed using generalized linear mixed models, while the open-ended justifications were subjected to a qualitative thematic analysis based on phonetic features (segments, suprasegmentals, higher-level features).
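
As a rough sketch of the quantitative analysis: the paper fits generalized linear mixed models, whereas the example below substitutes a plain fixed-effects logistic regression in statsmodels (it omits the random listener and stimulus effects a GLMM would include). The data are simulated and the column names (`correct`, `language_match`, `style`, `familiar`) are assumptions, not the authors' variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400  # hypothetical number of listener judgments

# Simulated per-trial table for illustration only (not the paper's data).
df = pd.DataFrame({
    "language_match": rng.integers(0, 2, n),             # 1 = stimulus in listener's native language
    "style": rng.choice(["audiobook", "interview"], n),  # speaking style of the stimulus
    "familiar": rng.integers(0, 2, n),                    # 1 = familiar (celebrity) voice
})
# Correct-response probability loosely mimicking the reported ~59% accuracy,
# with a boost for native-language stimuli.
p = 0.52 + 0.12 * df["language_match"]
df["correct"] = rng.binomial(1, p)

# Plain logistic regression as a stand-in for the paper's GLMM, which would
# additionally include random intercepts for listener and stimulus.
model = smf.logit("correct ~ language_match + C(style) + familiar", data=df).fit()
print(model.summary())
```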
Datasets
LibriVox and YouTube (audiobooks); VoxCeleb-ESP and EACELEB (celebrity interviews).
Model(s)
ElevenLabs' Text-to-Speech (TTS) software ("Eleven Multilingual v2" model) was used to generate the synthetic stimuli. No machine learning detection model was developed or tested in this paper.
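
For reference, a minimal sketch of generating a synthetic sample with the ElevenLabs Python SDK and the multilingual model named above. This is not the authors' pipeline or settings; the `text_to_speech.convert` call reflects recent SDK versions and may differ in other versions, and the API key and voice ID are placeholders.

```python
# pip install elevenlabs  (SDK v1.x assumed; method names may differ in other versions)
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")  # placeholder credential

# Synthesize speech with the "Eleven Multilingual v2" model; the voice ID is a
# placeholder for whichever (e.g., cloned or stock) voice is used.
audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    model_id="eleven_multilingual_v2",
    text="Texto de ejemplo para sintetizar.",
)

# The SDK returns the audio as an iterator of byte chunks; write them to disk.
with open("synthetic_sample.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```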
Author countries
Spain, Japan