Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis

Authors: Mengze Hong, Di Jiang, Zeying Xie, Weiwei Zhao, Guan Wang, Chen Jason Zhang

Published: 2026-01-06 10:55:32+00:00

AI Summary

This paper systematically evaluates the vulnerability of state-of-the-art audio biometric authentication systems against contemporary deepfake speech synthesis models. It reveals that commercial speaker verification systems are easily bypassed by voice clones trained on minimal data and, critically, that anti-spoofing detectors suffer a massive performance drop (up to 30x) when encountering synthesis methods unseen during training, exposing a fundamental failure to generalize. These findings urge a move towards architectural innovations and multi-factor authentication for robust security.

Abstract

As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high-stakes industries. This paper presents a systematic empirical evaluation of state-of-the-art speaker authentication systems based on a large-scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small samples can easily bypass commercial speaker verification systems; and 2) anti-spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in-domain performance and real-world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi-factor authentication.


Key findings
State-of-the-art speaker verification systems were found highly vulnerable, with bypass rates reaching 82.7% against modern cloning models trained on 1–5 minutes of speech. Crucially, the XLS-R + AASIST deepfake detector failed to generalize to unseen synthesis architectures, exhibiting a massive performance collapse (EER 24.84%, a 30x degradation) in out-of-domain attacks. The study confirms that current detectors primarily memorize attack-specific statistical patterns rather than learning invariant synthesis properties.
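The Equal Error Rate cited above (24.84% out-of-domain) is the operating point where the false-accept rate on spoofed audio equals the false-reject rate on genuine audio. A minimal sketch of how EER is computed from two score lists (the score scale and data here are illustrative, not from the paper):

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Return the Equal Error Rate: the rate at the threshold where the
    false-accept rate (FAR) on spoofed trials is closest to the
    false-reject rate (FRR) on genuine trials."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best = None
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed audio wrongly accepted
        frr = np.mean(genuine_scores < t)   # genuine audio wrongly rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy example: perfectly separated scores give an EER of 0.
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

A 30x degradation then corresponds to an in-domain EER under 1% collapsing to roughly 25% out-of-domain, i.e. the detector misclassifies about a quarter of trials at its balanced operating point.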
Approach
The authors systematically evaluated the robustness of two state-of-the-art defense systems—ECAPA-TDNN for speaker verification and XLS-R + AASIST for anti-spoofing—using a benchmark dataset of synthetic Mandarin speech generated by three diverse cloning models (GPT-SoVITS, Bert-VITS2, RVC). They measured the bypass rate of the verification system and the Equal Error Rate (EER) of the detector under both in-domain and challenging out-of-domain attack conditions using eight additional unseen TTS models.
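The bypass-rate measurement can be sketched as scoring cloned utterances against an enrolled speaker embedding and counting acceptances. This is a simplified illustration, not the paper's code: the cosine-similarity decision rule is standard for ECAPA-TDNN-style verification, but the threshold value and the embedding vectors here are hypothetical.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def bypass_rate(enrolled_emb, clone_embs, threshold=0.25):
    """Fraction of cloned utterances whose similarity to the enrolled
    speaker exceeds the acceptance threshold (threshold is illustrative)."""
    accepted = sum(cosine_score(enrolled_emb, e) >= threshold
                   for e in clone_embs)
    return accepted / len(clone_embs)

# Toy demo with made-up 4-dim embeddings (real systems use ~192 dims):
enrolled = np.array([1.0, 0.2, 0.1, 0.0])
clones = [np.array([0.9, 0.3, 0.1, 0.1]),   # close clone → accepted
          np.array([-1.0, 0.5, 0.0, 0.2])]  # dissimilar → rejected
print(bypass_rate(enrolled, clones))  # → 0.5
```

In the paper's setting, a bypass rate of 82.7% means that most utterances from clones fine-tuned on 1–5 minutes of target speech scored above the verification threshold.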
Datasets
AISHELL-3, VoxCeleb, ASVspoof 2021 LA, ASVspoof 2021 DF
Model(s)
ECAPA-TDNN, XLS-R, AASIST, GPT-SoVITS, Bert-VITS2, RVC
Author countries
Hong Kong, China