Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection
Authors: Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren
Published: 2026-05-18 01:36:46+00:00
Comment: Accepted by IJCAI 2026
AI Summary
This paper introduces Phoneme-based Voice Profiling (PVP), a novel personalized framework for speaker-specific speech deepfake detection that shifts from macro-utterance to micro-phonetic analysis. PVP models the unique acoustic distributions of a Person-of-Interest's (POI) habitual articulatory patterns using lightweight Gaussian Mixture Models (GMMs) estimated from bona fide reference speech. The framework enables data-efficient profiling, generalizes robustly to unseen spoofing attacks, and provides fine-grained, phoneme-level interpretability. The paper also introduces a large-scale Chinese POI deepfake dataset.
Abstract
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
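To make the core idea concrete, the following is a minimal sketch of per-phoneme GMM profiling, not the authors' implementation. The abstract does not specify the acoustic features, the phoneme aligner, the number of mixture components, or the scoring rule, so everything here is an assumption for illustration: frames are random synthetic vectors, phoneme segments are taken as given, each phoneme gets a small diagonal-covariance GMM fitted with plain EM on bona fide frames only, and an utterance is scored by its average log-likelihood under the matching phoneme models. The function names (`fit_diag_gmm`, `build_profile`, `score_utterance`) are hypothetical.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def _weighted_log_probs(X, weights, means, variances):
    """log(w_k * N(x | mean_k, diag(var_k))) for every frame and component."""
    diff = X[:, None, :] - means[None, :, :]                    # (n, k, d)
    mahal = np.sum(diff ** 2 / variances, axis=2)               # (n, k)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (k,)
    return np.log(weights) - 0.5 * (mahal + log_det)


def fit_diag_gmm(X, n_components=2, n_iter=50, reg=1e-3, seed=0):
    """Fit a diagonal-covariance GMM to frames X (n_frames, n_dims) with EM."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means = X[rng.choice(n, size=n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + reg, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        log_p = _weighted_log_probs(X, weights, means, variances)
        resp = np.exp(log_p - _logsumexp(log_p, axis=1)[:, None])
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, 0.0) + reg
    return weights, means, variances


def mean_log_likelihood(X, gmm):
    """Average per-frame log-likelihood of frames X under a fitted GMM."""
    return float(_logsumexp(_weighted_log_probs(X, *gmm), axis=1).mean())


def build_profile(reference_frames):
    """One GMM per phoneme, estimated from bona fide reference frames only."""
    return {ph: fit_diag_gmm(frames) for ph, frames in reference_frames.items()}


def score_utterance(profile, segments):
    """Score (phoneme, frames) segments; higher means closer to the profile."""
    per_phoneme = {}
    for ph, frames in segments:
        if ph in profile:  # phonemes without a reference model are skipped
            per_phoneme.setdefault(ph, []).append(mean_log_likelihood(frames, profile[ph]))
    if not per_phoneme:
        raise ValueError("no segment matches a profiled phoneme")
    detail = {ph: float(np.mean(v)) for ph, v in per_phoneme.items()}
    return float(np.mean(list(detail.values()))), detail


# Synthetic demo: "spoofed" frames are drawn from a shifted distribution.
rng = np.random.default_rng(0)
centers = {"a": 0.0, "i": 2.0}  # hypothetical phoneme labels and feature centers
reference = {ph: rng.normal(c, 1.0, size=(200, 4)) for ph, c in centers.items()}
profile = build_profile(reference)

genuine = [(ph, rng.normal(c, 1.0, size=(30, 4))) for ph, c in centers.items()]
spoofed = [(ph, rng.normal(c + 3.0, 1.0, size=(30, 4))) for ph, c in centers.items()]
genuine_score, genuine_detail = score_utterance(profile, genuine)
spoof_score, spoof_detail = score_utterance(profile, spoofed)
```

In this sketch no spoofed data is used for fitting, which mirrors the abstract's claim that the profile is estimated solely from bona fide speech. The per-phoneme dictionary returned by `score_utterance` illustrates how phoneme-level scores could support interpretability: a phoneme whose score drops sharply relative to the others is the one that deviates most from the speaker's reference distribution.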