Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

Authors: Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren

Published: 2026-05-18 01:36:46+00:00

Comment: Accepted by IJCAI 2026

AI Summary

This paper introduces Phoneme-based Voice Profiling (PVP), a personalized framework for speaker-specific speech deepfake detection that shifts from macro-utterance to micro-phonetic analysis. PVP models the acoustic distributions of a person-of-interest's (POI's) habitual articulatory patterns using lightweight Gaussian Mixture Models (GMMs) estimated from bona fide reference speech. The framework enables data-efficient profiling, generalizes to unseen spoofing attacks, and provides fine-grained, phoneme-level interpretability. The paper also introduces a large-scale Chinese POI deepfake dataset.

Abstract

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP

Key findings

PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, with substantial reductions in equal error rate (EER) and improvements in AUC across different SSL backbones and on both datasets (ZH-Famous and EN-Famous). The framework remains effective under cross-lingual and unseen-generation conditions, generalizing without spoof-specific training. PVP also offers fine-grained, phoneme-level interpretability: it exposes speaker-centric inconsistencies as phonetic anomaly cues that support forensic analysis.
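Since the findings are reported as EER, a minimal sketch of how that metric is commonly computed may help. It is not taken from the paper's code: the function name `compute_eer` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector whose scores are higher for bona fide speech.

    The EER is the operating point where the false acceptance rate (spoof
    accepted) equals the false rejection rate (bona fide rejected). With a
    finite score set the two rates rarely match exactly, so this sketch picks
    the threshold with the smallest gap and reports the mean of the two rates.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

A lower EER means fewer errors at the balanced operating point. The scores passed in would be whatever a detector emits per utterance; nothing here depends on PVP specifically.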

Approach

The proposed PVP framework creates speaker-specific phonetic profiles by modeling the acoustic distributions of individual phonemes using lightweight Gaussian Mixture Models (GMMs), trained solely on bona fide reference speech. During inference, it evaluates test utterances' phoneme realizations against these profiles and fuses the phoneme-level consistency scores with a global speaker embedding score to determine authenticity. A tiered decision mechanism handles linguistic sparsity.
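The approach described above can be sketched roughly as follows, using scikit-learn's `GaussianMixture`. This is an illustrative sketch, not the authors' implementation. The function names, the hyperparameters (`n_components`, `min_frames`, `alpha`, `min_phonemes`), the linear fusion rule, and the fallback to the global score when few phonemes are covered are all assumptions; the summary does not specify how the tiered decision mechanism works. Phoneme-level embeddings are assumed to be already extracted and grouped by phoneme label.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_phoneme_profiles(reference, n_components=2, min_frames=20, seed=0):
    """Fit one diagonal-covariance GMM per phoneme from bona fide frames only.

    reference: dict mapping phoneme label -> array of shape (n_frames, dim).
    Phonemes with fewer than min_frames frames are skipped (assumed cutoff).
    """
    profiles = {}
    for phoneme, frames in reference.items():
        frames = np.asarray(frames, dtype=float)
        if len(frames) < min_frames:
            continue  # too little reference data to estimate a distribution
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              reg_covar=1e-3, random_state=seed)
        profiles[phoneme] = gmm.fit(frames)
    return profiles


def phoneme_scores(profiles, test_segments):
    """Mean per-frame log-likelihood of each test phoneme under its profile."""
    scores = {}
    for phoneme, frames in test_segments.items():
        if phoneme in profiles:
            frames = np.asarray(frames, dtype=float)
            scores[phoneme] = float(profiles[phoneme].score(frames))
    return scores


def fused_score(phoneme_llk, global_score, alpha=0.5, min_phonemes=3):
    """Hypothetical tiered fusion of phoneme-level and global evidence.

    If fewer than min_phonemes profiled phonemes occur in the test utterance,
    fall back to the global speaker score alone; otherwise mix the two.
    """
    if len(phoneme_llk) < min_phonemes:
        return global_score
    mean_llk = float(np.mean(list(phoneme_llk.values())))
    return alpha * mean_llk + (1 - alpha) * global_score
```

The per-phoneme values returned by `phoneme_scores` are what would make such a system inspectable: a phoneme whose log-likelihood is unusually low under the speaker's profile is a candidate anomaly cue. In practice the log-likelihoods and the global speaker score sit on different scales, so some normalization before fusion would likely be needed; the sketch omits it.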

Datasets

ZH-Famous (a newly introduced large-scale Chinese POI deepfake dataset) and the Famous Figures (EN-Famous) dataset.

Model(s)

Gaussian Mixture Models (GMMs) for phoneme and global speaker profiling. Feature extraction uses wav2vec2-large-xlsr-53 for phoneme boundary alignment, ECAPA-TDNN for global speaker embeddings, and one of several SSL backbones (HuBERT-xlarge, wav2vec2-small, wav2vec2-large, wav2vec2-xlsr-1b, or MMS-300m) for phoneme-level acoustic embeddings.

Author countries

China