Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

Authors: Jinyang Wu, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal

Published: 2026-03-10 09:38:03+00:00

Comment: 5 pages, 3 figures

AI Summary

This paper addresses the underutilization of the discrete, hierarchical structure of neural audio codecs in speech deepfake detection. It proposes Quantizer-Aware Static Fusion (QAF-Static), a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting. This method enables structured codec representations aligned with forensic cues, leading to significant EER reductions while keeping the SSL encoder frozen.

Abstract

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.


Key findings
The proposed QAF-Static method achieves relative EER reductions of 46.2% on ASVspoof 2019 LA and 13.9% on ASVspoof 5 over strong baselines, while updating only 4.4% additional parameters and keeping the SSL backbone frozen. The model learns non-uniform quantizer contributions, indicating that it effectively identifies which RVQ levels carry forensic cues. The approach provides complementary information beyond SSL features and demonstrates improved robustness on specific codec families.
Approach
The method proposes Quantizer-Aware Static Fusion (QAF-Static) to explicitly model the coarse-to-fine hierarchy induced by residual vector quantization (RVQ) in neural audio codecs. It learns dimension-wise static importance weights across residual quantizers to aggregate codec embeddings, which are then fused via late concatenation with features from a frozen self-supervised learning (SSL) speech encoder. A lightweight LSTM and linear classifier perform the final detection.
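The core fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the tensor shapes, the softmax normalization of the importance logits across quantizers, and the function name `qaf_static` are all assumptions for illustration; the downstream LSTM and linear classifier are omitted.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qaf_static(codec_embs, ssl_feats, weight_logits):
    """Quantizer-aware static fusion (illustrative sketch).

    codec_embs:    (Q, T, D) embeddings from Q residual quantizer levels
    ssl_feats:     (T, F) features from the frozen SSL encoder
    weight_logits: (Q, D) learnable dimension-wise importance logits
    Returns:       (T, D + F) fused features for the downstream classifier
    """
    # Normalize importance weights across quantizer levels, per dimension,
    # so each feature dimension has its own mixture over the RVQ hierarchy.
    alpha = softmax(weight_logits, axis=0)
    # Weighted aggregation over quantizer levels: (Q, D) x (Q, T, D) -> (T, D).
    agg = np.einsum('qd,qtd->td', alpha, codec_embs)
    # Late concatenation with the frozen SSL features.
    return np.concatenate([agg, ssl_feats], axis=-1)
```

With zero logits the weights are uniform and the aggregation reduces to a plain mean over quantizer levels; training would move the logits toward the non-uniform contributions the paper reports.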
Datasets
ASVspoof 2019 Logical Access (19LA), ASVspoof 5, CodecFake benchmark
Model(s)
WavLM-Large (SSL backbone), Facebook EnCodec (neural audio codec), single-layer LSTM, linear classifier
Author countries
Singapore, Australia