HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

Authors: Mohd Mujtaba Akhtar, Girish, Muskaan Singh

Published: 2026-04-19 22:26:28+00:00

Comment: Accepted to ACL 2026

AI Summary

This study introduces Healthcare Codec-Fake Detection (HCFD), a new task for identifying codec-generated synthetic speech under pathological conditions. The authors release Healthcare CodecFake (HCFK), the first pathology-aware dataset containing paired real and synthesized speech across multiple clinical conditions and codecs. They propose PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space, achieving state-of-the-art performance on the HCFD task.

Abstract

In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec-based synthetic speech, since neural codec decoding forms a core building block in modern speech generation pipelines. First, we release Healthcare CodecFake (HCFK), the first pathology-aware dataset containing paired real and NAC-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that SOTA codec-fake detectors trained primarily on healthy speech perform poorly on Healthcare CodecFake, highlighting the need for HCFD-specific models. Second, we demonstrate that PaSST outperforms existing speech-based models for HCFD, benefiting from its patch-based spectro-temporal representation. Finally, we propose PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space. Experiments on HCFK show that PHOENIX-Mamba (PaSST) achieves the best overall performance, reaching 97.04% accuracy on E-Dep, 96.73% on E-Alz, and 96.57% on E-Dys, while maintaining strong results on Chinese with 94.41% (Dep), 94.40% (Alz), and 93.20% (Dys). This geometry-aware formulation enables self-discovered clustering of heterogeneous codec-fake modes in hyperbolic space, facilitating robust discrimination under pathological speech variability, and yields the strongest performance on the HCFD task across clinical conditions and codecs.


Key findings
State-of-the-art codec-fake detectors trained on healthy speech perform poorly on HCFK, indicating a significant domain shift for pathological speech. PaSST proves to be the strongest single-representation baseline. PHOENIX-Mamba, especially when coupled with PaSST, consistently achieves the best overall performance (e.g., 97.04% accuracy on English Depression, with substantially reduced EERs), demonstrating the effectiveness of its multi-evidence pooling and geometry-aware multi-mode reasoning in hyperbolic space for robust deepfake detection in healthcare speech.
Approach
The authors define the HCFD task and construct the HCFK dataset by resynthesizing pathological speech (from conditions like Depression, Alzheimer's, Dysarthria in English and Chinese) using diverse neural audio codecs. They then propose PHOENIX-Mamba, a framework that integrates a Mamba-style selective state-space backbone for long-context temporal modeling with hyperbolic, prototype-based clustering to capture heterogeneous codec artifacts via multiple self-discovered modes in hyperbolic space, using a multi-evidence pooling mechanism.
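To make the geometry-aware clustering idea concrete, the sketch below shows one common way to realize prototype-based mode assignment in hyperbolic space: embeddings and mode prototypes live in the Poincaré ball, and an utterance is softly assigned to modes via a softmax over negative geodesic distances. This is an illustrative sketch only; the function names, the number of modes, and the softmax-over-distances assignment are assumptions, not the authors' published implementation of PHOENIX-Mamba.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points in the Poincare ball model."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

def soft_mode_assignment(embedding, prototypes, temperature=1.0):
    """Soft assignment of an embedding to K self-discovered mode prototypes,
    via a softmax over negative hyperbolic distances (hypothetical scheme)."""
    d = np.array([poincare_distance(embedding, p) for p in prototypes])
    logits = -d / temperature
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy example: one utterance embedding and three hypothetical codec-fake modes,
# all sampled well inside the unit ball so the metric is defined.
rng = np.random.default_rng(0)
z = rng.normal(size=8) * 0.1
protos = [rng.normal(size=8) * 0.1 for _ in range(3)]
weights = soft_mode_assignment(z, protos)
print(weights.round(3))             # mode-membership weights, summing to 1
```

In such a formulation, points near the ball's boundary are exponentially far apart, which is what lets tree-like or heterogeneous structure (here, distinct codec-artifact modes) separate more cleanly than in Euclidean space.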
Datasets
Healthcare CodecFake (HCFK), DAIC-WOZ, EATD-Corpus, ADReSS/ADReSSo, NCMMSC, TORGO, Chinese Dysarthria Speech Database (CDSD)
Model(s)
PHOENIX-Mamba (proposed), PaSST, WavLM, Wav2vec 2.0, Whisper, X-vector (as pre-trained encoders), AASIST, RawNet2, LCNN, SAMO (as baselines)
Author countries
India, UK