Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces

Authors: Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh

Published: 2025-11-13 20:43:31+00:00

Comment: Accepted to IJCNLP-AACL 2025

AI Summary

This paper introduces RHYME, a novel framework for generalizable audio deepfake detection across diverse speech synthesis paradigms, including conventional TTS, diffusion, and flow-matching generators. RHYME achieves synthesis-invariant detection by fusing utterance-level embeddings from pretrained speech encoders using non-Euclidean projections into hyperbolic and spherical manifolds, unified via Riemannian barycentric averaging. This geometry-aware approach effectively aligns structural distortions common to synthetic speech, leading to robust generalization in cross-paradigm and unseen-generator conditions.

Abstract

In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds, where hyperbolic geometry excels at modeling hierarchical generator families and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting a new state of the art in cross-paradigm ADD.
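
The two projections the abstract names have standard closed forms. As a minimal sketch (not the authors' released code; the function names, NumPy, and the curvature choice c = 1 are our assumptions), the hyperbolic view is the exponential map at the origin of the Poincaré ball and the spherical view is plain L2 normalization:

```python
import numpy as np

def to_poincare_ball(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    sends a Euclidean embedding into hyperbolic space (norm < 1/sqrt(c))."""
    sqrt_c = np.sqrt(c)
    n = np.linalg.norm(v) + 1e-7
    return np.tanh(sqrt_c * n) * v / (sqrt_c * n)

def to_hypersphere(v):
    """L2 normalization onto the unit hypersphere: keeps only the angular,
    energy-invariant part of the embedding."""
    return v / (np.linalg.norm(v) + 1e-7)

# Toy usage with a random 768-d "encoder embedding".
emb = np.random.default_rng(0).normal(size=768)
h = to_poincare_ball(emb)   # hyperbolic view
s = to_hypersphere(emb)     # spherical view
print(np.linalg.norm(h) < 1.0, np.isclose(np.linalg.norm(s), 1.0))
```

The tanh saturation keeps every image strictly inside the ball, while the spherical map discards magnitude entirely, which is the energy invariance the abstract appeals to for periodic vocoder artifacts.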


Key findings
RHYME consistently outperforms individual pretrained models and homogeneous fusion baselines, achieving significantly lower Equal Error Rates (EERs) in cross-corpus and unseen-generator conditions. It sets a new state of the art, with an EER of 14.12% versus an average of 32.44% for the AASIST-L model in cross-domain settings. The geometry-aware fusion strategy lets RHYME learn shared, synthesis-invariant structural cues, yielding robust generalization even against entirely unseen synthetic speech generators while also improving the reliability of its probability estimates.
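
For context, the EER quoted above is the operating point at which the false-acceptance rate (spoofed audio accepted) equals the false-rejection rate (bona fide audio rejected). A minimal sketch of computing it from detector scores, assuming the usual convention that higher scores mean bona fide (this helper is ours, not from the paper):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: sweep thresholds, return the point where FAR == FRR."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # spoofs accepted
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # bona fide rejected
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Toy usage: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
print(f"EER: {100 * eer:.2f}%")
```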
Approach
The RHYME framework fuses utterance-level embeddings from multiple pretrained speech encoders. It projects these representations into hyperbolic space to model hierarchical generator families and spherical space to capture angular, energy-invariant cues like periodic vocoder artifacts. These projected embeddings are then combined using Riemannian barycentric averaging in the Poincaré ball, forming a synthesis-agnostic representation for classification.
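
The paragraph leaves the averaging step abstract; on the Poincaré ball, a Riemannian barycentre is typically computed as a Karcher mean, a fixed-point iteration over the manifold's logarithmic and exponential maps. A sketch under that assumption (curvature -1; helper names are ours, not the paper's):

```python
import numpy as np

EPS = 1e-7

def mobius_add(x, y):
    """Mobius addition, the Poincare-ball analogue of vector addition."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / max(1 + 2 * xy + x2 * y2, EPS)

def exp_map(x, v):
    """Exponential map at x: shoot along tangent vector v, land in the ball."""
    lam = 2.0 / (1.0 - np.dot(x, x))
    n = np.linalg.norm(v)
    if n < EPS:
        return x
    return mobius_add(x, np.tanh(lam * n / 2.0) * v / n)

def log_map(x, y):
    """Logarithmic map at x: tangent vector pointing from x toward y."""
    lam = 2.0 / (1.0 - np.dot(x, x))
    u = mobius_add(-x, y)
    n = np.linalg.norm(u)
    if n < EPS:
        return np.zeros_like(x)
    return (2.0 / lam) * np.arctanh(min(n, 1.0 - EPS)) * u / n

def karcher_mean(points, iters=20):
    """Riemannian barycentre: repeatedly move the estimate along the
    average tangent direction until it settles at the Karcher mean."""
    mu = np.mean(points, axis=0)  # Euclidean init, still inside the ball
    for _ in range(iters):
        tangent = np.mean([log_map(mu, p) for p in points], axis=0)
        mu = exp_map(mu, tangent)
    return mu

# Toy usage: fuse seven random 16-d points lying strictly inside the ball.
raw = np.random.default_rng(0).normal(size=(7, 16))
pts = raw / (1.0 + np.linalg.norm(raw, axis=1, keepdims=True))
centre = karcher_mean(pts)
print(np.linalg.norm(centre) < 1.0)
```

The fixed point minimizes the sum of squared geodesic distances to the projected embeddings, which is what makes the fused point a geometry-respecting average rather than a plain Euclidean mean.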
Datasets
DFADD (Diffusion and Flow-Matching Based Audio Deepfake Dataset), ASVspoof 2019 (Logical Access subset). The DFADD benchmark was also extended with two new diffusion-based generators.
Model(s)
RHYME framework; pretrained speech encoders used as backbones include USAD, PaSST, Whisper, x-vector, WavLM, HuBERT, and Wav2Vec 2.0.
Author countries
India, UK