VoxAnchor: Grounding Speech Authenticity in Throat Vibration via mmWave Radar

Authors: Mingda Han, Huanqi Yang, Chaoqun Li, Wenhao Li, Guoming Zhang, Yanni Yang, Yetong Cao, Weitao Xu, Pengfei Hu

Published: 2026-03-29 07:51:08+00:00

AI Summary

VoxAnchor is a novel system that verifies speech authenticity by leveraging the physical coherence between speech acoustics and contactless millimeter-wave (mmWave) radar-sensed throat vibrations. It employs a cross-modal framework with modality-specific encoders and contrastive learning to detect subtle, word-level mismatches that expose diverse audio forgeries. The system achieves robust, fine-grained detection with an overall Equal Error Rate (EER) of 0.0173 across various attack types, including editing, splicing, replay, and deepfakes.

Abstract

Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.


Key findings

VoxAnchor achieved an overall EER of 0.0173 and an AUC of 0.9946, demonstrating strong discrimination against diverse forgeries, including physical replay and AI-synthesized deepfakes. It maintained high True Acceptance Rates (over 92% for word-level attacks) at low False Acceptance Rates, and generalized to unseen speakers and different languages. The system remained robust under moderate environmental noise and slight body movements, while operating with latency low enough for real-time use.

Approach

VoxAnchor physically grounds speech authentication by comparing synchronized audio signals with fine-grained throat vibrations captured by a mmWave radar. The system uses a C3-Net architecture comprising Deformable Convolutional Networks (DCN) and Vision Transformers (ViT) as modality-specific encoders, followed by a Cross-Modal Attention (CMA) module. It is trained using a combination of InfoNCE loss for global semantic alignment and Normalized Cross-Correlation (NCC) loss for local temporal coherence to identify inconsistencies indicative of forgery.
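The training objective described above combines a CLIP-style InfoNCE term (global semantic alignment of paired audio/radar embeddings) with a normalized cross-correlation term (local temporal coherence). A minimal pure-Python sketch of how these two terms could be combined; the `alpha` weighting and all function names are illustrative, not taken from the paper:

```python
import math

def infonce_loss(audio_emb, radar_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (CLIP-style).

    Matched audio/radar pairs (same batch index) are positives; every
    other pairing in the batch serves as a negative.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def norm(u):
        return math.sqrt(dot(u, u)) or 1.0

    n = len(audio_emb)
    # Cosine-similarity logits scaled by temperature.
    sim = [[dot(a, r) / (norm(a) * norm(r)) / temperature
            for r in radar_emb] for a in audio_emb]

    def xent(logits, target):
        # Numerically stable cross-entropy for a single row of logits.
        m = max(logits)
        logsum = m + math.log(sum(math.exp(x - m) for x in logits))
        return logsum - logits[target]

    # Average the audio->radar and radar->audio directions.
    a2r = sum(xent(sim[i], i) for i in range(n)) / n
    r2a = sum(xent([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (a2r + r2a)

def ncc(x, y):
    """Normalized cross-correlation of two equal-length 1-D signals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc)) or 1.0
    return num / den

def total_loss(audio_emb, radar_emb, audio_seq, radar_seq, alpha=0.5):
    # Global semantic alignment plus local temporal coherence;
    # alpha is a hypothetical weighting, not specified in the summary.
    return infonce_loss(audio_emb, radar_emb) + alpha * (1.0 - ncc(audio_seq, radar_seq))
```

At inference time, a genuine recording should yield high cross-modal similarity and near-unity cross-correlation, while a tampered word produces a localized mismatch in both terms.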

Datasets

A custom-collected dataset from 21 volunteers reading fixed texts, augmented with two adversarial testing datasets (Sentence-level Tampering and Word-level Tampering) created using the Whisper model for word-level segmentation and OpenAI TTS for generative deepfakes.
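The word-level tampering procedure amounts to a splice: given a word's time boundaries (e.g., from Whisper's word-level timestamps) and a TTS-generated replacement waveform at the same sample rate, the original word's samples are swapped out. A minimal illustrative sketch; the function name and signature are assumptions, not the paper's tooling:

```python
def splice_word(audio, sr, word_start, word_end, synth_word):
    """Replace one word's samples with a synthesized segment.

    `audio` is a list of samples, `sr` the sample rate in Hz, and
    (`word_start`, `word_end`) the word boundary in seconds, e.g. from
    Whisper's word-level timestamps. `synth_word` is the replacement
    waveform (e.g. from a TTS system), already resampled to `sr`.
    """
    s = int(word_start * sr)  # first sample of the original word
    e = int(word_end * sr)    # one past its last sample
    return audio[:s] + list(synth_word) + audio[e:]
```

Because the mmWave-sensed throat vibration still encodes the original word, such a splice leaves the audio stream locally inconsistent with the radar stream, which is exactly the mismatch the cross-modal framework is trained to flag.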

Model(s)

C3-Net (CLIP-inspired Cross-modal Contrastive Coherence Network), Deformable Convolutional Network (DCN), Vision Transformer (ViT), Cross-Modal Attention (CMA) module.

Author countries

China