Physics-Guided Deepfake Detection for Voice Authentication Systems

Authors: Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari

Published: 2025-12-04 23:37:18+00:00

AI Summary

The paper presents a framework to address the dual threats of sophisticated deepfake attacks and control-plane poisoning in networked voice authentication systems. The system fuses interpretable physics features modeling vocal tract dynamics with representations from a self-supervised learning module, then processes the fused features through a Multi-Modal Ensemble Architecture. A Bayesian ensemble provides uncertainty estimates, enhancing robustness against both advanced synthesis attacks and malicious updates in federated edge learning protocols.

Abstract

Voice authentication systems deployed at the network edge face dual threats: a) sophisticated deepfake synthesis attacks and b) control-plane poisoning in distributed federated learning protocols. We present a framework coupling physics-guided deepfake detection with uncertainty-aware edge learning. The framework fuses interpretable physics features modeling vocal tract dynamics with representations from a self-supervised learning module. These representations are then processed by a Multi-Modal Ensemble Architecture, followed by a Bayesian ensemble that provides uncertainty estimates. Incorporating physics-based evaluation of audio characteristics together with per-sample uncertainty estimates allows the proposed framework to remain robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.


Key findings
The framework achieved a 6.80% Equal Error Rate (EER) on ASVspoof 2019 LA and 9.05% EER on ASVspoof 2021 LA, demonstrating cross-dataset generalization. Physics-derived features were empirically validated to provide consistent discriminative signals independent of the neural backbone. The system integrates uncertainty quantification, providing the metric needed for trust-based aggregation in edge learning, while its 149 ms inference latency keeps it feasible for non-interactive edge deployment.
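
For reference, EER is the operating point at which the false acceptance rate (spoofs accepted) equals the false rejection rate (bonafide rejected). Below is a minimal numpy sketch of how such a number is computed from detection scores; it is illustrative only, not the authors' evaluation code, and the toy data is random so the printed EER sits near chance level (0.5).

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the Equal Error Rate (EER): the threshold at which
    the false acceptance rate (FAR) and false rejection rate (FRR)
    coincide. Convention assumed here: labels 1 = bonafide, 0 = spoof;
    higher score = more bonafide-like."""
    bona = scores[labels == 1]
    spoof = scores[labels == 0]
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(spoof >= t)  # spoofed samples accepted
        frr = np.mean(bona < t)    # bonafide samples rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy usage with random scores, so EER will be near 0.5 (chance).
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
labels = rng.integers(0, 2, size=1000)
print(f"EER: {equal_error_rate(scores, labels):.4f}")
```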
Approach
The approach couples physics-guided feature extraction (modeling vocal tract dynamics via rotational/translational/vibrational features) with high-level representations from a frozen WavLM backbone. After orthogonal feature fusion via QR decomposition, the combined vector is processed by a Hybrid Detection Backbone consisting of parallel Vision Transformer (ViT), Graph Neural Network (GNN), and Gradient Boosting (LightGBM) branches. Bayesian uncertainty is then quantified using MC Dropout sampling for calibrated trust assessment, necessary for screening malicious client updates in federated learning.
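
The fusion and uncertainty steps are the least standard parts of this pipeline, so a minimal PyTorch sketch of one plausible reading follows. All names, dimensions, and the dropout rate are illustrative assumptions rather than the paper's implementation; the QR step here residualizes the SSL embeddings against the physics feature subspace, which is one common way to realize orthogonal feature fusion.

```python
import torch
import torch.nn as nn

def orthogonal_fuse(physics_feats, ssl_feats):
    """One plausible reading of QR-based orthogonal fusion:
    orthonormalize the physics feature columns via QR, project that
    subspace out of the SSL features, and concatenate, so the SSL
    stream contributes only directions not already explained by the
    physics features (within the batch; requires batch > d_phys)."""
    q, _ = torch.linalg.qr(physics_feats, mode="reduced")  # (batch, d_phys)
    ssl_resid = ssl_feats - q @ (q.T @ ssl_feats)          # remove physics span
    return torch.cat([physics_feats, ssl_resid], dim=1)

class MCDropoutHead(nn.Module):
    """Classification head kept in train mode at inference so dropout
    stays stochastic; repeated forward passes yield a mean score and a
    variance usable as an uncertainty estimate."""
    def __init__(self, d_in, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)

@torch.no_grad()
def mc_dropout_predict(head, x, n_samples=30):
    head.train()  # keep dropout active during inference-time sampling
    preds = torch.stack([head(x) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)  # per-sample score and uncertainty

# Toy usage with assumed dimensions.
phys = torch.randn(32, 9)     # e.g. rotational/translational/vibrational stats
ssl = torch.randn(32, 1024)   # e.g. pooled WavLM-Large embeddings
fused = orthogonal_fuse(phys, ssl)
head = MCDropoutHead(fused.shape[1])
score, unc = mc_dropout_predict(head, fused, n_samples=20)
print(score.shape, unc.shape)
```

In a federated setting, the per-sample variance from the MC Dropout passes is the quantity a server-side trust-based aggregator could weigh when screening client updates, which is why the head is sampled rather than run once deterministically.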
Datasets
ASVspoof 2019 LA and PA, ASVspoof 2021 LA and PA.
Model(s)
WavLM-Large, Vision Transformer (ViT), Graph Neural Network (GNN), LightGBM (Gradient Boosting), Multi-Modal Ensemble Architecture, Bayesian ensemble (MC Dropout).
Author countries
Australia