Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Authors: Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

Published: 2025-09-09 12:43:59+00:00

AI Summary

This paper introduces the Spectral Masking and Interpolation Attack (SMIA), a novel black-box adversarial attack designed to bypass both voice authentication systems (VAS) and anti-spoofing countermeasures (CMs). SMIA strategically manipulates inaudible frequency regions of AI-generated audio, creating adversarial samples that are perceptually authentic yet effectively deceive state-of-the-art defenses. The attack demonstrates high success rates, highlighting critical vulnerabilities in current voice biometric security paradigms.

Abstract

Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite improvements from deep learning, they remain severely vulnerable to sophisticated threats such as deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in zones imperceptible to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.


Key findings
SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. The attack's stochastic nature avoids conspicuous artifacts, making it stealthier and more resilient to detection than prior methods. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks, necessitating a paradigm shift toward dynamic defenses.
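As a reading aid for the ASR figures above, the following minimal sketch shows one common way such a rate is computed: the percentage of adversarial samples that the target accepts. The rule that a combined VAS/CM system counts a success only when both the CM and the VAS accept the same sample is an assumption made for illustration, the function names are hypothetical, and the sample outcomes are invented rather than taken from the paper.

```python
def attack_success_rate(outcomes):
    """Percentage of adversarial samples accepted by the target system."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def combined_outcomes(cm_accepts, vas_accepts):
    """Assumed rule: a sample beats the combined system only if the CM
    labels it bona fide AND the VAS accepts it as the claimed speaker."""
    if len(cm_accepts) != len(vas_accepts):
        raise ValueError("outcome lists must have the same length")
    return [c and v for c, v in zip(cm_accepts, vas_accepts)]


# Illustrative outcomes for ten adversarial samples (not data from the paper).
cm = [True] * 10            # every sample passes the countermeasure
vas = [True] * 9 + [False]  # one sample is rejected by speaker verification

print(attack_success_rate(cm))                          # 100.0
print(attack_success_rate(combined_outcomes(cm, vas)))  # 90.0
```

Under this assumed rule the combined rate can never exceed the lower of the two standalone rates, which is consistent with the combined figure being the smallest of the three reported.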
Approach
The SMIA framework utilizes a black-box optimization algorithm, specifically a Tree-structured Parzen Estimator (TPE), to iteratively search for optimal perturbation parameters based on feedback from the target system. These perturbations are applied through spectral masking and interpolation techniques, which subtly alter low-energy, inaudible time-frequency bins in the audio spectrogram to evade detection while preserving speaker identity.
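The idea described above can be sketched roughly as follows, assuming NumPy is available. This is not the authors' implementation. The parameter names (`threshold_db`, `mask_prob`, `interp_prob`), the neighbour-averaging interpolation rule, and the hypothetical `score_fn` callback standing in for the target system's feedback are all assumptions. Plain random search is used here in place of the Tree-structured Parzen Estimator to keep the example dependency-free.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed frames followed by a real FFT per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=128):
    """Overlap-add inverse of stft(), normalised by the summed squared window."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(frames) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def perturb_spectrogram(spec, threshold_db=-40.0, mask_prob=0.3, interp_prob=0.3, seed=0):
    """Randomly mask or interpolate low-energy bins; louder bins are untouched."""
    rng = np.random.default_rng(seed)
    mag, phase = np.abs(spec), np.angle(spec)
    floor = mag.max() * 10.0 ** (threshold_db / 20.0)  # energy cut-off (assumed rule)
    low = mag < floor
    draw = rng.random(mag.shape)
    masked = low & (draw < mask_prob)                   # spectral masking: zero the bin
    interpolated = low & ~masked & (draw < mask_prob + interp_prob)
    # Interpolation: mean of the two frequency neighbours, capped at the cut-off.
    padded = np.pad(mag, ((0, 0), (1, 1)), mode="edge")
    neighbour_mean = np.minimum(0.5 * (padded[:, :-2] + padded[:, 2:]), floor)
    new_mag = np.where(masked, 0.0, np.where(interpolated, neighbour_mean, mag))
    changed = masked | interpolated
    return np.where(changed, new_mag * np.exp(1j * phase), spec), low


def random_search(x, score_fn, n_trials=20, seed=0):
    """Black-box loop: only score_fn's output is observed (stand-in for TPE)."""
    rng = np.random.default_rng(seed)
    best_score, best_audio, best_params = score_fn(x), x, None
    for trial in range(n_trials):
        params = {
            "threshold_db": rng.uniform(-60.0, -30.0),
            "mask_prob": rng.uniform(0.0, 0.5),
            "interp_prob": rng.uniform(0.0, 0.5),
            "seed": trial,
        }
        new_spec, _ = perturb_spectrogram(stft(x), **params)
        candidate = istft(new_spec)
        score = score_fn(candidate)
        if score > best_score:
            best_score, best_audio, best_params = score, candidate, params
    return best_score, best_audio, best_params
```

In this sketch `score_fn` would wrap a query to the target system and return a higher value when the audio is more likely to be accepted. A sequential model-based optimizer such as TPE would replace the uniform sampling inside the loop with proposals informed by the scores of earlier trials.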
Datasets
LibriSpeech, ASVspoof 2019
Model(s)
RawNet2, RawGAT-ST, Raw PC-DARTS
Author countries
Australia, India