An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems

Authors: Chanwoo Park, Chanwoo Kim

Published: 2026-01-29 22:58:20+00:00

AI Summary

This research introduces Masked Energy Perturbation (MEP), a novel adversarial attack method that uses power spectrum energy masking to create imperceptible perturbations in voice data. MEP masks low-energy regions in the frequency domain before generating perturbations, which makes the attacks less noticeable to humans while still misleading speaker recognition systems. The method demonstrates strong performance in both audio quality preservation and evasion effectiveness against state-of-the-art speaker recognition models.

Abstract

Evasion attacks pose significant threats to AI systems, exploiting vulnerabilities in machine learning models to bypass detection mechanisms. The widespread use of voice data, including deepfakes, in promising future industries is currently hindered by insufficient legal frameworks. Adversarial attack methods have emerged as the most effective countermeasure against the indiscriminate use of such data. This research introduces masked energy perturbation (MEP), a novel approach that uses the power spectrum for energy masking of the original voice data. MEP applies masking to low-energy regions in the frequency domain before generating adversarial perturbations, targeting areas less noticeable to the human auditory model. The study primarily employs advanced speaker recognition models, including ECAPA-TDNN and ResNet34, which have shown remarkable performance in speaker verification tasks. The proposed MEP method demonstrated strong performance in both audio quality and evasion effectiveness. The energy masking approach effectively minimizes degradation in perceptual evaluation of speech quality (PESQ) scores, indicating that the human listener perceives minimal distortion despite the adversarial perturbations. Specifically, in the PESQ evaluation, the relative performance of the MEP method was 26.68% when compared to the fast gradient sign method (FGSM) and iterative FGSM.


Key findings
The I-MEP method achieved the highest PESQ scores (3.7657-3.7709) and SNR values (38.07-38.14 dB) across all tested speaker recognition models, indicating better audio quality preservation than conventional attacks. MEP and I-MEP consistently outperformed FGSM, I-FGSM, MI-FGSM, and PGD in both audio quality and evasion: they produced higher Equal Error Rates (EER), misleading speaker recognition systems with minimal perceptual distortion.
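Two of the metrics named above are simple enough to sketch directly. The following is a minimal illustration, not the authors' evaluation code: `snr_db` and `equal_error_rate` are hypothetical helper names, and the EER routine uses a coarse threshold sweep over the observed scores rather than an interpolated ROC curve. PESQ is omitted because it requires a dedicated implementation of the ITU-T P.862 algorithm.

```python
import math


def snr_db(clean, adversarial):
    """Signal-to-noise ratio in dB, treating (adversarial - clean) as noise."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((a - x) ** 2 for x, a in zip(clean, adversarial))
    if noise_power == 0:
        return math.inf  # identical waveforms: no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)


def equal_error_rate(genuine, impostor):
    """Approximate EER from verification scores (higher = more similar).

    Sweeps every observed score as a threshold and returns the mean of the
    false acceptance rate (FAR) and false rejection rate (FRR) at the
    threshold where the two are closest.
    """
    best_gap, best_eer = None, None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

In this setting a higher SNR means a quieter perturbation, while a higher EER means the verifier confuses genuine and impostor trials more often, which is what an evasion attack aims for.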
Approach
The authors propose Masked Energy Perturbation (MEP), which first calculates the energy distribution across time-frequency bins and then masks out the low-energy regions in the frequency domain. Adversarial perturbations are generated and applied only to the remaining high-energy regions. An iterative version, I-MEP, refines these perturbations over multiple steps to improve attack success while preserving audio quality.
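The masking step can be sketched roughly as follows. This is an illustrative approximation under several assumptions, not the authors' implementation: it uses non-overlapping rectangular frames and a naive DFT instead of a windowed STFT, a global quantile threshold (`keep_ratio`) to pick bins, and an FGSM-style `eps * sign(grad)` step. The gradient `grad` is assumed to come from backpropagation through a speaker model, which is not shown. All function and parameter names are hypothetical, and the `perturb_high_energy` flag exists only so that either selection rule can be tried.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); keeps only the real part of each sample."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]


def energy_mask(signal, frame_len, keep_ratio=0.5, perturb_high_energy=True):
    """Return one 0/1 mask per frame marking the bins that may be perturbed."""
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    half = frame_len // 2 + 1  # a real signal has a symmetric power spectrum
    power = []
    for j in range(n_frames):
        spec = dft(signal[j * frame_len:(j + 1) * frame_len])
        power.append([abs(c) ** 2 for c in spec[:half]])
    # Rank all time-frequency bins by energy and keep a fixed fraction.
    ranked = sorted((p for row in power for p in row), reverse=perturb_high_energy)
    threshold = ranked[max(1, int(len(ranked) * keep_ratio)) - 1]

    def selected(p):
        return p >= threshold if perturb_high_energy else p <= threshold

    # Mirror the half-spectrum decision so the mask stays conjugate-symmetric.
    return [[1.0 if selected(row[min(k, frame_len - k)]) else 0.0
             for k in range(frame_len)] for row in power]


def masked_perturbation(signal, grad, frame_len, eps,
                        keep_ratio=0.5, perturb_high_energy=True):
    """Add eps * sign(grad), restricted to the selected bins of each frame."""
    mask = energy_mask(signal, frame_len, keep_ratio, perturb_high_energy)
    out = list(signal)
    for j, frame_mask in enumerate(mask):
        start = j * frame_len
        step = [eps * math.copysign(1.0, g) if g != 0 else 0.0
                for g in grad[start:start + frame_len]]
        # Filter the time-domain step through the frequency-domain mask.
        filtered = idft([c * m for c, m in zip(dft(step), frame_mask)])
        for t, delta in enumerate(filtered):
            out[start + t] += delta
    return out  # samples past the last full frame are left unperturbed
```

An iterative variant in the spirit of I-MEP would call `masked_perturbation` repeatedly with a smaller `eps`, recomputing `grad` on the current adversarial waveform at each step.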
Datasets
LibriSpeech, VoxCeleb, VoxCeleb2
Model(s)
ECAPA-TDNN, ResNetSE34-L, ResNetSE34-V
Author countries
Republic of Korea