SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops

Authors: Eshaq Jamdar, Amith Kamath Belman

Published: 2025-02-13 18:05:12+00:00

AI Summary

This paper introduces SyntheticPop, a novel data poisoning attack method targeting Voice Authentication (VA) systems enhanced with the VoicePop defense mechanism. SyntheticPop embeds synthetic 'pop' noises into spoofed audio samples during training, significantly degrading the VA+VoicePop system's phoneme recognition capabilities. The attack achieves a high success rate, demonstrating a critical vulnerability in current voice authentication systems against logical attacks.

Abstract

Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems like banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, VoicePop, aims to distinguish an individual's unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The SyntheticPop attack involves embedding synthetic pop noises into spoofed audio samples, significantly degrading the model's performance. We achieve an attack success rate of over 95% while poisoning 20% of the training dataset. Our experiments demonstrate that VA+VoicePop achieves 69% accuracy under normal conditions, 37% accuracy when subjected to a baseline label flipping attack, and just 14% accuracy under our proposed SyntheticPop attack, emphasizing the effectiveness of our method.


Key findings
The VA+VoicePop system achieved 69% accuracy under normal conditions and 37% accuracy under a baseline label flipping attack. Under the proposed SyntheticPop attack (poisoning 20% of the training data), the system's accuracy plummeted to 14%. The SyntheticPop attack achieved a success rate of over 95%, causing the model to misclassify spoofed samples as genuine and exposing a significant vulnerability to data poisoning.
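The label-flipping baseline mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the label encoding (0 = spoofed, 1 = genuine), the function name, and the random-selection strategy are all assumptions; only the idea of relabeling a fraction (here 20%) of spoofed training samples as genuine comes from the paper.

```python
import numpy as np

def flip_labels(y, fraction=0.2, spoof_label=0, genuine_label=1, seed=0):
    """Baseline poisoning: relabel a random fraction of spoofed
    training samples as genuine. Encoding (0/1) is an assumption."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    spoof_idx = np.flatnonzero(y == spoof_label)
    n_flip = int(fraction * len(spoof_idx))
    flip_idx = rng.choice(spoof_idx, size=n_flip, replace=False)
    y[flip_idx] = genuine_label
    return y
```

A classifier trained on labels poisoned this way learns to associate spoof-like features with the genuine class, which is consistent with the accuracy drop from 69% to 37% reported above.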
Approach
The SyntheticPop attack injects synthetic 'pop' noises, generated as a subtle low-frequency sine wave (e.g., amplitude 0.5, frequency 90 Hz), directly into spoofed audio samples within the training dataset. This manipulation confuses the VA+VoicePop system's GFCC feature extraction, causing it to misclassify fake audio as real during authentication.
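The pop-injection step described above can be sketched as follows. The amplitude (0.5) and frequency (90 Hz) come from the summary; the sample rate, pop duration, insertion offset, and function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def make_pop(sample_rate=16000, duration=0.02, amplitude=0.5, freq=90.0):
    """Generate a short synthetic 'pop': a low-frequency sine burst.
    Duration and sample rate are assumed values."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq * t)

def inject_pop(audio, pop, position=0):
    """Overlay the pop onto a spoofed waveform at a sample offset,
    clipping to the valid [-1, 1] range."""
    poisoned = audio.copy()
    end = min(position + len(pop), len(poisoned))
    poisoned[position:end] += pop[: end - position]
    return np.clip(poisoned, -1.0, 1.0)
```

Applied to 20% of the spoofed training samples, such bursts mimic the plosive "pop" artifacts that VoicePop's liveness check relies on, steering its GFCC features toward the genuine class.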
Datasets
ASVSpoof 2019 dataset
Model(s)
Support Vector Machine (SVM) classifier for VA+VoicePop, which uses Gammatone Frequency Cepstral Coefficient (GFCC) features for liveness detection.
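The classifier described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' implementation: the GFCC features are replaced here by synthetic stand-in vectors (extracting real GFCCs would require an audio front-end not shown), and the kernel choice and feature dimensionality are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for GFCC feature vectors: 20-dim features for 50 genuine
# and 50 spoofed utterances (synthetic data, illustration only).
rng = np.random.default_rng(0)
genuine = rng.normal(0.0, 1.0, size=(50, 20))
spoofed = rng.normal(1.5, 1.0, size=(50, 20))
X = np.vstack([genuine, spoofed])
y = np.array([1] * 50 + [0] * 50)  # 1 = genuine, 0 = spoofed (assumed encoding)

# SVM liveness classifier over (stand-in) GFCC features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Poisoning attacks such as SyntheticPop target exactly this training step: corrupted spoofed samples shift the SVM's decision boundary so that spoofed inputs fall on the genuine side at test time.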
Author countries
United States