SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops

Authors: Eshaq Jamdar, Amith Kamath Belman

Published: 2025-02-13 18:05:12+00:00

AI Summary

This paper introduces SyntheticPop, a novel data poisoning attack method targeting Voice Authentication (VA) systems enhanced with the VoicePop defense mechanism. SyntheticPop embeds synthetic 'pop' noises into spoofed audio samples during training, significantly degrading the VA+VoicePop system's phoneme recognition capabilities. The attack achieves a high success rate, demonstrating a critical vulnerability in current voice authentication systems against logical attacks.

Abstract

Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems like banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, VoicePop, aims to distinguish an individual's unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The SyntheticPop attack involves embedding synthetic pop noises into spoofed audio samples, significantly degrading the model's performance. We achieve an attack success rate of over 95% while poisoning 20% of the training dataset. Our experiments demonstrate that VA+VoicePop achieves 69% accuracy under normal conditions, 37% accuracy when subjected to a baseline label flipping attack, and just 14% accuracy under our proposed SyntheticPop attack, emphasizing the effectiveness of our method.


Key findings
The VA+VoicePop system achieved 69% accuracy under normal conditions and 37% accuracy under a baseline label flipping attack. Under the proposed SyntheticPop attack (poisoning 20% of the training data), the system's accuracy plummeted to 14%. The SyntheticPop attack achieved a success rate of over 95%, causing the model to misclassify spoofed samples as genuine and exposing a significant vulnerability to data poisoning.
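The label-flipping baseline mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the label encoding (0 = spoofed, 1 = genuine), the function name, and the random-selection strategy are all assumptions; only the idea of relabeling a fraction (here 20%) of spoofed training samples as genuine comes from the paper.

```python
import numpy as np

def flip_labels(y, fraction=0.2, spoof_label=0, genuine_label=1, seed=0):
    """Baseline poisoning: relabel a random fraction of spoofed
    training samples as genuine. Encoding (0/1) is an assumption."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    spoof_idx = np.flatnonzero(y == spoof_label)
    n_flip = int(fraction * len(spoof_idx))
    flip_idx = rng.choice(spoof_idx, size=n_flip, replace=False)
    y[flip_idx] = genuine_label
    return y
```

A classifier trained on labels poisoned this way learns to associate spoof-like features with the genuine class, which is consistent with the accuracy drop from 69% to 37% reported above.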
Approach
The SyntheticPop attack injects synthetic 'pop' noises, generated as a subtle low-frequency sine wave (e.g., amplitude 0.5, frequency 90 Hz), directly into spoofed audio samples within the training dataset. This manipulation confuses the VA+VoicePop system's GFCC feature extraction, causing it to misclassify fake audio as real during authentication.
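The pop-injection step described above can be sketched as follows. The amplitude (0.5) and frequency (90 Hz) come from the summary; the sample rate, pop duration, insertion offset, and function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def make_pop(sample_rate=16000, duration=0.02, amplitude=0.5, freq=90.0):
    """Generate a short synthetic 'pop': a low-frequency sine burst.
    Duration and sample rate are assumed values."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq * t)

def inject_pop(audio, pop, position=0):
    """Overlay the pop onto a spoofed waveform at a sample offset,
    clipping to the valid [-1, 1] range."""
    poisoned = audio.copy()
    end = min(position + len(pop), len(poisoned))
    poisoned[position:end] += pop[: end - position]
    return np.clip(poisoned, -1.0, 1.0)
```

Applied to 20% of the spoofed training samples, such bursts mimic the plosive "pop" artifacts that VoicePop's liveness check relies on, steering its GFCC features toward the genuine class.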
Datasets
ASVSpoof 2019 dataset
Model(s)
Support Vector Machine (SVM) classifier for VA+VoicePop, which uses Gammatone Frequency Cepstral Coefficient (GFCC) features for liveness detection.
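The classifier described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' implementation: the GFCC features are replaced here by synthetic stand-in vectors (extracting real GFCCs would require an audio front-end not shown), and the kernel choice and feature dimensionality are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for GFCC feature vectors: 20-dim features for 50 genuine
# and 50 spoofed utterances (synthetic data, illustration only).
rng = np.random.default_rng(0)
genuine = rng.normal(0.0, 1.0, size=(50, 20))
spoofed = rng.normal(1.5, 1.0, size=(50, 20))
X = np.vstack([genuine, spoofed])
y = np.array([1] * 50 + [0] * 50)  # 1 = genuine, 0 = spoofed (assumed encoding)

# SVM liveness classifier over (stand-in) GFCC features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Poisoning attacks such as SyntheticPop target exactly this training step: corrupted spoofed samples shift the SVM's decision boundary so that spoofed inputs fall on the genuine side at test time.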
Author countries
United States