Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

Published: 2024-01-20 07:30:22+00:00

Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

AI Summary

This paper introduces a generalized standalone automatic speaker verification (G-SASV) system that handles spoofing attacks without requiring a separate countermeasure (CM) module during the authentication phase. It enhances a simple deep neural network backend by leveraging limited CM training data through domain adaptation and multi-task learning, integrating spoof embeddings at the training stage. Experiments on the ASVspoof 2019 logical access dataset improve on statistical ASV backends by up to 36.2% and 49.8% in equal error rate on the joint (bonafide and spoofed) and spoofed conditions, respectively.

Abstract

It is now well known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to protecting ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module that classifies speech input as either a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization effort at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort impostors (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protection and more economical computation. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.


Key findings
On the ASVspoof 2019 LA dataset, the G-SASV system reduced the joint Equal Error Rate (EER) by up to 36.2% and the spoof EER by up to 49.8% compared to statistical ASV backends. Multi-task learning was highly effective at improving generalization, particularly when 'synthesizer type' was used as the meta-attribute for auxiliary classification, and it did so while removing the need for a separate CM module at authentication.
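To make the EER figures above concrete, here is a minimal sketch of how an equal error rate can be estimated from trial scores. It is not the paper's evaluation code. The function name, the toy scores, and the way the joint condition pools zero-effort non-targets with spoofed trials as a single negative class are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, negative_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is taken at
    the threshold where the miss rate and the false-acceptance rate are closest.
    """
    thresholds = sorted(set(target_scores) | set(negative_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in negative_scores) / len(negative_scores)
        gap = abs(miss - false_accept)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_accept) / 2
    return eer


# Toy scores (hypothetical, not taken from the paper).
target = [0.9, 0.8, 0.7, 0.4]      # bonafide trials from the claimed speaker
nontarget = [0.6, 0.3, 0.2, 0.1]   # bonafide trials from other speakers
spoof = [0.85, 0.75, 0.3, 0.2]     # spoofed trials

sv_eer = equal_error_rate(target, nontarget)             # targets vs. non-targets
spoof_eer = equal_error_rate(target, spoof)              # targets vs. spoofed trials
joint_eer = equal_error_rate(target, nontarget + spoof)  # assumed pooling of both negatives

print(f"SV EER: {sv_eer:.3f}, spoof EER: {spoof_eer:.3f}, joint EER: {joint_eer:.3f}")
```

A lower EER on the spoof and joint conditions means fewer spoofed trials are accepted at the operating point where misses and false acceptances balance.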
Approach
The proposed G-SASV approach generalizes an ASV system against spoofing attacks by employing a simple 3-layer Multi-Layer Perceptron (MLP) as a backend classifier, operating solely on concatenated speaker embeddings at authentication. During training, the system integrates spoofing information by leveraging limited CM data via domain adaptation techniques (network-wise and structural transformation) and multi-task learning, incorporating synthetic spoof embeddings and meta-attributes (e.g., attack type, vocoder, synthesizer) from CM data into the training objective.
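The backend described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the layer sizes, the placement of the auxiliary head on the last hidden layer, the loss weight aux_weight, and all function names are assumptions. It is written in plain Python to stay self-contained; a real system would use a deep learning framework and ECAPA-TDNN embeddings.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Apply a dense layer: weights @ x + bias."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def build_params(rng, emb_dim, hidden_dim, num_aux_classes):
    """Two hidden layers plus an output layer split into two heads (assumed layout)."""
    return {
        "hidden1": init_layer(rng, 2 * emb_dim, hidden_dim),
        "hidden2": init_layer(rng, hidden_dim, hidden_dim),
        "accept_head": init_layer(rng, hidden_dim, 1),             # accept vs. reject
        "aux_head": init_layer(rng, hidden_dim, num_aux_classes),  # meta-attribute classes
    }


def forward(params, enroll_emb, test_emb):
    """Score one trial from the concatenated enrollment and test embeddings."""
    x = list(enroll_emb) + list(test_emb)
    h = relu(dense(x, params["hidden1"]))
    h = relu(dense(h, params["hidden2"]))
    accept_logit = dense(h, params["accept_head"])[0]
    aux_logits = dense(h, params["aux_head"])
    return accept_logit, aux_logits


def binary_cross_entropy(logit, label):
    """Numerically stable BCE on a logit: softplus(logit) - label * logit."""
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def multitask_loss(params, enroll_emb, test_emb, accept_label, aux_label, aux_weight=0.5):
    """Verification loss plus a weighted auxiliary meta-attribute loss (training only)."""
    accept_logit, aux_logits = forward(params, enroll_emb, test_emb)
    return (binary_cross_entropy(accept_logit, accept_label)
            + aux_weight * cross_entropy(aux_logits, aux_label))


rng = random.Random(0)
params = build_params(rng, emb_dim=8, hidden_dim=16, num_aux_classes=3)
enroll = [rng.gauss(0.0, 1.0) for _ in range(8)]
test = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Training: both heads contribute. Authentication: only the accept logit is used.
print("loss:", multitask_loss(params, enroll, test, accept_label=1, aux_label=2))
print("score:", forward(params, enroll, test)[0])
```

In this sketch the auxiliary head only shapes the shared hidden layers during training. At authentication the accept logit alone serves as the trial score, which is consistent with the summary's point that no separate CM module is needed at test time.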
Datasets
ASVspoof 2019 LA, VoxCeleb1 (VoxCeleb1-E, VoxCeleb1-H)
Model(s)
ECAPA-TDNN (for speaker embeddings), AASIST (used to create synthetic spoof embeddings during training), 3-layer Multi-Layer Perceptron (MLP) (as the backend classifier)
Author countries
France, Finland, Japan, India, Hong Kong