Speech Replay Detection with x-Vector Attack Embeddings and Spectral Features

Authors: Jennifer Williams, Joanna Rownicka

Published: 2019-09-23 12:27:04+00:00

Comment: Presented at Interspeech 2019

AI Summary

This paper presents a system for the ASVspoof 2019 Challenge Physical Access (PA) task, focusing on detecting speech replay attacks. The proposed countermeasure utilizes convolutional neural networks (CNNs) with a combined feature representation of x-vector attack embeddings and sub-band spectral centroid magnitude coefficients (SCMCs). The system demonstrates improved performance over challenge baselines, suggesting that x-vector attack embeddings can regularize CNN predictions for enhanced robustness.

Abstract

We present our system submission to the ASVspoof 2019 Challenge Physical Access (PA) task. The objective for this challenge was to develop a countermeasure that identifies speech audio as either bona fide or intercepted and replayed. The target prediction was a value indicating that a speech segment was bona fide (positive values) or spoofed (negative values). Our system used convolutional neural networks (CNNs) and a representation of the speech audio that combined x-vector attack embeddings with signal processing features. The x-vector attack embeddings were created from mel-frequency cepstral coefficients (MFCCs) using a time-delay neural network (TDNN). These embeddings jointly modeled 27 different environments and 9 types of attacks from the labeled data. We also used sub-band spectral centroid magnitude coefficients (SCMCs) as features. We included an additive Gaussian noise layer during training as a way to augment the data to make our system more robust to previously unseen attack examples. We report system performance using the tandem detection cost function (tDCF) and equal error rate (EER). Our approach performed better than both of the challenge baselines. Our technique suggests that our x-vector attack embeddings can help regularize the CNN predictions even when environments or attacks are more challenging.
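As an illustration of the EER metric mentioned in the abstract, the sketch below estimates it from two lists of detector scores, following the stated convention that higher (positive) scores indicate bona fide speech. The function name `equal_error_rate` and the simple threshold sweep are assumptions made for this example; this is not the challenge's official scoring tool, and it does not compute tDCF.

```python
import numpy as np

def equal_error_rate(bona_fide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher scores mean bona fide.

    Hypothetical helper for illustration only; it sweeps every observed score
    as a candidate threshold and picks the point where the two error rates
    are closest.
    """
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.unique(np.concatenate([bona, spoof]))  # sorted, unique

    # Miss rate: fraction of bona fide trials rejected (score below threshold).
    miss = np.array([(bona < t).mean() for t in thresholds])
    # False-alarm rate: fraction of spoofed trials accepted (score >= threshold).
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(miss - false_alarm)))
    eer = (miss[idx] + false_alarm[idx]) / 2.0
    return float(eer), float(thresholds[idx])

if __name__ == "__main__":
    eer, thr = equal_error_rate([0.5, -0.5, 1.0, 2.0], [-1.0, 0.0, -2.0, 0.7])
    print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

A lower EER means the bona fide and spoofed score distributions overlap less. With many trials, interpolating between adjacent thresholds would give a smoother estimate than this nearest-point approximation.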


Key findings

The system combining SCMC features with scaled x-vector attack embeddings (xEAs) achieved better performance (lower tDCF and EER) than both LFCC-GMM and CQCC-GMM challenge baselines on development and evaluation sets. The x-vector attack embeddings, particularly when combined with signal features, were found to be effective in capturing replay device quality and environmental variations, contributing to the system's robustness against unseen attack examples.

Approach

The system uses convolutional neural networks (CNNs) trained on a hybrid feature set. This set comprises x-vector attack embeddings, created from MFCCs using a Time-Delay Neural Network (TDNN) to model 27 environments and 9 attack types, concatenated with sub-band spectral centroid magnitude coefficients (SCMCs). An additive Gaussian noise layer is incorporated during training for data augmentation and robustness.
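The feature combination and noise augmentation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the utterance-level embedding is aligned with frame-level SCMCs, so tiling it across frames is an assumption, as are the names `fuse_features` and `additive_gaussian_noise`, the `scale` parameter, and all dimensions and noise levels.

```python
import numpy as np

def fuse_features(scmc_frames, xvector, scale=1.0):
    """Append one scaled utterance-level embedding to every SCMC frame.

    Assumption: the embedding is tiled across frames before concatenation.
    scmc_frames: array of shape (num_frames, num_scmc)
    xvector:     array of shape (embed_dim,)
    """
    scmc_frames = np.asarray(scmc_frames, dtype=float)
    scaled = np.asarray(xvector, dtype=float) * scale
    tiled = np.tile(scaled, (scmc_frames.shape[0], 1))  # (num_frames, embed_dim)
    return np.concatenate([scmc_frames, tiled], axis=1)

def additive_gaussian_noise(features, stddev, training, rng):
    """Add zero-mean Gaussian noise during training; pass through otherwise."""
    if not training:
        return features
    return features + rng.normal(0.0, stddev, size=features.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scmc = rng.random((100, 40))      # hypothetical: 100 frames, 40 SCMCs
    xvec = rng.random(512)            # hypothetical embedding size
    fused = fuse_features(scmc, xvec, scale=0.1)
    noisy = additive_gaussian_noise(fused, stddev=0.05, training=True, rng=rng)
    print(fused.shape, noisy.shape)
```

Applying the noise only when `training` is true mirrors how noise layers in common deep learning frameworks behave: the network sees perturbed inputs while learning but clean inputs at evaluation time.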

Datasets

ASVspoof 2019 Challenge Physical Access (PA) dataset

Model(s)

Convolutional Neural Network (CNN), Time-Delay Neural Network (TDNN) for x-vector extraction

Author countries

United Kingdom