Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge

Authors: Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor

Published: 2019-10-29 16:03:04+00:00

AI Summary

This paper proposes a novel replay spoofing countermeasure for Automatic Speaker Verification (ASV) systems to combat replay attacks. The approach utilizes Constant Q Cepstral Coefficient (CQCC) features, processes them through an autoencoder to capture informative and noise-aware representations, and employs a Siamese network for classification. Experiments on the ASVspoof 2019 dataset demonstrate significant improvements in Equal Error Rate (EER) and Tandem Detection Cost Function (t-DCF) over baseline systems.

Abstract

Automatic Speaker Verification (ASV) is the process of identifying a person based on the voice presented to a system. Different synthetic approaches allow spoofing to deceive ASV systems (ASVs), whether by using techniques to imitate a voice or to reconstruct the features. Attackers try to defeat ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. The last technique is considered a common and high-potential tool for spoofing, since replay attacks are more accessible and require no technical knowledge from adversaries. In this study, we introduce a novel replay spoofing countermeasure for ASVs. Accordingly, we used Constant Q Cepstral Coefficient (CQCC) features fed into an autoencoder to attain more informative features and to take the noise information of spoofed utterances into account for discrimination. Finally, different configurations of the Siamese network were used for classification, for the first time in this context. Experiments performed on the ASVspoof 2019 challenge dataset, using Equal Error Rate (EER) and Tandem Detection Cost Function (t-DCF) as evaluation metrics, show that the proposed system improved on the baseline by 10.73% and 0.2344 in terms of EER and t-DCF, respectively.


Key findings
The proposed system achieved an EER of 0.62% and a t-DCF of 0.0110 on the ASVspoof 2019 evaluation set, outperforming the CQCC-GMM baseline (EER 11.04%, t-DCF 0.2454). The autoencoder successfully extracted noise-aware and informative features, contributing to the improved discrimination. Additionally, the system demonstrated effectiveness with reduced training data, achieving comparable performance to the baseline using only 60% of the training set.
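Since the results are reported as EER, it may help to see how that metric is typically obtained from countermeasure scores. The following is a minimal sketch, not the challenge's official scoring code: the function name `compute_eer`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(genuine_scores, spoof_scores):
    """Return (eer, threshold) at the point where the two error rates are closest.

    Assumes higher scores indicate a genuine utterance (an illustrative convention).
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # An utterance is accepted as genuine when its score is >= t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)      # false accepts
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)   # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the average of the two rates at the closest crossing point.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(genuine, spoofed)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

The t-DCF metric additionally weighs countermeasure errors by their effect on the downstream ASV system, so it cannot be derived from these scores alone.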
Approach
The system extracts Constant Q Cepstral Coefficient (CQCC) features from audio utterances. These features are then fed into an autoencoder for dimensionality reduction and to extract more informative features while considering noise. Finally, a Siamese network, composed of two identical Convolutional Neural Networks (CNNs), is used for classifying the processed features as either genuine or spoofed.
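The core idea of a Siamese network, two branches sharing the same weights and compared through a distance, can be sketched in a few lines. This is only an illustration under stated assumptions: the shared branch here is a small linear map with ReLU standing in for the CNN, the contrastive loss is a common choice for Siamese training but is not confirmed by this summary as the authors' loss, and the feature vectors are hypothetical placeholders for autoencoder outputs.

```python
import math


def embed(features, weights):
    """Shared branch: linear map followed by ReLU (a stand-in for the CNN)."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(dist, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs at least `margin` apart."""
    if same_class:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


# Both inputs pass through the SAME weights; that sharing is what makes it Siamese.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical feature vectors (placeholders for autoencoder-processed CQCC features).
genuine_a = [0.2, 0.1, 0.5]
genuine_b = [0.25, 0.1, 0.9]
spoofed = [1.5, 0.9, 0.5]

d_same = distance(embed(genuine_a, weights), embed(genuine_b, weights))
d_diff = distance(embed(genuine_a, weights), embed(spoofed, weights))

print("genuine/genuine distance:", d_same, "loss:", contrastive_loss(d_same, True))
print("genuine/spoofed distance:", d_diff, "loss:", contrastive_loss(d_diff, False))
```

In a trained system the weights would be learned so that genuine and spoofed utterances end up far apart in the embedding space; the fixed weights above only demonstrate the data flow.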
Datasets
ASVspoof 2019 (Physical Access scenario)
Model(s)
Autoencoder, Siamese Network (with Convolutional Neural Networks as its base components)
Author countries
Iran