Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge


Authors: Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor

Published: 2019-10-29 16:03:04+00:00

AI Summary

This paper presents a novel replay spoofing countermeasure for Automatic Speaker Verification (ASV) systems. It feeds Constant Q Cepstral Coefficient (CQCC) features into an autoencoder, which produces more informative features that also capture the noise characteristics of spoofed utterances. A Siamese network then classifies the result. The system achieves significant improvements over the baseline.
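
The autoencoder stage described above can be illustrated with a small pure-Python sketch. The summary does not give the autoencoder's architecture, so everything here is an assumption: a single tanh bottleneck layer, plain stochastic gradient descent on squared reconstruction error, and synthetic 6-dimensional vectors standing in for CQCC frames (no real CQCC extraction is performed). `TinyAutoencoder` and `make_frames` are hypothetical names. After training, `encode` returns the bottleneck code that would play the role of the enhanced features passed to the classifier.

```python
import math
import random

# Hypothetical sketch: architecture, sizes, and training loop are assumptions,
# not the authors' configuration.


class TinyAutoencoder:
    """One tanh bottleneck layer: h = tanh(We x + be), x_hat = Wd h + bd."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        s = 1.0 / math.sqrt(n_in)
        self.We = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_hidden)]
        self.be = [0.0] * n_hidden
        self.Wd = [[rng.uniform(-s, s) for _ in range(n_hidden)] for _ in range(n_in)]
        self.bd = [0.0] * n_in

    def encode(self, x):
        # Bottleneck code: the compressed feature vector handed to the classifier.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.We, self.be)]

    def decode(self, h):
        return [sum(w * hj for w, hj in zip(row, h)) + b
                for row, b in zip(self.Wd, self.bd)]

    def loss(self, frames):
        # Mean squared reconstruction error, averaged over frames and dimensions.
        total = 0.0
        for x in frames:
            x_hat = self.decode(self.encode(x))
            total += sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
        return total / len(frames)

    def train_step(self, x, lr):
        h = self.encode(x)
        x_hat = self.decode(h)
        err = [a - b for a, b in zip(x_hat, x)]  # dL/dx_hat for L = 0.5 * ||x_hat - x||^2
        # Backpropagate to the hidden layer using the decoder weights before updating them.
        dh = [sum(self.Wd[i][j] * err[i] for i in range(len(err))) for j in range(len(h))]
        dz = [g * (1.0 - hj * hj) for g, hj in zip(dh, h)]  # tanh'(z) = 1 - tanh(z)^2
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.Wd[i][j] -= lr * e * hj
            self.bd[i] -= lr * e
        for j, g in enumerate(dz):
            for k, xk in enumerate(x):
                self.We[j][k] -= lr * g * xk
            self.be[j] -= lr * g


def make_frames(n, dim=6, latent=2, seed=1):
    # Synthetic stand-in for CQCC frames: dim-D vectors driven by a few latent factors plus noise.
    rng = random.Random(seed)
    mix = [[rng.gauss(0, 1) for _ in range(latent)] for _ in range(dim)]
    frames = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(latent)]
        frames.append([sum(m * zj for m, zj in zip(row, z)) + rng.gauss(0, 0.05) for row in mix])
    return frames


frames = make_frames(200)
ae = TinyAutoencoder(n_in=6, n_hidden=2)
before = ae.loss(frames)
for _ in range(50):
    for x in frames:
        ae.train_step(x, lr=0.01)
after = ae.loss(frames)
code = ae.encode(frames[0])  # 2-D feature vector for the first frame
print(f"reconstruction MSE: {before:.4f} -> {after:.4f}")
```

A practical system would use a deep-learning framework, real CQCC frames, and a larger network. This version only shows the data flow: encode, decode, and backpropagate the reconstruction error.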

Abstract

Automatic Speaker Verification (ASV) is the process of verifying a person's identity from the voice presented to a system. Various spoofing approaches can deceive ASV systems (ASVs), either by imitating a voice or by reconstructing its features. Attackers try to defeat ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. Replay is considered a common and highly effective spoofing tool, since replay attacks are easily accessible and require no technical knowledge from adversaries. In this study, we introduce a novel replay spoofing countermeasure for ASVs. We feed Constant Q Cepstral Coefficient (CQCC) features into an autoencoder to obtain more informative features and to exploit the noise information of spoofed utterances for discrimination. Finally, we use different configurations of a Siamese network for classification, the first use of this architecture in this context. Experiments on the ASVspoof 2019 challenge dataset, using Equal Error Rate (EER) and Tandem Detection Cost Function (t-DCF) as evaluation metrics, show that the proposed system improves on the baseline by 10.73% and 0.2344 in terms of EER and t-DCF, respectively.
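
EER is the error rate at the decision threshold where the false acceptance rate (spoofed trials accepted) equals the false rejection rate (genuine trials rejected). The sketch below estimates it by sweeping candidate thresholds over the observed scores. It assumes higher scores mean "more likely genuine" and uses made-up toy scores, not results from the paper. `compute_eer` is a hypothetical helper; with finite score sets it returns the mean of FAR and FRR at the threshold where the two are closest. t-DCF is not reproduced here because it also depends on the ASV system's own error rates and on cost and prior parameters.

```python
# Hypothetical helper; assumes higher scores mean "more likely genuine (bona fide)".
def compute_eer(genuine_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Returns (eer, threshold): the mean of FAR and FRR at the threshold
    where the two rates are closest.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]


# Toy scores for illustration only (not from the paper).
genuine = [0.9, 0.8, 0.7, 0.6, 0.4]
spoof = [0.1, 0.2, 0.3, 0.5, 0.65]
eer, threshold = compute_eer(genuine, spoof)
print(f"EER = {eer:.1%} at threshold {threshold}")  # prints: EER = 20.0% at threshold 0.6
```

At the threshold 0.6, one genuine trial (0.4) is rejected and one spoofed trial (0.65) is accepted, so both rates equal 1/5.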

Key findings

The proposed system significantly outperforms the baseline, improving the Equal Error Rate (EER) by 10.73% and the Tandem Detection Cost Function (t-DCF) by 0.2344. Combining an autoencoder for feature extraction with a Siamese network for classification proves effective for replay spoofing detection. The system also remains robust when trained on only 60% of the training data.

Approach

The approach uses CQCC features as input to an autoencoder for feature extraction, aiming to capture the noise information that is crucial for discriminating between spoofed and genuine utterances. These enhanced features are then fed into a Siamese network, which had not previously been applied in this context, to classify speech as spoofed or genuine.
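
The summary states only that a Siamese network with CNN subnetworks performs the classification. It does not specify the training loss or the decision rule. The sketch below therefore assumes the common contrastive-loss formulation. Both inputs of a pair pass through the same shared-weight subnetwork, here a linear projection standing in for the CNN. Same-class pairs are pulled together, and genuine/spoof pairs are pushed at least `margin` apart. At test time, an utterance is labelled by comparing its mean embedding distance to genuine and to spoofed reference utterances. All function names, the margin, and the distance-to-references rule are hypothetical.

```python
import math

# Hypothetical sketch: the loss, margin, and decision rule are assumptions;
# a linear projection stands in for the CNN subnetworks.


def embed(x, weights):
    # Shared branch: the SAME weights embed both members of every pair.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def distance(a, b):
    # Euclidean distance between two embeddings.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def contrastive_loss(x1, x2, same_class, weights, margin=1.0):
    """Same-class pairs (genuine/genuine or spoof/spoof) are pulled together;
    genuine/spoof pairs are penalised only if closer than `margin`."""
    d = distance(embed(x1, weights), embed(x2, weights))
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def bona_fide_score(x, genuine_refs, spoof_refs, weights):
    # Mean distance to spoofed references minus mean distance to genuine ones:
    # larger values mean the utterance looks more genuine.
    e = embed(x, weights)
    d_gen = sum(distance(e, embed(r, weights)) for r in genuine_refs) / len(genuine_refs)
    d_spf = sum(distance(e, embed(r, weights)) for r in spoof_refs) / len(spoof_refs)
    return d_spf - d_gen


def classify(x, genuine_refs, spoof_refs, weights):
    return "genuine" if bona_fide_score(x, genuine_refs, spoof_refs, weights) > 0 else "spoof"


# Toy 3-D "features" projected to 2-D; values are illustrative only.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
genuine_refs = [[1.0, 1.0, 0.0], [1.2, 0.9, 0.0]]
spoof_refs = [[-1.0, -1.0, 0.0]]
label = classify([0.9, 1.1, 7.0], genuine_refs, spoof_refs, W)
print(label)
```

Because `bona_fide_score` follows the higher-is-genuine convention, its outputs could be used directly as countermeasure scores for EER evaluation. Training would minimise `contrastive_loss` over many sampled pairs with respect to the shared weights, typically using automatic differentiation in a deep-learning framework.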

Datasets

ASVspoof 2019 challenge dataset (Physical Access scenario)

Model(s)

Autoencoder, Siamese network (with CNNs as subnetworks), CQCC feature extractor

Author countries

Iran
