The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Authors: Weicheng Cai, Haiwei Wu, Danwei Cai, Ming Li

Published: 2019-07-05 03:00:05+00:00

Comment: Accepted for INTERSPEECH 2019

AI Summary

This paper details the DKU replay detection system for the ASVspoof 2019 challenge, focusing on developing spoofing countermeasures for automatic speaker recognition. The system leverages an utterance-level deep learning framework, incorporating data augmentation, various feature representations, residual neural network classification, and score-level fusion. Their best single system is a residual neural network trained on speed-perturbed group delay grams, and performance improves significantly when multiple systems are fused.

Abstract

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasures for automatic speaker recognition in the physical access scenario. We improve the countermeasure system pipeline in four aspects: data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives a variable-length feature sequence and directly outputs utterance-level scores. Within this framework, we try out various input feature representations extracted from either the magnitude spectrum or the phase spectrum. In addition, we apply speed perturbation to the raw waveform as a data augmentation strategy. Our best single system employs a residual neural network trained on the speed-perturbed group delay gram. It achieves an EER of 1.04% on the development set and 1.08% on the evaluation set. Finally, taking the simple average score of several single systems further improves performance: our primary system obtains EERs of 0.24% on the development set and 0.66% on the evaluation set.
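The group delay gram mentioned in the abstract is a phase-derived feature: per frame, the negative derivative of the unwrapped phase spectrum, commonly computed via the identity tau(k) = (X_R Y_R + X_I Y_I) / |X(k)|^2, where Y is the DFT of n*x(n). The sketch below illustrates that computation; the frame length, hop, and Hamming window are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def group_delay_gram(signal, frame_len=512, hop=256, n_fft=512):
    """Per-frame group delay via tau(k) = (X_R*Y_R + X_I*Y_I) / |X(k)|^2,
    where X = DFT(x) and Y = DFT(n * x(n)).  Framing parameters are
    illustrative assumptions."""
    window = np.hamming(frame_len)
    n = np.arange(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        x = signal[start:start + frame_len] * window
        X = np.fft.rfft(x, n_fft)
        Y = np.fft.rfft(n * x, n_fft)
        denom = np.abs(X) ** 2 + 1e-10   # guard against division by zero
        tau = (X.real * Y.real + X.imag * Y.imag) / denom
        frames.append(tau)
    return np.stack(frames)              # shape: (num_frames, n_fft // 2 + 1)
```

Because the framework accepts variable-length feature sequences, the number of frames can differ per utterance without padding.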


Key findings
The best single system, utilizing a ResNet trained with speed-perturbed group delay gram, achieved EERs of 1.04% on the development set and 1.08% on the evaluation set. Fusion of several single systems further improved performance, yielding an EER of 0.24% on the development set and 0.66% on the evaluation set. This demonstrates the effectiveness of group delay gram, data augmentation, and system fusion for replay detection.
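The fusion step reported above is a simple average of the scores from the individual systems. A minimal sketch, assuming the per-system scores are already on comparable scales (the paper does not describe any score normalization here):

```python
import numpy as np

def fuse_scores(score_lists):
    """Simple average score-level fusion: stack the per-system score
    vectors (one score per trial) and average across systems."""
    return np.mean(np.stack(score_lists), axis=0)

# e.g. fuse_scores([scores_sys1, scores_sys2, scores_sys3])
```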
Approach
The authors developed an utterance-level deep learning framework using a deep convolutional neural network (ResNet) to directly output utterance-level scores. They explored diverse input feature representations derived from magnitude and phase spectra, applied speed perturbation for data augmentation, and finally employed simple average score-level fusion of multiple trained systems to boost overall performance.
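Speed perturbation of the raw waveform, as used here for augmentation, is conventionally done by resampling the signal by a small factor, which changes duration and pitch together (the sox-style "speed" effect). A minimal sketch using linear interpolation; the paper's exact resampler and perturbation factors are assumptions, with 0.9 and 1.1 being the values commonly used in speech recipes:

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample `wave` by `factor` via linear interpolation.
    factor > 1 speeds the audio up (shorter output), factor < 1
    slows it down; pitch shifts along with tempo."""
    old_idx = np.arange(len(wave))
    new_len = int(round(len(wave) / factor))
    new_idx = np.arange(new_len) * factor
    return np.interp(new_idx, old_idx, wave)

# Applying factors 0.9 and 1.1 alongside the original triples the data.
```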
Datasets
ASVspoof 2019 challenge (training set, development set, evaluation set)
Model(s)
Residual Neural Network (ResNet-34 backbone), GMM (for baseline comparison)
Author countries
China