Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Authors: Yongqiang Dou, Haocheng Yang, Maolin Yang, Yanyan Xu, Dengfeng Ke

Published: 2020-06-25 17:06:47+00:00

Comment: The 25th International Conference on Pattern Recognition (ICPR2020)

AI Summary

This paper proposes D3M, a novel method for replay attack detection that addresses the data discrepancy between training and inference by introducing a balanced focal loss function. This loss dynamically scales each sample's contribution during training, prioritizing indistinguishable samples over easily-classified ones. The approach also fuses complementary magnitude-based (STFT-gram, CQT-gram) and phase-based (MGD-gram) features, demonstrating superior performance on the ASVspoof2019 dataset.
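For context, a balanced focal loss of this kind typically follows the α-balanced focal loss of Lin et al.; the exact form used in D3M may differ, so the expression below is a reference sketch rather than the paper's own definition.

```latex
% Alpha-balanced focal loss (reference form; the paper's exact variant may differ).
% p_t     : model's predicted probability for the ground-truth class
% alpha_t : class-balancing weight (e.g., compensating the bonafide/spoofed imbalance)
% gamma   : focusing parameter; gamma = 0 recovers weighted cross-entropy
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma} \log(p_t)
```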

Abstract

It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while a lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, indistinguishable samples deserve more attention than easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, which leverages a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative representations. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods in comparison with top-performing systems. Systems trained with the balanced focal loss perform significantly better than those trained with conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% in min-tDCF and 7% in EER, achieving a min-tDCF of 0.0124 and an EER of 0.55%. Furthermore, we present and discuss evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research on anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at https://github.com/asvspoof/D3M.
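As a rough illustration of the magnitude-based features mentioned above, the sketch below computes log-power STFT-gram and CQT-gram representations with librosa; the function name, sample rate, and frame parameters are assumptions for illustration, and the phase-based MGD-gram is omitted since it requires a separate modified group delay computation.

```python
# Sketch: magnitude-based "gram" features (log-power STFT and CQT) with librosa.
# Parameter values are illustrative assumptions, not the paper's settings;
# the phase-based MGD-gram is not shown here.
import numpy as np
import librosa


def magnitude_grams(wav_path, sr=16000, n_fft=512, hop_length=256):
    """Return (stft_gram, cqt_gram) as log-power time-frequency arrays."""
    y, _ = librosa.load(wav_path, sr=sr)

    # STFT-gram: log power spectrogram
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    stft_gram = librosa.power_to_db(np.abs(stft) ** 2)

    # CQT-gram: log power constant-Q transform
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length)
    cqt_gram = librosa.power_to_db(np.abs(cqt) ** 2)

    return stft_gram, cqt_gram
```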


Key findings
Systems trained with the balanced focal loss significantly outperform those using conventional cross-entropy loss. The proposed fusion system, leveraging three complementary features, achieves state-of-the-art performance on the ASVspoof2019 evaluation set with a min-tDCF of 0.0124 and an EER of 0.55%, surpassing more complex top-performing systems. However, the deep learning-based methods exhibit unexpected performance degradation on the real-world Real-PA dataset relative to their results on the simulated data, falling behind conventional GMMs there.
Approach
The approach, D3M, mitigates the data discrepancy in replay attack detection by using a balanced focal loss function. This loss dynamically re-weights samples during training, focusing more on hard-to-classify, indistinguishable samples. The system employs ResNet-based end-to-end models operating on three complementary features: the Modified Group Delay (MGD) gram, the Short-Time Fourier Transform (STFT) gram, and the Constant Q Transform (CQT) gram, which are combined in the final fusion system.
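A minimal PyTorch sketch of how such a balanced focal loss could serve as the training objective is given below; the class name, the α/γ defaults, and the binary bonafide-vs-spoofed framing are assumptions for illustration rather than the paper's exact implementation.

```python
# Sketch: an alpha-balanced focal loss for binary bonafide/spoofed
# classification, intended as a drop-in replacement for cross-entropy.
# Class name and default hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BalancedFocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # class-balancing weight for the positive class
        self.gamma = gamma  # focusing parameter; larger -> easy samples count less

    def forward(self, logits, targets):
        # logits: (batch,) raw scores; targets: (batch,) labels in {0, 1}
        t = targets.float()
        ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * t + (1 - p) * (1 - t)              # probability of the true class
        alpha_t = self.alpha * t + (1 - self.alpha) * (1 - t)
        # (1 - p_t)^gamma down-weights easy samples, so hard,
        # indistinguishable ones dominate the gradient.
        return (alpha_t * (1 - p_t) ** self.gamma * ce).mean()


# Hypothetical usage with a ResNet-style spoofing detector:
# criterion = BalancedFocalLoss(alpha=0.25, gamma=2.0)
# loss = criterion(model(features).squeeze(-1), labels)
```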
Datasets
ASVspoof2019 (PA train set, PA dev set, PA eval set), Real-PA dataset
Model(s)
ResNet
Author countries
China