Investigating self-supervised front ends for speech spoofing countermeasures

Authors: Xin Wang, Junichi Yamagishi

Published: 2021-11-15 12:52:50+00:00

Comment: V3: added sub-band analysis, submitted to ISCA Odyssey2022; V2: added min tDCF results on 2019 and 2021 LA. EERs on LA 2021 were slightly updated to fix one glitch in the score file. EERs and min tDCFs on 2021 LA and DF can be computed using the latest official code https://github.com/asvspoof-challenge/2021. Work in progress. Feedback is welcome!

AI Summary

This paper investigates using pre-trained self-supervised speech models as front ends for speech spoofing countermeasures (CMs). It explores different back-end architectures, the benefits of fine-tuning the front end, and the performance of various self-supervised models. The study demonstrates that fine-tuning a well-chosen pre-trained self-supervised front end significantly improves spoofing detection generalizability across diverse ASVspoof datasets.

Abstract

Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigated different back-end architectures to be combined with the self-supervised front end, the effectiveness of fine-tuning the front end, and the performance of different pre-trained self-supervised models. Our findings show that, when a good pre-trained front end was fine-tuned with either a shallow or a deep neural-network-based back end on the ASVspoof 2019 logical access (LA) training set, the resulting CM not only achieved a low EER on the 2019 LA test set but also significantly outperformed the baseline on the ASVspoof 2015, 2021 LA, and 2021 deepfake test sets. A sub-band analysis further demonstrated that the CM mainly used information in a specific frequency band to discriminate the bona fide and spoofed trials across the test sets.


Key findings
Fine-tuning a self-supervised front end, especially one pre-trained on diverse speech corpora, significantly reduced Equal Error Rates (EERs) across all ASVspoof test sets, outperforming the LFCC-based baseline. The choice of back-end architecture became less critical when the front end was fine-tuned. A sub-band analysis revealed that self-supervised CMs primarily relied on the 0.1-2.4 kHz frequency band for discrimination, offering better generalization than the baseline, which focused on high-frequency information.
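Since the findings are reported as EERs, a minimal sketch of how an Equal Error Rate is computed from detection scores may be helpful. This is a generic threshold-sweep implementation, not the official ASVspoof evaluation code (which also computes min t-DCF); the function name and score conventions (higher score = more bona fide-like) are illustrative assumptions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate: the operating point where the
    false-acceptance rate (spoofed trials accepted) equals the
    false-rejection rate (bona fide trials rejected).

    Assumes higher scores indicate bona fide speech. This is an
    illustrative sketch, not the official ASVspoof 2021 scoring tool.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance: spoofed trials scoring at or above threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scoring below threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For perfectly separated scores the EER is 0; with fully overlapping score distributions it approaches 0.5 (chance level).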
Approach
The authors propose a countermeasure architecture where pre-trained self-supervised speech models (e.g., Wav2vec 2.0, HuBERT) serve as the front end for extracting acoustic features. These features are then fed into different back-end architectures, ranging from deep (LCNN-LSTM-GAP-FC) to shallow (FC-GAP). The study investigates the impact of fine-tuning the self-supervised front end during training versus keeping it fixed.
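To make the shallow back end concrete, here is a minimal sketch of global average pooling followed by a fully connected layer, operating on a sequence of front-end feature vectors. It is written in pure Python for illustration; the actual CMs use neural-network frameworks and learned weights, and the function names here are hypothetical.

```python
def global_average_pool(features):
    """Average a (T x D) sequence of per-frame feature vectors over time,
    producing a single utterance-level D-dimensional embedding.
    `features` stands in for the output of a self-supervised front end
    such as Wav2vec 2.0 (illustrative, not the paper's implementation)."""
    num_frames, dim = len(features), len(features[0])
    return [sum(frame[d] for frame in features) / num_frames
            for d in range(dim)]

def fully_connected(x, weights, bias):
    """Single FC layer: y = W x + b. In the shallow back end this maps
    the pooled embedding to class scores (bona fide vs. spoofed)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

The key design point reflected here is that pooling over time removes the dependence on utterance length, so the same back end handles variable-duration trials; the deep LCNN-LSTM variant inserts convolutional and recurrent layers before the same pooling step.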
Datasets
ASVspoof 2019 logical access (LA) training set, ASVspoof 2019 LA test set, ASVspoof 2015 test set, ASVspoof 2021 LA test set, ASVspoof 2021 deepfake (DF) test set
Model(s)
Wav2vec 2.0 (W2V-XLSR, W2V-Large2, W2V-Large1, W2V-Small), HuBERT (HuBERT-XL), LCNN, Bi-LSTM, Global Average Pooling (GAP), Fully Connected (FC) layers, LFCC (Linear Frequency Cepstral Coefficients - for baseline).
Author countries
Japan