Investigating self-supervised front ends for speech spoofing countermeasures

Authors: Xin Wang, Junichi Yamagishi

Published: 2021-11-15 12:52:50+00:00

Comment: V3: added sub-band analysis, submitted to ISCA Odyssey2022; V2: added min tDCF results on 2019 and 2021 LA. EERs on LA 2021 were slightly updated to fix one glitch in the score file. EERs and min tDCFs on 2021 LA and DF can be computed using the latest official code https://github.com/asvspoof-challenge/2021. Work in progress. Feedback is welcome!

AI Summary

This paper investigates using pre-trained self-supervised speech models as front ends for speech spoofing countermeasures (CMs). It explores different back-end architectures, the benefits of fine-tuning the front end, and the performance of various self-supervised models. The study demonstrates that fine-tuning a well-chosen pre-trained self-supervised front end significantly improves spoofing detection generalizability across diverse ASVspoof datasets.

Abstract

Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigated different back-end architectures to be combined with the self-supervised front end, the effectiveness of fine-tuning the front end, and the performance of different pre-trained self-supervised models. Our findings show that, when a good pre-trained front end was fine-tuned with either a shallow or a deep neural-network-based back end on the ASVspoof 2019 logical access (LA) training set, the resulting CM not only achieved a low EER on the 2019 LA test set but also significantly outperformed the baseline on the ASVspoof 2015, 2021 LA, and 2021 deepfake test sets. A sub-band analysis further demonstrated that the CM mainly used information in a specific frequency band to discriminate the bona fide and spoofed trials across the test sets.


Key findings
Fine-tuning a self-supervised front end, especially one pre-trained on diverse speech corpora, significantly reduced Equal Error Rates (EERs) across all ASVspoof test sets, outperforming the LFCC-based baseline. The choice of back-end architecture became less critical when the front end was fine-tuned. A sub-band analysis revealed that self-supervised CMs primarily relied on the 0.1-2.4 kHz frequency band for discrimination, offering better generalization than the baseline, which focused on high-frequency information.
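Since the findings are reported as EERs, a minimal sketch of how an Equal Error Rate is computed from detection scores may be helpful. This is a generic threshold-sweep implementation, not the official ASVspoof evaluation code (which also computes min t-DCF); the function name and score conventions (higher score = more bona fide-like) are illustrative assumptions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate: the operating point where the
    false-acceptance rate (spoofed trials accepted) equals the
    false-rejection rate (bona fide trials rejected).

    Assumes higher scores indicate bona fide speech. This is an
    illustrative sketch, not the official ASVspoof 2021 scoring tool.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance: spoofed trials scoring at or above threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scoring below threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For perfectly separated scores the EER is 0; with fully overlapping score distributions it approaches 0.5 (chance level).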
Approach
The authors propose a countermeasure architecture where pre-trained self-supervised speech models (e.g., Wav2vec 2.0, HuBERT) serve as the front end for extracting acoustic features. These features are then fed into different back-end architectures, ranging from deep (LCNN-LSTM-GAP-FC) to shallow (FC-GAP). The study investigates the impact of fine-tuning the self-supervised front end during training versus keeping it fixed.
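To make the shallow back end concrete, here is a minimal sketch of global average pooling followed by a fully connected layer, operating on a sequence of front-end feature vectors. It is written in pure Python for illustration; the actual CMs use neural-network frameworks and learned weights, and the function names here are hypothetical.

```python
def global_average_pool(features):
    """Average a (T x D) sequence of per-frame feature vectors over time,
    producing a single utterance-level D-dimensional embedding.
    `features` stands in for the output of a self-supervised front end
    such as Wav2vec 2.0 (illustrative, not the paper's implementation)."""
    num_frames, dim = len(features), len(features[0])
    return [sum(frame[d] for frame in features) / num_frames
            for d in range(dim)]

def fully_connected(x, weights, bias):
    """Single FC layer: y = W x + b. In the shallow back end this maps
    the pooled embedding to class scores (bona fide vs. spoofed)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

The key design point reflected here is that pooling over time removes the dependence on utterance length, so the same back end handles variable-duration trials; the deep LCNN-LSTM variant inserts convolutional and recurrent layers before the same pooling step.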
Datasets
ASVspoof 2019 logical access (LA) training set, ASVspoof 2019 LA test set, ASVspoof 2015 test set, ASVspoof 2021 LA test set, ASVspoof 2021 deepfake (DF) test set
Model(s)
Wav2vec 2.0 (W2V-XLSR, W2V-Large2, W2V-Large1, W2V-Small), HuBERT (HuBERT-XL), LCNN, Bi-LSTM, Global Average Pooling (GAP), Fully Connected (FC) layers, LFCC (Linear Frequency Cepstral Coefficients - for baseline).
Author countries
Japan