Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks

Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

Published: 2020-09-21 06:38:19+00:00

Comment: Accepted for publication in Interspeech 2020

AI Summary

This paper introduces a novel feature genuinization method for detecting synthetic speech attacks, addressing the challenge of unseen attack types that degrade existing countermeasure performance. The approach leverages the consistent distribution of genuine speech by training a CNN-based transformer using only genuine speech characteristics. This genuinization transformer, combined with a light CNN classifier, effectively amplifies the discriminative features between genuine and synthetic speech.

Abstract

Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural-sounding speech that calls into question the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during training. However, their performance degrades in the face of unseen types of attack. In comparison to the synthetic speech created by a wide range of TTS and VC methods, genuine speech has a more consistent distribution. We believe that the difference between the distributions of synthetic and genuine speech is an important discriminative feature between the two classes. In this regard, we propose a novel method, referred to as feature genuinization, that learns a transformer with a convolutional neural network (CNN) using the characteristics of only genuine speech. We then use this genuinization transformer with a light CNN (LCNN) classifier. The ASVspoof 2019 logical access corpus is used to evaluate the proposed method. The studies show that the proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, demonstrating its effectiveness for detection of synthetic speech attacks.


Key findings

The proposed feature genuinization based LCNN (FG-LCNN) system significantly outperforms both the baseline LCNN and the ASVspoof 2019 challenge baselines, especially on the evaluation set, which contains attack types unseen during training. A contrast experiment confirmed that learning from genuine speech characteristics is more effective than learning from spoofed speech. FG-LCNN also outperformed other reported single-system countermeasures on the ASVspoof 2019 logical access corpus.

Approach

The proposed method, feature genuinization, trains a convolutional neural network (CNN) to act as a feature transformer, using only genuine speech features. At inference, this transformer takes the input speech features (the log power spectrum computed from the constant-Q transform, CQT) and maps them so that the distinction between genuine and spoofed speech is enhanced. The transformed features are then fed into a Light Convolutional Neural Network (LCNN) for classification.
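
The idea behind training on genuine speech only can be illustrated with a toy sketch. It is not the authors' implementation: a low-rank linear projection fitted with NumPy stands in for the CNN transformer, the "features" are random vectors rather than CQT log power spectra, and the reconstruction-style fit is an assumption, since this summary does not state the training objective. All function names are hypothetical. The sketch only shows that a mapping fitted on genuine data tends to leave held-out genuine inputs nearly unchanged while altering out-of-distribution inputs much more.

```python
import numpy as np


def make_toy_features(seed=0, n=300, dim=32, rank=4):
    """Toy data (assumption): genuine vectors lie near a low-rank subspace,
    spoofed vectors are spread over the whole feature space."""
    rng = np.random.default_rng(seed)
    mixing = rng.normal(size=(rank, dim))
    genuine = rng.normal(size=(n, rank)) @ mixing
    genuine += 0.05 * rng.normal(size=(n, dim))
    spoofed = rng.normal(size=(n, dim))
    return genuine, spoofed


def fit_genuinizer(genuine_train, rank=4):
    """Fit a linear stand-in for the transformer using genuine data only."""
    mean = genuine_train.mean(axis=0)
    # Rows of vt are the principal directions of the centered genuine data.
    _, _, vt = np.linalg.svd(genuine_train - mean, full_matrices=False)
    return mean, vt[:rank]


def genuinize(features, mean, basis):
    """Project features onto the subspace learned from genuine data."""
    return (features - mean) @ basis.T @ basis + mean


def mean_change(features, mean, basis):
    """Mean squared difference between transformed and original features."""
    return float(np.mean((genuinize(features, mean, basis) - features) ** 2))


genuine, spoofed = make_toy_features()
mean, basis = fit_genuinizer(genuine[:200])               # genuine data only
genuine_change = mean_change(genuine[200:], mean, basis)  # held-out genuine
spoofed_change = mean_change(spoofed, mean, basis)
print(f"genuine: {genuine_change:.4f}  spoofed: {spoofed_change:.4f}")
```

In the described system the transformed features go to the LCNN classifier rather than being scored by this difference directly; the sketch only motivates why such a transformation could make the two classes easier to separate.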

Datasets

ASVspoof 2019 logical access (LA) corpus, whose genuine speech is drawn from VCTK

Model(s)

Convolutional Neural Network (CNN) for the genuinization transformer, Light Convolutional Neural Network (LCNN) for the classifier
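
This summary does not describe the internals of the LCNN. In the Light CNN literature, the characteristic component is the Max-Feature-Map (MFM) activation, which splits the channels into two halves and keeps the element-wise maximum. The NumPy sketch below shows that operation in isolation; the function name and the channel-first layout are assumptions, and whether the authors' classifier uses exactly this variant is not stated here.

```python
import numpy as np


def max_feature_map(x, axis=1):
    """Max-Feature-Map: split `axis` into two halves and take the
    element-wise maximum, halving the number of channels."""
    if x.shape[axis] % 2 != 0:
        raise ValueError("channel dimension must be even for MFM")
    first, second = np.split(x, 2, axis=axis)
    return np.maximum(first, second)


# Toy activation map: batch of 1, 4 channels, 2 frequency bins (assumed layout).
x = np.arange(8).reshape(1, 4, 2)
y = max_feature_map(x)
print(y.shape)
```

Because MFM halves the channel count instead of thresholding, it works as a competitive feature selector, which is one reason the resulting networks are described as "light".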

Author countries

Singapore, China