Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection

Authors: Junxiao Xue, Hao Zhou, Yabo Wang

Published: 2021-09-01 03:32:22+00:00

AI Summary

This paper introduces a novel physiological-physical feature fusion method for automatic voice spoofing detection. The approach extracts physiological features from speech using a pre-trained convolutional neural network and physical features using SE-DenseNet or SE-Res2Net, then integrates them for classification. Experiments on the ASVspoof 2019 dataset demonstrate the model's effectiveness, showing significant improvements in tandem decision cost function (t-DCF) and equal error rate (EER) across both logical and physical access scenarios.

Abstract

Speaker verification systems have been used in many production scenarios in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks, such as voice conversion and speech synthesis. In this paper, we propose a new method based on physiological-physical feature fusion to deal with voice spoofing attacks. This method involves feature extraction, a densely connected convolutional neural network with squeeze-and-excitation blocks (SE-DenseNet), a multi-scale residual neural network with squeeze-and-excitation blocks (SE-Res2Net), and feature fusion strategies. We first pre-train a convolutional neural network using the speaker's voice and face in video as supervision signals; this network can then extract physiological features from speech. Next, we use SE-DenseNet and SE-Res2Net to extract physical features. The dense connection pattern has high parameter efficiency, and the squeeze-and-excitation block enhances feature transmission. Finally, we feed the two fused features into SE-DenseNet to identify spoofing attacks. Experimental results on the ASVspoof 2019 dataset show that our model is effective for voice spoofing detection. In the logical access scenario, our model improves the tandem decision cost function (t-DCF) and equal error rate (EER) scores by 4% and 7%, respectively, compared with other methods. In the physical access scenario, it improves t-DCF and EER scores by 8% and 10%, respectively.


Key findings
The model achieved superior performance on the ASVspoof 2019 challenge. In the logical access scenario, it improved t-DCF by 28% and EER by 11% compared to state-of-the-art methods. For the physical access scenario, the model showed improvements of 8% in t-DCF and 10% in EER.
Approach
The proposed method fuses physiological and physical features extracted from speech. Physiological features (a 4096-D vector of 'face features') are derived from speech spectrograms using a voice encoder pre-trained with the speaker's voice and face from video as supervision. Physical features are extracted from speech using SE-DenseNet with LFCC inputs for logical access and SE-Res2Net with CQT inputs for physical access. The two feature sets are then fused, via concatenation for logical access and weighted averaging for physical access, and passed to a classification network.
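The two fusion strategies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the weight `w`, and the assumption that both vectors share the same dimension in the physical-access case are all assumptions for the example; the paper only specifies "concatenation" (LA) and "weighted averaging" (PA).

```python
import numpy as np

def fuse_features(physio, phys, scenario="LA", w=0.5):
    """Sketch of the paper's two feature-fusion strategies.

    physio   : physiological feature vector (e.g. the 4096-D 'face
               features' from the pre-trained voice encoder)
    phys     : physical feature vector (SE-DenseNet/SE-Res2Net output)
    scenario : "LA" (logical access) or "PA" (physical access)
    w        : averaging weight for PA (illustrative choice, not
               taken from the paper)
    """
    physio = np.asarray(physio, dtype=float)
    phys = np.asarray(phys, dtype=float)
    if scenario == "LA":
        # Logical access: concatenate the two feature vectors.
        return np.concatenate([physio, phys])
    # Physical access: element-wise weighted average
    # (assumes both vectors have the same dimensionality).
    return w * physio + (1.0 - w) * phys

la = fuse_features(np.ones(4), np.zeros(4), scenario="LA")   # 8-D vector
pa = fuse_features(np.ones(4), np.zeros(4), scenario="PA")   # 4-D vector
```

The fused vector would then be fed to the classification network (a DenseNet variant in the paper).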
Datasets
ASVspoof 2019, VCTK basic corpus, AVSpeech datasets
Model(s)
Voice Encoder (CNN), SE-DenseNet, SE-Res2Net, DenseNet (variant used in the classification module), VGG-Face model (provides the supervision signal for pre-training)
Author countries
China