Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection

Authors: Junxiao Xue, Hao Zhou, Yabo Wang

Published: 2021-09-01 03:32:22+00:00

AI Summary

This paper introduces a novel physiological-physical feature fusion method for automatic voice spoofing detection. The approach extracts physiological features from speech using a pre-trained convolutional neural network and physical features using SE-DenseNet or SE-Res2Net, then integrates them for classification. Experiments on the ASVspoof 2019 dataset demonstrate the model's effectiveness, showing significant improvements in tandem decision cost function (t-DCF) and equal error rate (EER) across both logical and physical access scenarios.

Abstract

Speaker verification systems have been used in many production scenarios in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks, such as voice conversion and speech synthesis. In this paper, we propose a new method based on physiological-physical feature fusion to deal with voice spoofing attacks. This method involves feature extraction, a densely connected convolutional neural network with squeeze-and-excitation blocks (SE-DenseNet), a multi-scale residual neural network with squeeze-and-excitation blocks (SE-Res2Net), and feature fusion strategies. We first pre-train a convolutional neural network using the speaker's voice and face in video as supervision signals; this network can then extract physiological features from speech. Next, we use SE-DenseNet and SE-Res2Net to extract physical features. The dense connection pattern has high parameter efficiency, and the squeeze-and-excitation block enhances feature transmission. Finally, we feed the two fused features into SE-DenseNet to identify spoofing attacks. Experimental results on the ASVspoof 2019 dataset show that our model is effective for voice spoofing detection. In the logical access scenario, our model improves the tandem decision cost function (t-DCF) and equal error rate (EER) scores by 4% and 7%, respectively, compared with other methods. In the physical access scenario, it improves t-DCF and EER scores by 8% and 10%, respectively.


Key findings
The model achieved superior performance on the ASVspoof 2019 challenge. In the logical access scenario, it improved t-DCF by 28% and EER by 11% compared to state-of-the-art methods. For the physical access scenario, the model showed improvements of 8% in t-DCF and 10% in EER.
Approach
The proposed method fuses physiological and physical features extracted from speech. Physiological features (a 4096-D vector of 'face features') are derived from speech spectrograms using a voice encoder pre-trained with the speaker's voice and face from video as supervision. Physical features are extracted from speech using SE-DenseNet with LFCC inputs for logical access and SE-Res2Net with CQT inputs for physical access. The two feature sets are then fused, via concatenation for logical access and weighted averaging for physical access, and passed to a classification network.
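The two fusion strategies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the weight `w`, and the assumption that both vectors share the same dimension in the physical-access case are all assumptions for the example; the paper only specifies "concatenation" (LA) and "weighted averaging" (PA).

```python
import numpy as np

def fuse_features(physio, phys, scenario="LA", w=0.5):
    """Sketch of the paper's two feature-fusion strategies.

    physio   : physiological feature vector (e.g. the 4096-D 'face
               features' from the pre-trained voice encoder)
    phys     : physical feature vector (SE-DenseNet/SE-Res2Net output)
    scenario : "LA" (logical access) or "PA" (physical access)
    w        : averaging weight for PA (illustrative choice, not
               taken from the paper)
    """
    physio = np.asarray(physio, dtype=float)
    phys = np.asarray(phys, dtype=float)
    if scenario == "LA":
        # Logical access: concatenate the two feature vectors.
        return np.concatenate([physio, phys])
    # Physical access: element-wise weighted average
    # (assumes both vectors have the same dimensionality).
    return w * physio + (1.0 - w) * phys

la = fuse_features(np.ones(4), np.zeros(4), scenario="LA")   # 8-D vector
pa = fuse_features(np.ones(4), np.zeros(4), scenario="PA")   # 4-D vector
```

The fused vector would then be fed to the classification network (a DenseNet variant in the paper).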
Datasets
ASVspoof 2019, VCTK basic corpus, AVSpeech datasets
Model(s)
Voice Encoder (CNN), SE-DenseNet, SE-Res2Net, DenseNet (variant used in the classification module), VGG-Face model (provides the supervision signal for pre-training)
Author countries
China