ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak

Published: 2019-04-01 21:47:00+00:00

Comment: Submitted to Interspeech 2019, Graz, Austria

AI Summary

This paper presents ASSERT, JHU's system submission to the ASVspoof 2019 Challenge, designed for anti-spoofing against text-to-speech, voice conversion, and replay attacks. ASSERT is a deep neural network-based pipeline comprising feature engineering, DNN models (variants of squeeze-excitation and residual networks), network optimization, and system combination. The system obtained more than 93% and 17% relative improvements over the baseline systems in the two sub-challenges of ASVspoof 2019.

Abstract

We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT). Anti-spoofing has gathered more and more attention since the inauguration of the ASVspoof Challenges, and ASVspoof 2019 is dedicated to addressing attacks of all three major types: text-to-speech, voice conversion, and replay. Built upon previous research on Deep Neural Networks (DNNs), ASSERT is a pipeline for a DNN-based approach to anti-spoofing. ASSERT has four components: feature engineering, DNN models, network optimization, and system combination, where the DNN models are variants of squeeze-excitation and residual networks. We conducted an ablation study of the effectiveness of each component on the ASVspoof 2019 corpus, and experimental results showed that ASSERT obtained more than 93% and 17% relative improvements over the baseline systems in the two sub-challenges of ASVspoof 2019, placing ASSERT among the top-performing systems. Code and pretrained models will be made publicly available.


Key findings
ASSERT obtained more than 93% relative improvement over baseline systems in the Physical Access (PA) sub-challenge and 17% relative improvement in the Logical Access (LA) sub-challenge. The fusion system ranked 3rd in PA and 14th in LA. Log-power magnitude spectra (logspec) generally outperformed CQCC features, and for most DNN models a unified feature map with overlap performed better than either a unified feature map without overlap or whole-utterance input.
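To make the overlap distinction concrete, here is a minimal sketch of cutting a variable-length frame sequence into fixed-length segments. The function name `unified_feature_maps`, the `seg_len` and `hop` parameters, and the wrap-around padding of short inputs are assumptions for illustration, not details taken from the paper.

```python
from typing import List

Frame = List[float]  # one frame of acoustic features (e.g. one logspec column)


def unified_feature_maps(frames: List[Frame], seg_len: int, hop: int) -> List[List[Frame]]:
    """Split a frame sequence into segments of exactly seg_len frames.

    hop < seg_len yields overlapping segments; hop == seg_len yields none.
    Inputs that do not fill the last segment are extended by wrapping around
    to the start of the utterance (an assumed padding scheme).
    """
    if not frames:
        raise ValueError("need at least one frame")
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")

    n = len(frames)
    if n <= seg_len:
        n_segs = 1
    else:
        # Ceiling division: enough segments to cover every frame.
        n_segs = 1 + -(-(n - seg_len) // hop)

    total = seg_len + (n_segs - 1) * hop
    padded = [frames[i % n] for i in range(total)]
    return [padded[s * hop: s * hop + seg_len] for s in range(n_segs)]


# Ten one-dimensional frames, cut with and without overlap.
utterance = [[float(i)] for i in range(10)]
with_overlap = unified_feature_maps(utterance, seg_len=4, hop=2)
without_overlap = unified_feature_maps(utterance, seg_len=4, hop=4)
```

With overlap, each frame can appear in more than one segment, so the network sees more training examples per utterance; whole-utterance input skips this step entirely.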
Approach
The ASSERT system tackles anti-spoofing through a DNN-based pipeline. It involves extracting acoustic features like CQCC and log-power magnitude spectra, processing them into unified feature maps or whole utterances, and then feeding them into various deep neural networks including SENet (34/50), Mean-Std ResNet, Dilated ResNet, and Attentive-Filtering Network. The models are optimized using binary or multi-class cross-entropy objectives and combined using logistic regression fusion.
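The system combination step can be illustrated with a small score-level fusion sketch: each trial has one score per subsystem, and logistic regression learns a weight per subsystem plus a bias. Everything here (the function names `fit_fusion` and `fuse`, plain batch gradient descent, the learning rate, and the toy scores) is a hypothetical illustration, not the authors' implementation or data.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(scores: Sequence[float], w: Sequence[float], b: float) -> float:
    """Fused score: weighted sum of subsystem scores plus a bias."""
    return sum(wi * si for wi, si in zip(w, scores)) + b


def fit_fusion(
    scores: List[Sequence[float]],
    labels: List[int],
    lr: float = 0.1,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit fusion weights by batch gradient descent on the logistic loss.

    labels: 1 for bona fide, 0 for spoof (an assumed convention).
    """
    n, n_sys = len(scores), len(scores[0])
    w, b = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(fuse(s, w, b)) - y
            for j in range(n_sys):
                grad_w[j] += err * s[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy development scores from two hypothetical subsystems.
dev_scores = [[2.0, 1.5], [1.0, 2.5], [-1.5, -2.0], [-2.0, -0.5]]
dev_labels = [1, 1, 0, 0]
weights, bias = fit_fusion(dev_scores, dev_labels)
fused = [fuse(s, weights, bias) for s in dev_scores]
```

In practice the weights would be fit on a held-out development set and then applied unchanged to evaluation trials.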
Datasets
ASVspoof 2019 Challenge (ASVspoof 2019 corpus)
Model(s)
SENet34, SENet50, Mean-Std ResNet, Dilated ResNet, Attentive-Filtering Network
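For readers unfamiliar with the squeeze-excitation operation that the SENet variants build on, the following is a minimal pure-Python sketch of one such block. The function name `squeeze_excite`, the weight layout, and the omission of bias terms are simplifying assumptions; it is not the configuration used in ASSERT.

```python
import math
from typing import List

Matrix = List[List[float]]


def squeeze_excite(x: List[Matrix], w1: Matrix, w2: Matrix) -> List[Matrix]:
    """Apply a squeeze-excitation gate to a feature map.

    x:  C channels, each an H x W matrix.
    w1: (C/r) x C weights of the reducing layer (biases omitted for brevity).
    w2: C x (C/r) weights of the expanding layer.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

    # Excitation: bottleneck layer with ReLU, then sigmoid gates per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

    # Scale: reweight every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]


# Two channels of size 1 x 2; zero reducing weights make every gate 0.5.
feature_map = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled = squeeze_excite(feature_map, w1=[[0.0, 0.0]], w2=[[1.0], [1.0]])
```

In a squeeze-excitation residual network, the rescaled output of such a block is typically added to the block's identity (shortcut) branch; that addition is not shown here.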
Author countries
USA