Advanced Signal Analysis in Detecting Replay Attacks for Automatic Speaker Verification Systems

Authors: Lee Shih Kuang

Published: 2024-03-02 08:19:58+00:00

Comment: https://github.com/shihkuanglee/ADFA

AI Summary

This study introduces novel signal analysis methods, Arbitrary Analysis (AA), Mel Scale Analysis (MA), and Constant Q Analysis (CQA), for replay speech detection in automatic speaker verification (ASV) systems. Inspired by the Fourier inversion formula, these methods analyze signals using alternative groups of sinusoidal sequences. They show superior efficacy and/or efficiency compared to conventional methods on the ASVspoof 2019 & 2021 PA databases, especially when integrated with the Temporal Autocorrelation of Speech (TAC) feature.
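For reference, the Fourier pair the methods build on is the discrete Fourier transform (DFT) and its inverse, the Fourier inversion formula. One hedged reading of "alternative sinusoidal sequence groups" is that the analysis sum keeps its form while the uniformly spaced bin frequencies k f_s / N are replaced by another set F. The symbols F, f_k, f_s, K, m_k and B below are assumptions for illustration, not the paper's notation.

```latex
\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},
  & x[n] &= \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N} \\
X(f_k) &= \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi f_k n / f_s},
  & f_k &\in F = \{f_1, \dots, f_K\}
\end{aligned}
```

Choosing f_k = k f_s / N recovers X[k] exactly. Under the same assumption, an MA grid would place f_k at equal steps m_k of the mel scale, f_k = 700 (10^{m_k / 2595} - 1), and a CQA grid would use geometrically spaced f_k = f_min 2^{k / B} with B bins per octave.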

Abstract

This study proposes novel signal analysis methods for replay speech detection in automatic speaker verification (ASV) systems. The proposed methods, namely arbitrary analysis (AA), mel scale analysis (MA), and constant Q analysis (CQA), are inspired by the calculation of the Fourier inversion formula. By employing alternative sinusoidal sequence groups, these methods introduce new perspectives in signal analysis for replay speech detection. The efficacy of the proposed methods is examined experimentally on the ASVspoof 2019 & 2021 PA databases and confirmed by the performance of systems that incorporate them; the successful integration of the proposed methods with a speech feature that calculates the temporal autocorrelation of speech (TAC) from complex spectra provides particularly strong confirmation. Moreover, the proposed CQA and MA methods outperform the conventional methods in efficiency and efficacy, respectively, when analyzing speech signals (CQA runs approximately 2.36 times as fast as the conventional constant Q transform (CQT) method), making them promising for music and speech processing tasks.
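The sketch below is a minimal illustration, not the implementation in the ADFA repository. It assumes each method correlates a speech frame with complex sinusoids at its own frequency set: an arbitrary list for AA, mel-spaced frequencies for MA, and geometrically spaced frequencies for CQA. The helper names (`analyze`, `mel_frequencies`, `constant_q_frequencies`), the mel formula variant, and all parameter values are hypothetical.

```python
import cmath
import math


def hz_to_mel(f_hz):
    # Common mel formula; assumed here, the paper may use another variant
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_frequencies(n_bins, f_min, f_max):
    """Hypothetical MA grid: n_bins frequencies equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bins - 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bins)]


def constant_q_frequencies(n_bins, f_min, bins_per_octave):
    """Hypothetical CQA grid: f_k = f_min * 2**(k / B), a constant ratio between bins."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def analyze(frame, freqs_hz, sample_rate):
    """Hypothetical AA core: X(f) = sum_n x[n] * exp(-j 2 pi f n / fs) for each f.

    With freqs_hz = [k * sample_rate / len(frame) for k in range(len(frame))]
    this is exactly the DFT; any other frequency list gives a non-uniform analysis.
    """
    spectrum = []
    for f in freqs_hz:
        omega = 2.0 * math.pi * f / sample_rate  # radians per sample
        spectrum.append(sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(frame)))
    return spectrum


if __name__ == "__main__":
    sr = 16000
    frame = [math.cos(2.0 * math.pi * 1000.0 * n / sr) for n in range(512)]
    mel_spec = analyze(frame, mel_frequencies(80, 20.0, 8000.0), sr)
    cq_spec = analyze(frame, constant_q_frequencies(84, 32.7, 12), sr)
    print(len(mel_spec), len(cq_spec))  # 80 84
```

The direct sum costs O(N·K) operations per frame and uses one fixed-length frame for every frequency; it shows what is computed, not how the reported 2.36-fold speedup over CQT is achieved.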


Key findings
The Constant Q Analysis (CQA) method substantially improved efficiency, running approximately 2.36 times as fast as the conventional Constant Q Transform (CQT) method. The Mel Scale Analysis (MA) method showed superior efficacy, particularly in capturing characteristics of human speech, and performed well under unseen conditions (ASVspoof 2021-eval). The proposed methods, especially when integrated with the Temporal Autocorrelation of Speech (TAC) feature, consistently improved replay speech detection performance.
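The summary describes TAC only as the temporal autocorrelation of speech computed from complex spectra. One plausible reading, sketched here purely as an assumption, is a per-bin autocorrelation of the complex spectrum across frames. The function name `temporal_autocorrelation` and the normalization by the number of frame pairs are hypothetical.

```python
def temporal_autocorrelation(spec, max_lag):
    """Hypothetical TAC-style feature (an assumption, not the paper's definition).

    spec: T frames, each a list of K complex bins (e.g. output of an AA/MA/CQA analysis).
    Returns R with R[lag][k] = mean over t of spec[t][k] * conj(spec[t + lag][k]).
    """
    n_frames = len(spec)
    if not 0 <= max_lag < n_frames:
        raise ValueError("max_lag must be in [0, number of frames)")
    n_bins = len(spec[0])
    result = []
    for lag in range(max_lag + 1):
        pairs = n_frames - lag  # number of frame pairs available at this lag
        row = []
        for k in range(n_bins):
            acc = sum(spec[t][k] * spec[t + lag][k].conjugate() for t in range(pairs))
            row.append(acc / pairs)
        result.append(row)
    return result


if __name__ == "__main__":
    frames = [[1 + 0j, 2 + 0j], [1j, 2 + 0j], [-1 + 0j, 2 + 0j]]
    print(temporal_autocorrelation(frames, 1))
```

Lag 0 gives the mean power per bin, while the complex values at nonzero lags retain inter-frame phase differences that a magnitude spectrogram discards.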
Approach
The authors propose three signal analysis methods (AA, MA, CQA) inspired by the Fourier inversion formula to generate speech spectra using alternative sinusoidal sequence groups. These novel spectral representations are then utilized as speech features for training a Light Convolutional Neural Network (LCNN) to detect replay attacks in ASV systems.
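The summary does not give the LCNN configuration. LCNNs as commonly used in ASVspoof countermeasures are built around the Max-Feature-Map (MFM) activation, which splits the channels into two halves and keeps their element-wise maximum. The pure-Python sketch below shows only that operation on feature maps flattened to lists; the function name `max_feature_map` is hypothetical, and a real model would apply MFM to convolution outputs inside a deep learning framework.

```python
def max_feature_map(channels):
    """MFM activation: 2C input channels -> C output channels via element-wise max.

    channels: list of 2C feature maps, each flattened to a list of floats.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM requires an even number of channels")
    half = len(channels) // 2
    # Pair channel i with channel i + half and keep the larger response at each position
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


if __name__ == "__main__":
    print(max_feature_map([[1.0, 5.0], [3.0, 0.0], [2.0, 2.0], [4.0, -1.0]]))
    # [[2.0, 5.0], [4.0, 0.0]]
```

Halving the channel count at every MFM layer acts as a competitive feature selection and keeps the parameter count low, which is what makes the network "light".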
Datasets
ASVspoof 2019 PA, ASVspoof 2021 PA
Model(s)
Light Convolutional Neural Network (LCNN)
Author countries
Taiwan