Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture

Authors: Lei Wang, Benedict Yeoh, Jun Wah Ng

Published: 2022-10-07 14:30:13+00:00

Comment: Accepted by the 13th International Symposium on Chinese Spoken Language Processing (ISCSLP 2022)

AI Summary

This paper introduces the SE-Res2Net-Conformer architecture to enhance synthetic voice detection by better exploiting local acoustic patterns, showing improved performance on the ASVspoof 2019 database. Additionally, it re-formulates the audio splicing detection problem to focus on identifying splicing segment boundaries, proposing a deep learning approach for this task.

Abstract

Synthetic voice and spliced audio clips have been generated to spoof Internet users and artificial intelligence (AI) technologies such as voice authentication. Existing research treats spoofing countermeasures as a binary classification problem: bonafide vs. spoof. This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features. Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture improves spoofing countermeasure performance in the logical access scenario. In addition, this paper re-formulates the existing audio splicing detection problem: instead of identifying complete splicing segments, it is more useful to detect the boundaries of the spliced segments. Moreover, a deep learning approach can be used to solve the problem, in contrast to previous signal processing techniques.


Key findings
The proposed SE-Res2Net-Conformer improved synthetic voice detection performance, reducing the EER to 1.85% and the t-DCF to 0.06 on the ASVspoof 2019 LA evaluation set. For audio splicing detection, the models achieved higher accuracy on noisy data by capturing inconsistencies in ambient noise, with SE-Res2Net34-Conformer showing marginal gains over SE-Res2Net34.
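The EER reported above is the standard ASVspoof operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of how that metric is computed from per-utterance detection scores (the scores here are hypothetical, not from the paper's experiments):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the threshold where FRR (bonafide rejected)
    equals FAR (spoof accepted). Higher score = more bonafide."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep the threshold over sorted scores: everything at or below
    # the current index is rejected as spoof.
    frr = np.cumsum(labels) / labels.sum()                   # bonafide rejected so far
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoof still accepted
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Well-separated scores give an EER of 0
eer = compute_eer(np.array([2.0, 3.0, 4.0]), np.array([-1.0, 0.0, 1.0]))
```

The t-DCF additionally weights these error rates by their downstream cost to a speaker-verification system, so it cannot be computed from countermeasure scores alone.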
Approach
The paper proposes the SE-Res2Net-Conformer architecture, which cascades Conformer blocks after SE-Res2Net to extract more distinguishable local and temporal acoustic patterns for synthetic voice detection. For audio splicing, it re-defines the problem as detecting splicing segment boundaries, using a deep learning approach on acoustic features extracted from sliding audio chunks.
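The sliding-chunk re-formulation can be sketched as follows: slide fixed-length windows over the waveform and label each chunk by whether a splice boundary falls inside it, turning boundary detection into per-chunk binary classification. The chunk length, hop size, and labeling rule below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def make_chunks(waveform, chunk_len, hop):
    """Yield (start_sample, chunk) pairs from a 1-D waveform."""
    for start in range(0, len(waveform) - chunk_len + 1, hop):
        yield start, waveform[start:start + chunk_len]

def label_chunks(waveform, boundaries, chunk_len, hop):
    """Label each chunk 1 if any splice boundary lies strictly inside it,
    else 0. A classifier trained on chunk-level acoustic features would
    then localize boundaries to within one chunk."""
    labels = []
    for start, _ in make_chunks(waveform, chunk_len, hop):
        inside = any(start < b < start + chunk_len for b in boundaries)
        labels.append(int(inside))
    return labels

# Toy example: a 1000-sample waveform spliced at sample 500
wave = np.zeros(1000)
labels = label_chunks(wave, boundaries=[500], chunk_len=200, hop=100)
```

With this framing, only the chunks straddling sample 500 are positive, which is what lets a model learn boundary-local cues (such as ambient-noise inconsistencies) rather than whole-segment identity.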
Datasets
ASVspoof 2019 LA, TIMIT (custom spliced database)
Model(s)
SE-Res2Net-Conformer, SE-Res2Net34, SE-Res2Net50
Author countries
Singapore