Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture

Authors: Lei Wang, Benedict Yeoh, Jun Wah Ng

Published: 2022-10-07 14:30:13+00:00

Comment: Accepted by the 13th International Symposium on Chinese Spoken Language Processing (ISCSLP 2022)

AI Summary

This paper introduces the SE-Res2Net-Conformer architecture to enhance synthetic voice detection by better exploiting local acoustic patterns, showing improved performance on the ASVspoof 2019 database. Additionally, it re-formulates the audio splicing detection problem to focus on identifying splicing segment boundaries, proposing a deep learning approach for this task.

Abstract

Synthetic voice and spliced audio clips have been generated to spoof Internet users and artificial intelligence (AI) technologies such as voice authentication. Existing research treats spoofing countermeasures as a binary classification problem: bonafide vs. spoof. This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features. Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture improves spoofing countermeasure performance in the logical access scenario. In addition, this paper re-formulates the existing audio splicing detection problem: instead of identifying complete splicing segments, it is more useful to detect the boundaries of the spliced segments. Moreover, a deep learning approach can be used to solve the problem, in contrast to previous signal processing techniques.


Key findings
The proposed SE-Res2Net-Conformer improved synthetic voice detection performance, reducing the EER to 1.85% and the t-DCF to 0.06 on the ASVspoof 2019 LA evaluation set. For audio splicing detection, the models achieved higher accuracy on noisy data by capturing inconsistencies in ambient noise, with SE-Res2Net34-Conformer showing marginal gains over SE-Res2Net34.
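The EER reported above is the standard ASVspoof operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of how that metric is computed from per-utterance detection scores (the scores here are hypothetical, not from the paper's experiments):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the threshold where FRR (bonafide rejected)
    equals FAR (spoof accepted). Higher score = more bonafide."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep the threshold over sorted scores: everything at or below
    # the current index is rejected as spoof.
    frr = np.cumsum(labels) / labels.sum()                   # bonafide rejected so far
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoof still accepted
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Well-separated scores give an EER of 0
eer = compute_eer(np.array([2.0, 3.0, 4.0]), np.array([-1.0, 0.0, 1.0]))
```

The t-DCF additionally weights these error rates by their downstream cost to a speaker-verification system, so it cannot be computed from countermeasure scores alone.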
Approach
The paper proposes the SE-Res2Net-Conformer architecture, which cascades Conformer blocks after SE-Res2Net to extract more distinguishable local and temporal acoustic patterns for synthetic voice detection. For audio splicing, it re-defines the problem as detecting splicing segment boundaries, using a deep learning approach on acoustic features extracted from sliding audio chunks.
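The sliding-chunk re-formulation can be sketched as follows: slide fixed-length windows over the waveform and label each chunk by whether a splice boundary falls inside it, turning boundary detection into per-chunk binary classification. The chunk length, hop size, and labeling rule below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def make_chunks(waveform, chunk_len, hop):
    """Yield (start_sample, chunk) pairs from a 1-D waveform."""
    for start in range(0, len(waveform) - chunk_len + 1, hop):
        yield start, waveform[start:start + chunk_len]

def label_chunks(waveform, boundaries, chunk_len, hop):
    """Label each chunk 1 if any splice boundary lies strictly inside it,
    else 0. A classifier trained on chunk-level acoustic features would
    then localize boundaries to within one chunk."""
    labels = []
    for start, _ in make_chunks(waveform, chunk_len, hop):
        inside = any(start < b < start + chunk_len for b in boundaries)
        labels.append(int(inside))
    return labels

# Toy example: a 1000-sample waveform spliced at sample 500
wave = np.zeros(1000)
labels = label_chunks(wave, boundaries=[500], chunk_len=200, hop=100)
```

With this framing, only the chunks straddling sample 500 are positive, which is what lets a model learn boundary-local cues (such as ambient-noise inconsistencies) rather than whole-segment identity.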
Datasets
ASVspoof 2019 LA, TIMIT (custom spliced database)
Model(s)
SE-Res2Net-Conformer, SE-Res2Net34, SE-Res2Net50
Author countries
Singapore