Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

Authors: Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich

Published: 2024-06-27 15:08:51+00:00

AI Summary

This paper presents an Automatic Speaker Verification (ASV) system that extracts speaker embeddings capturing characteristics such as pitch, energy, and phoneme duration. Although the system is intended for a multi-voice TTS pipeline, it was primarily evaluated on identifying the original speaker in voice-converted audio within the SSTC challenge, where it achieved an Equal Error Rate (EER) of 20.669%.

Abstract

Automatic speaker verification (ASV), which authenticates a person by voice, is one of the most crucial components of biometric security. ASV systems can be used in isolation or in conjunction with other AI models. The quality and number of neural networks are growing rapidly, and so is the number of systems that manipulate data through voice conversion (VC) and text-to-speech (TTS) models. Research on countering voice biometrics forgery is supported by a number of challenges, including SSTC, ASVspoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is to extract embeddings from the target speaker's audio that capture important characteristics of the voice, such as pitch, energy, and phoneme duration. This information is used in our multi-voice TTS pipeline, which is currently under development. The model was also employed in the SSTC challenge to verify speakers whose voices had undergone voice conversion, where it achieved an EER of 20.669%.


Key findings
The ASV system achieved an EER of 20.669% when identifying original speakers in voice-converted audio within the SSTC challenge; this result was obtained with an ensemble. The extracted ASV embeddings also benefited a TTS duration predictor, improving its accuracy because they retain phoneme length information, and utterance embeddings performed best.
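Since EER is the headline metric here, a minimal sketch of how it can be estimated from trial scores may help. It assumes that higher scores mean "same speaker" and uses a simple threshold sweep without interpolation; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the paper or the challenge tooling.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate (eer, threshold), assuming higher scores mean 'same speaker'.

    Sweeps every observed score as a candidate threshold and returns the
    point where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (gap, eer estimate, threshold)
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy trial scores (hypothetical): one genuine trial scores below one impostor.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; this sketch only picks the closest point.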
Approach
The system uses a multi-encoder architecture: a 10-block Specblock encoder processes the Constant-Q Transform (CQT), and Vision Transformers (ViT) process the Mel-spectrogram and the pitch spectrogram. The encoder outputs are concatenated and passed through a fully connected layer to form the speaker embedding, and the model is trained with the Additive Margin Softmax (AM-Softmax) loss. For the final SSTC challenge submission, this model was ensembled with a competition baseline model.
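To make the training objective concrete, here is a minimal pure-Python sketch of the AM-Softmax loss for a single example. The loss subtracts a margin from the cosine similarity of the true class before scaling and applying softmax cross-entropy. The function names and the default `scale=30.0` and `margin=0.35` are illustrative assumptions; this summary does not state the hyperparameters the authors used.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def am_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.35):
    """AM-Softmax loss for one example (scale and margin are assumed values).

    embedding: speaker embedding as a list of floats.
    class_weights: one weight vector per training speaker.
    label: index of the true speaker.
    """
    unit_embedding = l2_normalize(embedding)
    logits = []
    for class_index, weights in enumerate(class_weights):
        # Cosine similarity between the embedding and the class centre.
        cosine = sum(a * b for a, b in zip(unit_embedding, l2_normalize(weights)))
        if class_index == label:
            # The additive margin is applied to the true class only.
            cosine -= margin
        logits.append(scale * cosine)
    # Numerically stable log-sum-exp, then cross-entropy for the true class.
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_partition - logits[label]


# Hypothetical two-speaker example: the margin makes the same sample harder.
centres = [[1.0, 0.0], [0.0, 1.0]]
print(am_softmax_loss([1.0, 0.0], centres, label=0, margin=0.0))
print(am_softmax_loss([1.0, 0.0], centres, label=0, margin=0.35))
```

The margin forces the true-class cosine to exceed the others by a fixed amount, which tends to produce more compact speaker clusters in the embedding space.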
Datasets
Kaggle Speaker Recognition Dataset, CMU ARCTIC, LibriTTS-R, LibriSpeech (train-clean-100, train-clean-360, dev-clean, test-clean), VoxCeleb 2 dev, VoxCeleb 1 test
Model(s)
Specblock (complex convolutional layers), Vision Transformer (ViT), Additive Margin Softmax (AM-Softmax) loss function, Model Ensembling
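This summary does not describe how the ensemble was formed. One common option is score-level fusion, sketched below under that assumption: each system's trial scores are z-normalized and then averaged with a weight. The function names, the weight, and the toy scores are hypothetical and do not represent the authors' procedure.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing to scale, return zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two systems' z-normalized scores, trial by trial."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same trials")
    norm_a = z_normalize(scores_a)
    norm_b = z_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(norm_a, norm_b)]


# Toy scores from two hypothetical systems on the same three trials.
fused = fuse_scores([0.2, 0.5, 0.9], [10.0, 40.0, 70.0])
print(fused)
```

Normalizing first keeps a system with a wider score range from dominating the average; the fusion weight would normally be tuned on a development set.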
Author countries
Russia