Spoofing-Robust Speaker Verification Using Parallel Embedding Fusion: BTU Speech Group's Approach for ASVspoof5 Challenge

Authors: Oğuzhan Kurnaz, Selim Can Demirtaş, Aykut Büker, Jagabandhu Mishra, Cemal Hanilçi

Published: 2024-08-28 15:48:03+00:00

Comment: Accepted in ASVspoof2024 workshop

Journal Ref: 10.21437/ASVspoof.2024

AI Summary

This paper introduces BTU Speech Group's parallel network-based spoofing-aware speaker verification (SASV) system for the ASVspoof5 Challenge. The system integrates automatic speaker verification (ASV) and countermeasure (CM) models through embedding fusion, using a novel parallel DNN structure in which each network independently processes a different combination of input embeddings. The final SASV probability is the average of the scores from these parallel networks, which is intended to improve robustness against spoofing attacks.

Abstract

This paper introduces the parallel network-based spoofing-aware speaker verification (SASV) system developed by BTU Speech Group for the ASVspoof5 Challenge. The SASV system integrates ASV and CM systems to enhance security against spoofing attacks. Our approach employs score and embedding fusion from ASV models (ECAPA-TDNN, WavLM) and CM models (AASIST). The fused embeddings are processed using a simple DNN structure, which is optimized with a combination of the recently proposed a-DCF loss and the BCE loss. We introduce a novel parallel network structure in which two identical DNNs, fed with different inputs, independently process embeddings and produce SASV scores. The final SASV probability is derived by averaging these scores, enhancing robustness and accuracy. Experimental results demonstrate that the proposed parallel DNN structure outperforms traditional single DNN methods, offering a more reliable and secure speaker verification system against spoofing attacks.


Key findings
The proposed parallel DNN structure significantly outperforms traditional single DNN methods for embedding fusion, with the best individual system achieving an architecture-agnostic detection cost function (a-DCF) of 0.2492 on the progress set. Score fusion of the two top-performing parallel models (ECAPA-TDNN+AASIST and WavLM+AASIST) improved performance further, yielding the best result of 0.2129 a-DCF on the progress set. The study also found that WavLM embeddings capture speaker-specific information effectively, even though WavLM was not originally trained for speaker recognition.
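Since the results above are reported as a-DCF values, a minimal sketch of how such a cost can be evaluated at one fixed threshold may help in reading them. It follows the general form of the a-DCF: a weighted sum of the target miss rate, the non-target false-acceptance rate, and the spoof false-acceptance rate. The function name, the label strings, and every prior and cost value in the usage lines are illustrative assumptions rather than the challenge's official settings, and the sketch does not reproduce any normalisation or threshold selection that the reported numbers may involve.

```python
import numpy as np

def a_dcf_at_threshold(scores, labels, threshold,
                       pi_tar, pi_non, pi_spf,
                       c_miss, c_fa_non, c_fa_spf):
    """Weighted detection cost of a single SASV score at one threshold.

    labels holds one of "target", "nontarget", "spoof" per trial
    (hypothetical label names chosen for this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    accept = scores >= threshold  # trials the system would accept

    # Error rates for the three trial classes.
    p_miss = np.mean(~accept[labels == "target"])       # rejected targets
    p_fa_non = np.mean(accept[labels == "nontarget"])   # accepted non-targets
    p_fa_spf = np.mean(accept[labels == "spoof"])       # accepted spoofs

    return (c_miss * pi_tar * p_miss
            + c_fa_non * pi_non * p_fa_non
            + c_fa_spf * pi_spf * p_fa_spf)

# Toy usage with made-up scores; priors and costs are placeholders only.
toy_scores = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3]
toy_labels = ["target", "target", "nontarget", "nontarget", "spoof", "spoof"]
toy_cost = a_dcf_at_threshold(toy_scores, toy_labels, threshold=0.5,
                              pi_tar=0.8, pi_non=0.1, pi_spf=0.1,
                              c_miss=1.0, c_fa_non=10.0, c_fa_spf=10.0)
```

Because the quantity is a cost, lower values indicate a better system, which is why 0.2129 is an improvement over 0.2492.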
Approach
The approach fuses embeddings from ASV models (ECAPA-TDNN, WavLM) and a CM model (AASIST). A novel parallel network structure uses two identical DNNs, each fed with a different combination of these embeddings, to independently produce SASV scores. The two scores are then averaged to yield the final SASV probability. The networks are trained with a combined a-DCF and BCE loss.
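The parallel structure described above can be sketched roughly as follows, using NumPy for the forward pass only. This is an illustration under stated assumptions, not the authors' implementation: the embedding dimensions, the hidden size, the one-hidden-layer architecture, and in particular the choice of which embeddings feed each branch are hypothetical, since the summary does not specify them. Training with the combined a-DCF and BCE loss is omitted.

```python
import numpy as np

def init_mlp(rng, in_dim, hidden_dim):
    """Randomly initialise a one-hidden-layer MLP with a single logit output."""
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) * 0.1,
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, 1)) * 0.1,
        "b2": np.zeros(1),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_prob(params, x):
    """Map a batch of fused embeddings to SASV probabilities in (0, 1)."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU layer
    logit = hidden @ params["W2"] + params["b2"]
    return sigmoid(logit).squeeze(-1)

def parallel_sasv_prob(params_a, params_b, x_a, x_b):
    """Average the outputs of two same-architecture networks fed different inputs."""
    return 0.5 * (mlp_prob(params_a, x_a) + mlp_prob(params_b, x_b))

# Illustrative sizes only (assumptions, not taken from the paper).
ASV_DIM, CM_DIM, HIDDEN, BATCH = 192, 160, 64, 4
rng = np.random.default_rng(0)

# Random stand-ins for embeddings that real ASV and CM models would produce.
asv_enrol = rng.standard_normal((BATCH, ASV_DIM))
asv_test = rng.standard_normal((BATCH, ASV_DIM))
cm_test = rng.standard_normal((BATCH, CM_DIM))

# Hypothetical input combinations: each branch sees a different fused vector.
x_a = np.concatenate([asv_enrol, cm_test], axis=1)
x_b = np.concatenate([asv_test, cm_test], axis=1)

# Same architecture, separate weights for each branch.
net_a = init_mlp(rng, ASV_DIM + CM_DIM, HIDDEN)
net_b = init_mlp(rng, ASV_DIM + CM_DIM, HIDDEN)

sasv_prob = parallel_sasv_prob(net_a, net_b, x_a, x_b)
```

Averaging probabilities rather than logits keeps the final output in the same (0, 1) range as each branch, which matches the description of the result as an SASV probability.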
Datasets
ASVspoof5, Voxceleb2
Model(s)
ECAPA-TDNN, WavLM, AASIST, Deep Neural Networks (DNNs)
Author countries
Turkey, Finland