Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang

Published: 2024-09-03 21:28:45+00:00

Comment: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

AI Summary

This work details an approach for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024, achieving a leading 1.79% pooled equal error rate (EER). The authors explore ensemble methods utilizing speech foundation models and introduce a novel Squeeze-and-Excitation Aggregation (SEA) method to efficiently integrate features, outperforming individual systems.

Abstract

This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) track. The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code is available at https://github.com/Anmol2059/SVDD2024.


Key findings
The best individual model, which pairs WavLM with the proposed SEA and parallel RawBoost augmentation, achieved an EER of 2.70%. The final ensemble system reached a 1.79% pooled EER, showing that combining diverse models substantially improves robustness and accuracy over any individual system.
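The headline numbers above are equal error rates: the operating point where the false acceptance rate (deepfakes accepted as bona fide) equals the false rejection rate (bona fide singing rejected). A minimal sketch of how this metric can be estimated from pooled scores, using a simple threshold sweep (the function name and exact interpolation scheme are illustrative, not from the paper):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER: the threshold where the false acceptance
    rate (FAR) equals the false rejection rate (FRR).
    scores: higher = more bona fide; labels: 1 = bona fide, 0 = deepfake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # fakes accepted as bona fide
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A perfectly separating system yields an EER of 0; chance-level scoring trends toward 0.5. "Pooled" EER in the challenge means scores from all attack types and source corpora are evaluated jointly with a single threshold sweep.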
Approach
The authors employ ensemble methods combining various individual models, which leverage speech foundation models (WavLM, wav2vec2) and RawNet2-style SincConv layers as frontends. They incorporate RawBoost data augmentation and propose a novel Squeeze-and-Excitation Aggregation (SEA) method for dynamically weighting and integrating features from different layers, with AASIST serving as the backend architecture.
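The core idea of SEA is to squeeze each transformer layer's output into a scalar descriptor, pass the descriptors through a small excitation bottleneck, and use the resulting sigmoid gates to re-weight and sum the layers before the AASIST backend. A minimal numpy sketch under stated assumptions (the function name, bottleneck shape, and pooling choice are hypothetical; the paper's trained module may differ in detail):

```python
import numpy as np

def sea_aggregate(layer_feats, w1, w2):
    """Sketch of Squeeze-and-Excitation-style layer aggregation.
    layer_feats: (L, T, D) hidden states from L foundation-model layers.
    w1: (L, L//r), w2: (L//r, L) - bottleneck weights (trained in practice,
    passed in here for illustration). Returns a (T, D) weighted sum."""
    # Squeeze: global average pool each layer to one scalar descriptor.
    z = layer_feats.mean(axis=(1, 2))                  # (L,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    h = np.maximum(z @ w1, 0.0)                        # (L//r,)
    gates = 1.0 / (1.0 + np.exp(-(h @ w2)))            # (L,)
    # Re-weight each layer by its gate and sum over the layer axis.
    return np.tensordot(gates, layer_feats, axes=(0, 0))  # (T, D)
```

Compared with a fixed or learned-but-static layer weighting, the gates here are computed from the input features themselves, which is what makes the aggregation dynamic per utterance.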
Datasets
CtrSVDD track official training and development datasets; JVS, Kiritan, Ofuton-P, Oniku Kurumi
Model(s)
WavLM, wav2vec2, RawNet2-style SincConv layers, AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks), Proposed Squeeze-and-Excitation Aggregation (SEA)
Author countries
Singapore, India