Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang

Published: 2024-09-03 21:28:45+00:00

Comment: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

AI Summary

This work details an approach for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024, achieving a leading 1.79% pooled equal error rate (EER). The authors explore ensemble methods utilizing speech foundation models and introduce a novel Squeeze-and-Excitation Aggregation (SEA) method to efficiently integrate features, outperforming individual systems.

Abstract

This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) track. The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code is available at https://github.com/Anmol2059/SVDD2024.


Key findings
The best individual model, which pairs WavLM with the proposed SEA and parallel RawBoost augmentation, achieved an EER of 2.70%. The final ensemble system reached a 1.79% pooled EER, showing that combining diverse models substantially improves robustness and accuracy over any individual system.
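The headline numbers above are equal error rates: the operating point where the false acceptance rate (deepfakes accepted as bona fide) equals the false rejection rate (bona fide singing rejected). A minimal sketch of how this metric can be estimated from pooled scores, using a simple threshold sweep (the function name and exact interpolation scheme are illustrative, not from the paper):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER: the threshold where the false acceptance
    rate (FAR) equals the false rejection rate (FRR).
    scores: higher = more bona fide; labels: 1 = bona fide, 0 = deepfake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # fakes accepted as bona fide
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A perfectly separating system yields an EER of 0; chance-level scoring trends toward 0.5. "Pooled" EER in the challenge means scores from all attack types and source corpora are evaluated jointly with a single threshold sweep.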
Approach
The authors employ ensemble methods combining various individual models, which leverage speech foundation models (WavLM, wav2vec2) and RawNet2-style SincConv layers as frontends. They incorporate RawBoost data augmentation and propose a novel Squeeze-and-Excitation Aggregation (SEA) method for dynamically weighting and integrating features from different layers, with AASIST serving as the backend architecture.
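The core idea of SEA is to squeeze each transformer layer's output into a scalar descriptor, pass the descriptors through a small excitation bottleneck, and use the resulting sigmoid gates to re-weight and sum the layers before the AASIST backend. A minimal numpy sketch under stated assumptions (the function name, bottleneck shape, and pooling choice are hypothetical; the paper's trained module may differ in detail):

```python
import numpy as np

def sea_aggregate(layer_feats, w1, w2):
    """Sketch of Squeeze-and-Excitation-style layer aggregation.
    layer_feats: (L, T, D) hidden states from L foundation-model layers.
    w1: (L, L//r), w2: (L//r, L) - bottleneck weights (trained in practice,
    passed in here for illustration). Returns a (T, D) weighted sum."""
    # Squeeze: global average pool each layer to one scalar descriptor.
    z = layer_feats.mean(axis=(1, 2))                  # (L,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    h = np.maximum(z @ w1, 0.0)                        # (L//r,)
    gates = 1.0 / (1.0 + np.exp(-(h @ w2)))            # (L,)
    # Re-weight each layer by its gate and sum over the layer axis.
    return np.tensordot(gates, layer_feats, axes=(0, 0))  # (T, D)
```

Compared with a fixed or learned-but-static layer weighting, the gates here are computed from the input features themselves, which is what makes the aggregation dynamic per utterance.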
Datasets
CtrSVDD track official training and development datasets; JVS, Kiritan, Ofuton-P, Oniku Kurumi
Model(s)
WavLM, wav2vec2, RawNet2-style SincConv layers, AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks), Proposed Squeeze-and-Excitation Aggregation (SEA)
Author countries
Singapore, India