SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

Published: 2024-08-28 20:48:04+00:00

Comment: 6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)

AI Summary

The SVDD 2024 Challenge was launched to advance research in detecting AI-generated singing voices, featuring two tracks: a controlled setting (CtrSVDD) and an in-the-wild scenario (WildSVDD). The challenge successfully attracted 47 submissions for CtrSVDD, with 37 teams surpassing baselines and the top team achieving a 1.65% equal error rate. This paper reviews the results, discusses key findings, and outlines future directions for singing voice deepfake detection research.

Abstract

With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.


Key findings
The CtrSVDD track demonstrated significant progress in singing voice deepfake detection: 37 of the 47 participating teams outperformed the baseline systems, and the top team achieved a 1.65% equal error rate (EER). While self-supervised learning (SSL) features and ensemble methods were common strategies among the top performers, generalization to unseen generation methods and to out-of-domain commercial SVS systems remains challenging. No submissions were received for the WildSVDD track, highlighting the difficulty of in-the-wild scenarios, possibly due to data preparation complexities and copyright concerns.
Approach
The SVDD Challenge is structured into two tracks: CtrSVDD, using clean, unaccompanied vocals generated by various SVS/SVC systems, and WildSVDD, expanding the SingFake dataset with deepfakes from online media, often with background music. Participants' systems were evaluated using the Equal Error Rate (EER), with baselines established for comparison in both tracks to drive advancements in the field.
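Both tracks score systems by the Equal Error Rate: the operating point at which the false acceptance rate (spoofed clips accepted as bonafide) equals the false rejection rate (bonafide clips rejected). A minimal pure-Python sketch of this computation, assuming higher scores mean "more bonafide" (the function name and threshold sweep are illustrative, not the challenge's official scoring toolkit):

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: sweep candidate thresholds and return the
    rate at the point where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest. Higher score = more
    likely bonafide. Illustrative sketch, not the official scorer."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FAR: fraction of spoofed clips scored at or above threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # FRR: fraction of bonafide clips scored below threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0.0:
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1]))  # → 0.0
```

A lower EER means better detection; the top CtrSVDD team's 1.65% EER means that at the equal-error threshold, only 1.65% of clips in each class were misclassified.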
Datasets
CtrSVDD database (derived from Opencpop, M4Singer, KiSing, ACE-Studio, Ofuton-P, Oniku Kurumi, Kiritan, JVS-MuSiC using 14 SVS/SVC systems), WildSVDD (expanded SingFake dataset). Additional datasets used by top teams include HiFi-TTS, OpenSinger, CSD, itako-Singing, JSUT-Song, Namine Ritsu Utagoe DB, no7-singing, PJS, PopCS, URS-ing.
Model(s)
Baseline models included AASIST-based systems (using raw waveforms or LFCC features) and XLS-R-based systems. Top-performing teams commonly utilized self-supervised learning (SSL) frontends such as wav2vec 2.0 XLS-R, Chinese HuBERT, and WavLM, often coupled with backends such as ResNet and AASIST, and employed ensemble learning and adversarial training.
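One common form of the ensemble learning mentioned above is score-level fusion: each subsystem (e.g., a different SSL frontend plus backend) produces a per-utterance detection score, and the final score is a weighted average. A minimal sketch under that assumption (the function name and equal-weight default are ours; real submissions may instead fuse logits or calibrate scores first):

```python
def fuse_scores(score_lists, weights=None):
    """Score-level fusion: combine per-utterance detection scores
    from several systems into one score via a weighted average.
    score_lists is a list of per-system score lists, all aligned to
    the same utterance order. Illustrative sketch only."""
    n_systems = len(score_lists)
    if weights is None:
        # Default to an unweighted (equal-weight) average.
        weights = [1.0 / n_systems] * n_systems
    n_utts = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists))
        for i in range(n_utts)
    ]

# Two systems, two utterances; equal weights average their scores:
print(fuse_scores([[1.0, 0.0], [0.0, 1.0]]))  # → [0.5, 0.5]
```

Fusion helps when subsystems fail on different spoofing methods, which is one reason ensembles generalized better to unseen generators in the challenge results.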
Author countries
USA, Japan