SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

Published: 2024-08-28 20:48:04+00:00

Comment: 6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)

AI Summary

The SVDD 2024 Challenge was launched to advance research in detecting AI-generated singing voices, featuring two tracks: a controlled setting (CtrSVDD) and an in-the-wild scenario (WildSVDD). The challenge successfully attracted 47 submissions for CtrSVDD, with 37 teams surpassing baselines and the top team achieving a 1.65% equal error rate. This paper reviews the results, discusses key findings, and outlines future directions for singing voice deepfake detection research.

Abstract

With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.


Key findings
The CtrSVDD track demonstrated significant progress in singing voice deepfake detection: 37 of the 47 participating teams outperformed the baseline systems, and the top team achieved a 1.65% equal error rate (EER). While self-supervised learning (SSL) features and ensemble methods were common strategies among the top performers, generalization to unseen generation methods and to out-of-domain commercial SVS systems remains challenging. No submissions were received for the WildSVDD track, highlighting the difficulty of in-the-wild scenarios, possibly due to data preparation complexities and copyright concerns.
Approach
The SVDD Challenge is structured into two tracks: CtrSVDD, using clean, unaccompanied vocals generated by various SVS/SVC systems, and WildSVDD, expanding the SingFake dataset with deepfakes from online media, often with background music. Participants' systems were evaluated using the Equal Error Rate (EER), with baselines established for comparison in both tracks to drive advancements in the field.
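Both tracks score systems by the Equal Error Rate: the operating point at which the false acceptance rate (spoofed clips accepted as bonafide) equals the false rejection rate (bonafide clips rejected). A minimal pure-Python sketch of this computation, assuming higher scores mean "more bonafide" (the function name and threshold sweep are illustrative, not the challenge's official scoring toolkit):

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: sweep candidate thresholds and return the
    rate at the point where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest. Higher score = more
    likely bonafide. Illustrative sketch, not the official scorer."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FAR: fraction of spoofed clips scored at or above threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # FRR: fraction of bonafide clips scored below threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0.0:
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1]))  # → 0.0
```

A lower EER means better detection; the top CtrSVDD team's 1.65% EER means that at the equal-error threshold, only 1.65% of clips in each class were misclassified.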
Datasets
CtrSVDD database (derived from Opencpop, M4Singer, KiSing, ACE-Studio, Ofuton-P, Oniku Kurumi, Kiritan, JVS-MuSiC using 14 SVS/SVC systems), WildSVDD (expanded SingFake dataset). Additional datasets used by top teams include HiFi-TTS, OpenSinger, CSD, itako-Singing, JSUT-Song, Namine Ritsu Utagoe DB, no7-singing, PJS, PopCS, URS-ing.
Model(s)
Baseline models included AASIST-based systems (using raw waveforms or LFCC features) and XLS-R-based systems. Top-performing teams commonly utilized self-supervised learning (SSL) frontends such as wav2vec 2.0 XLS-R, Chinese HuBERT, and WavLM, often coupled with backends such as ResNet and AASIST, and employed ensemble learning and adversarial training.
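One common form of the ensemble learning mentioned above is score-level fusion: each subsystem (e.g., a different SSL frontend plus backend) produces a per-utterance detection score, and the final score is a weighted average. A minimal sketch under that assumption (the function name and equal-weight default are ours; real submissions may instead fuse logits or calibrate scores first):

```python
def fuse_scores(score_lists, weights=None):
    """Score-level fusion: combine per-utterance detection scores
    from several systems into one score via a weighted average.
    score_lists is a list of per-system score lists, all aligned to
    the same utterance order. Illustrative sketch only."""
    n_systems = len(score_lists)
    if weights is None:
        # Default to an unweighted (equal-weight) average.
        weights = [1.0 / n_systems] * n_systems
    n_utts = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists))
        for i in range(n_utts)
    ]

# Two systems, two utterances; equal weights average their scores:
print(fuse_scores([[1.0, 0.0], [0.0, 1.0]]))  # → [0.5, 0.5]
```

Fusion helps when subsystems fail on different spoofing methods, which is one reason ensembles generalized better to unseen generators in the challenge results.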
Author countries
USA, Japan