SingFake: Singing Voice Deepfake Detection

Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

Published: 2023-09-14 08:49:05+00:00

Comment: Accepted at ICASSP 2024

AI Summary

This paper introduces the Singing Voice Deepfake Detection (SVDD) task and presents SingFake, the first curated in-the-wild dataset of singing voice deepfakes. It evaluates state-of-the-art speech countermeasure systems and shows that, when trained only on speech, they degrade severely on singing voices. Retraining these systems on SingFake yields substantial improvements, though challenges remain with unseen singers, languages, and musical contexts.

Abstract

The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/validation/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available at https://www.singfake.org/.


Key findings

Speech deepfake detection systems trained on speech data show severe performance degradation (EER near 50%) when evaluated on singing voice deepfakes. Retraining these systems on the SingFake dataset leads to substantial performance improvements, with EERs dropping significantly (e.g., to ~8-11% for best models). While showing robustness to unseen communication codecs, the systems still struggle with generalization to unseen singers, languages, and diverse musical contexts, indicating a need for specialized SVDD research.
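The Equal Error Rate (EER) cited above is the standard countermeasure metric: the operating point where the rate of spoofs accepted as bonafide equals the rate of bonafide clips rejected. A minimal, generic sketch of computing it from detector scores (not the paper's own evaluation code; it assumes labels of 1 for bonafide, 0 for deepfake, and higher scores meaning "more bonafide-like"):

```python
def compute_eer(labels, scores):
    """Equal Error Rate via a threshold sweep.

    labels: 1 = bonafide, 0 = deepfake/spoof
    scores: detector outputs, higher = more bonafide-like
    Returns the EER as a fraction in [0, 1].
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]  # bonafide scores
    neg = [s for s, y in zip(scores, labels) if y == 0]  # deepfake scores
    best_gap, eer = float("inf"), 1.0
    # Sweep every observed score as a decision threshold.
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in neg) / len(neg)  # deepfakes accepted
        frr = sum(s < thr for s in pos) / len(pos)   # bonafide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

An EER near 0.5 (50%) means the detector is at chance, which is what the paper reports for speech-trained systems evaluated on singing voices; retraining on SingFake brings the best systems down to roughly 0.08-0.11.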

Approach

The authors curate SingFake, a novel in-the-wild dataset of bonafide and deepfake singing voice clips, providing both full song mixtures and separated vocal tracks. They evaluate four state-of-the-art speech deepfake detection systems on the SVDD task, first using models pre-trained on speech utterances and then after retraining on SingFake, across test scenarios of varying difficulty.

Datasets

SingFake, ASVspoof2019LA

Model(s)

AASIST, Spectrogram+ResNet18, LFCC+ResNet18, Wav2Vec2+AASIST

Author countries

USA