WaveFake: A Data Set to Facilitate Audio Deepfake Detection

Authors: Joel Frank, Lea Schönherr

Published: 2021-11-04 12:26:34+00:00

Comment: Accepted to NeurIPS 2021 (Benchmark and Dataset Track); Code: https://github.com/RUB-SysSec/WaveFake; Data: https://zenodo.org/record/5642694

AI Summary

This paper introduces WaveFake, a novel dataset comprising approximately 196 hours of generated audio from ten sample sets using six different state-of-the-art generative network architectures across two languages. It aims to address the lack of research in audio deepfake detection by providing a comprehensive dataset, an overview of audio signal processing techniques, and two baseline detection models. This resource facilitates further research and development in identifying synthetic audio signals.

Abstract

Deep generative modeling has the potential to cause significant harm to society. Recognizing this threat, a large body of research into detecting so-called Deepfakes has emerged. This research most often focuses on the image domain, while studies exploring generated audio signals have, so far, been neglected. In this paper, we make three key contributions to narrow this gap. First, we provide researchers with an introduction to common signal processing techniques used for analyzing audio signals. Second, we present a novel data set, for which we collected nine sample sets from five different network architectures, spanning two languages. Finally, we supply practitioners with two baseline models, adopted from the signal processing community, to facilitate further research in this area.


Key findings

The analysis of generated audio revealed subtle differences, particularly in higher frequencies and prosody, across various generative architectures. While neural network-based detectors (RawNet2) generally achieved better average performance, traditional GMM classifiers proved to be more robust and generalized better across different generative models and simulated real-world conditions like phone calls. High-frequency information was found to be indispensable for effective detection.
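
To make the point about high-frequency information concrete, the following is a minimal sketch (not the authors' analysis code) that measures how much of a signal's spectral energy lies at or above a cutoff. The sample rate, the 4 kHz cutoff, and the test tones are illustrative assumptions; a band-limited channel such as a phone call would discard roughly the portion this ratio reports.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (naive O(n^2) DFT)."""
    n = len(samples)
    total = 0.0
    high = 0.0
    # For a real-valued signal, bins up to the Nyquist frequency are enough.
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def sine(freq_hz, sample_rate, n):
    """Generate n samples of a unit-amplitude sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


SAMPLE_RATE = 16000  # assumption, chosen so the tones fall on exact DFT bins
CUTOFF_HZ = 4000     # assumption, a stand-in for a narrowband channel limit

low_ratio = high_band_energy_ratio(sine(1000, SAMPLE_RATE, 64), SAMPLE_RATE, CUTOFF_HZ)
high_ratio = high_band_energy_ratio(sine(6000, SAMPLE_RATE, 64), SAMPLE_RATE, CUTOFF_HZ)
print(f"1 kHz tone: {low_ratio:.3f} of energy above cutoff")
print(f"6 kHz tone: {high_ratio:.3f} of energy above cutoff")
```

A detector that depends on cues in the upper band has little left to work with once that band is filtered out, which is one plausible reading of why the phone-call condition is harder.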

Approach

The authors address the gap in audio deepfake detection by creating a comprehensive dataset called WaveFake, consisting of audio generated by various state-of-the-art TTS models. They evaluate two baseline detection models, a Gaussian Mixture Model (GMM) and RawNet2, using different feature representations (LFCC, MFCC) to establish initial performance benchmarks and analyze their generalization capabilities across different generative architectures and real-world scenarios.
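
The two feature representations named above differ mainly in how their filterbanks are spaced: LFCC places filters linearly in frequency, MFCC places them on the mel scale. The sketch below illustrates only that placement step, not a full feature extractor and not the authors' implementation. The filter count, the 11025 Hz upper limit, and the 4 kHz threshold are assumptions chosen for illustration.

```python
import math


def hz_to_mel(hz):
    """Convert Hz to mel using the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def linear_centers(n_filters, f_max):
    """Filter center frequencies spaced evenly in Hz (LFCC-style)."""
    step = f_max / (n_filters + 1)
    return [step * (i + 1) for i in range(n_filters)]


def mel_centers(n_filters, f_max):
    """Filter center frequencies spaced evenly in mel (MFCC-style)."""
    step = hz_to_mel(f_max) / (n_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_filters)]


def count_above(centers, threshold_hz):
    """Number of filters centered above threshold_hz."""
    return sum(1 for c in centers if c > threshold_hz)


N_FILTERS = 20   # assumption
F_MAX = 11025.0  # assumption: Nyquist frequency of 22.05 kHz audio

linear = linear_centers(N_FILTERS, F_MAX)
mel = mel_centers(N_FILTERS, F_MAX)
print("linear filters above 4 kHz:", count_above(linear, 4000.0))
print("mel filters above 4 kHz:   ", count_above(mel, 4000.0))
```

Linear spacing devotes more filters to the upper band than mel spacing does, so an LFCC front end retains finer detail in the region where the summary reports differences between real and generated audio.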

Datasets

WaveFake (novel dataset created by the authors), LJSpeech, JSUT, Common Voice

Model(s)

Gaussian Mixture Model (GMM), RawNet2 (CNN-GRU hybrid)
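
As a rough illustration of how a GMM-based detector can score an utterance, the sketch below assumes a common setup that the summary does not spell out: one diagonal-covariance GMM fitted to real speech, another fitted to generated speech, and a decision based on the mean per-frame log-likelihood ratio. All parameters and frames here are made-up toy values, not trained models or real features.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM, computed with log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, real_gmm, fake_gmm):
    """Mean per-frame log-likelihood ratio; positive values favour 'real'."""
    total = sum(log_gmm(f, *real_gmm) - log_gmm(f, *fake_gmm) for f in frames)
    return total / len(frames)


# Hypothetical toy models: (weights, means, variances), two components each.
real_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
fake_gmm = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])

# Hypothetical two-dimensional feature frames (stand-ins for LFCC or MFCC vectors).
frames_near_real = [[0.2, -0.1], [0.9, 1.1]]
frames_near_fake = [[4.1, 3.9], [5.2, 4.8]]

print("score, frames near real model:", llr_score(frames_near_real, real_gmm, fake_gmm))
print("score, frames near fake model:", llr_score(frames_near_fake, real_gmm, fake_gmm))
```

In practice the mixture parameters would be estimated from training data, for example with expectation-maximization, and the decision threshold would be tuned on held-out recordings.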

Author countries

Germany