Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

Authors: Anton Firc, Kamil Malinka, Petr Hanáček

Published: 2024-10-09 11:51:08+00:00

Comment: Presented at International Conference of the Biometrics Special Interest Group (BIOSIG 2024)

Journal Ref: 2024 International Conference of the Biometrics Special Interest Group (BIOSIG)

AI Summary

This paper introduces a novel deepfake speech dataset generated using diffusion models and evaluates its impact on current deepfake detection systems. The study compares diffusion-generated deepfakes with non-diffusion ones, assessing their quality and detectability. Findings suggest that the detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability across detector architectures.

Abstract

Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion dataset using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and their potential threat to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to non-diffusion deepfakes, with some variability based on detector architecture. Re-vocoding with diffusion vocoders shows minimal impact, and the overall speech quality is comparable to non-diffusion methods.

Key findings
The detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability depending on the detector architecture. Re-vocoding non-diffusion samples with diffusion vocoders had minimal impact on detection outcomes. While overall audio quality is comparable, diffusion-based methods often introduce more noise into the final recordings, as evidenced by lower signal-to-noise ratio (SNR) values.
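The SNR comparison can be illustrated with a simple estimator. The sketch below, in Python, assumes an energy-based noise-floor estimate in which the quietest frames of a recording approximate the noise; the paper does not specify which SNR estimator it used, and `estimate_snr_db`, the frame length, and the 10%/50% frame split are illustrative choices, not the authors' method.

```python
# Minimal energy-based SNR estimate (illustrative, not the paper's estimator).
# Treats the lowest-energy 10% of frames as noise and the top 50% as speech.
import numpy as np

def estimate_snr_db(signal: np.ndarray, frame_len: int = 512) -> float:
    """Estimate SNR in dB for a 1-D float waveform."""
    eps = 1e-12
    n_frames = max(1, len(signal) // frame_len)
    frames = signal[: n_frames * frame_len].reshape(n_frames, -1)
    # Per-frame energy, sorted ascending so quiet (noise) frames come first.
    energies = np.sort(np.mean(frames ** 2, axis=1))
    noise_power = np.mean(energies[: max(1, n_frames // 10)]) + eps
    speech_power = np.mean(energies[-max(1, n_frames // 2):])
    return float(10 * np.log10(max(speech_power - noise_power, eps) / noise_power))

# Hypothetical usage: snr = estimate_snr_db(waveform)
```

Under this convention, a recording with more residual synthesis noise yields a lower dB value, matching the finding that diffusion-based methods tend to produce noisier outputs.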
Approach
The authors build a deepfake speech dataset by synthesizing speech from the LJSpeech corpus with a range of diffusion-based and non-diffusion text-to-speech models and vocoders. They then evaluate the detectability of these deepfakes with three state-of-the-art deepfake speech detectors, comparing Equal Error Rates (EER), and assess speech quality using metrics such as word error rate (WER), PESQ, speaker similarity, and SNR.
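The EER is the operating point at which a detector's false acceptance rate (spoofed speech accepted) equals its false rejection rate (bona fide speech rejected). A minimal sketch of computing it from raw detector scores follows, assuming the common convention that higher scores indicate bona fide speech; `compute_eer` and the synthetic example scores are illustrative and not taken from the paper's evaluation code.

```python
# Minimal EER computation from detector scores, as used in
# ASVspoof-style evaluations (illustrative sketch).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Return the EER: the point where false acceptance and false
    rejection rates are equal."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    # EER lies where the FPR and FNR curves cross; take the closest point.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Example with hypothetical, well-separated score distributions:
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER: {eer:.2%}")
```

A lower EER means the detector separates bona fide from spoofed speech more cleanly; the paper's comparison of diffusion versus non-diffusion deepfakes rests on differences in this value across the three detectors.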
Datasets
LJSpeech dataset, ASVspoof2019 LA training set, Diffusion Deepfake Speech Dataset (created by the authors)
Model(s)
LFCC-LCNN, Wav2vec + GAT, IDSD (for detection); DiffGAN-TTS, DiffSpeech, ProDiff, Grad-TTS, WaveGrad2, WaveGrad, BDDM, DiffWave, Tacotron2-DCA, GlowTTS, FastPitch, VITS (for deepfake generation)
Author countries
Czech Republic