Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

Authors: Anton Firc, Kamil Malinka, Petr Hanáček

Published: 2024-10-09 11:51:08+00:00

Comment: Presented at International Conference of the Biometrics Special Interest Group (BIOSIG 2024)

Journal Ref: 2024 International Conference of the Biometrics Special Interest Group (BIOSIG)

AI Summary

This paper introduces a novel deepfake speech dataset generated using diffusion models and evaluates its impact on current deepfake detection systems. The study compares diffusion-generated deepfakes with non-diffusion ones, assessing their quality and detectability. Findings suggest that the detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability across detector architectures.

Abstract

Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion dataset using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and their potential threat to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to non-diffusion deepfakes, with some variability based on detector architecture. Re-vocoding with diffusion vocoders shows minimal impact, and the overall speech quality is comparable to non-diffusion methods.

Key findings
The detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability depending on the detector architecture. Re-vocoding non-diffusion samples with diffusion vocoders had minimal impact on detection outcomes. While overall audio quality is comparable, diffusion-based methods often introduce more noise into the final recordings, as evidenced by lower signal-to-noise ratio (SNR) values.
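The SNR comparison can be illustrated with a simple estimator. The sketch below, in Python, assumes an energy-based noise-floor estimate in which the quietest frames of a recording approximate the noise; the paper does not specify which SNR estimator it used, and `estimate_snr_db`, the frame length, and the 10%/50% frame split are illustrative choices, not the authors' method.

```python
# Minimal energy-based SNR estimate (illustrative, not the paper's estimator).
# Treats the lowest-energy 10% of frames as noise and the top 50% as speech.
import numpy as np

def estimate_snr_db(signal: np.ndarray, frame_len: int = 512) -> float:
    """Estimate SNR in dB for a 1-D float waveform."""
    eps = 1e-12
    n_frames = max(1, len(signal) // frame_len)
    frames = signal[: n_frames * frame_len].reshape(n_frames, -1)
    # Per-frame energy, sorted ascending so quiet (noise) frames come first.
    energies = np.sort(np.mean(frames ** 2, axis=1))
    noise_power = np.mean(energies[: max(1, n_frames // 10)]) + eps
    speech_power = np.mean(energies[-max(1, n_frames // 2):])
    return float(10 * np.log10(max(speech_power - noise_power, eps) / noise_power))

# Hypothetical usage: snr = estimate_snr_db(waveform)
```

Under this convention, a recording with more residual synthesis noise yields a lower dB value, matching the finding that diffusion-based methods tend to produce noisier outputs.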
Approach
The authors build a deepfake speech dataset by synthesizing speech from the LJSpeech corpus with a range of diffusion-based and non-diffusion text-to-speech models and vocoders. They then evaluate the detectability of these deepfakes with three state-of-the-art deepfake speech detectors, comparing Equal Error Rates (EER), and assess speech quality using metrics such as word error rate (WER), PESQ, speaker similarity, and SNR.
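The EER is the operating point at which a detector's false acceptance rate (spoofed speech accepted) equals its false rejection rate (bona fide speech rejected). A minimal sketch of computing it from raw detector scores follows, assuming the common convention that higher scores indicate bona fide speech; `compute_eer` and the synthetic example scores are illustrative and not taken from the paper's evaluation code.

```python
# Minimal EER computation from detector scores, as used in
# ASVspoof-style evaluations (illustrative sketch).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Return the EER: the point where false acceptance and false
    rejection rates are equal."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    # EER lies where the FPR and FNR curves cross; take the closest point.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Example with hypothetical, well-separated score distributions:
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER: {eer:.2%}")
```

A lower EER means the detector separates bona fide from spoofed speech more cleanly; the paper's comparison of diffusion versus non-diffusion deepfakes rests on differences in this value across the three detectors.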
Datasets
LJSpeech dataset, ASVspoof2019 LA training set, Diffusion Deepfake Speech Dataset (created by the authors)
Model(s)
LFCC-LCNN, Wav2vec + GAT, IDSD (for detection); DiffGAN-TTS, DiffSpeech, ProDiff, Grad-TTS, WaveGrad2, WaveGrad, BDDM, DiffWave, Tacotron2-DCA, GlowTTS, FastPitch, VITS (for deepfake generation)
Author countries
Czech Republic