All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

Authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani

Published: 2023-07-28 13:50:25+00:00

Comment: Accepted at ECML-PKDD 2023 Workshop "Deep Learning and Multimedia Forensics. Combating fake media and misinformation"

AI Summary

This paper proposes a deep learning-based system for synthetic speech detection that fuses three distinct feature sets: First Digit (FD), short-term long-term (STLT), and bicoherence features. The model leverages an end-to-end deep learning approach to integrate these features, achieving superior performance compared to state-of-the-art single-feature solutions. The system demonstrates robustness against anti-forensic attacks and strong generalization capabilities across various datasets.

Abstract

Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which call for the development of synthetic speech detection algorithms to counter possible malicious uses such as fraud or identity theft. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving better overall performance than state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.


Key findings
The proposed feature fusion model outperforms detectors built on a single feature set, reaching an AUC of 0.92 against 0.87 for the best single-feature detector. It generalizes well to unseen datasets (e.g., 90.7% balanced accuracy on LJSpeech) and is robust to MP3 compression. However, its performance degrades considerably under strong Gaussian noise injection (SNR=2).
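Balanced accuracy, the metric quoted for LJSpeech, is the mean of the recall on each class, so it is not inflated when real and synthetic samples are unevenly represented. The sketch below shows the computation for binary labels; the function name and the label convention (1 = synthetic, 0 = real) are assumptions, and the labels are made up for illustration, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed: 1 = synthetic, 0 = real)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # Recall on the synthetic class and recall on the real class, averaged.
    return 0.5 * (tp / pos + tn / neg)


# Made-up, imbalanced example: four synthetic clips and two real ones.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, plain_accuracy)
```

Here the synthetic class has recall 3/4 and the real class 1/2, so the balanced accuracy is lower than the plain accuracy of 4/6, which is dominated by the larger class.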
Approach
The system extracts three different feature sets (FD, STLT, and bicoherence) from the input audio signal. These features are then passed through separate Fully Connected (FC) neural networks to generate embeddings, which are subsequently concatenated. A final FC network takes this concatenated embedding as input to perform end-to-end synthetic speech detection.
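The data flow described above (one branch per feature set, concatenation of the embeddings, one final classifier) can be sketched as follows. This is a minimal, inference-only illustration in plain Python, not the authors' implementation: the class names, feature dimensions, embedding size, and single-layer branches are all assumptions, the weights are random and untrained, and the Dropout and BatchNorm1D layers listed under Model(s) are omitted for brevity.

```python
import math
import random


def leaky_relu(x, slope=0.01):
    """LeakyReLU activation applied to a single value."""
    return x if x > 0 else slope * x


class Linear:
    """A fully connected layer with randomly initialised (untrained) weights."""

    def __init__(self, n_in, n_out, rng):
        bound = 1.0 / math.sqrt(n_in)
        self.w = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


class FusionDetector:
    """Hypothetical fusion model: per-feature branches, concatenation, final FC."""

    def __init__(self, feature_dims, embed_dim=8, seed=0):
        rng = random.Random(seed)
        # One embedding branch per feature set (e.g. FD, STLT, bicoherence).
        self.branches = [Linear(d, embed_dim, rng) for d in feature_dims]
        # The final classifier sees all embeddings side by side.
        self.head = Linear(embed_dim * len(feature_dims), 1, rng)

    def embed(self, features):
        fused = []
        for branch, x in zip(self.branches, features):
            fused.extend(leaky_relu(v) for v in branch(x))
        return fused

    def __call__(self, features):
        logit = self.head(self.embed(features))[0]
        return 1.0 / (1.0 + math.exp(-logit))  # score in (0, 1)


# Assumed feature dimensions, chosen only for illustration.
dims = [4, 6, 3]
rng = random.Random(1)
features = [[rng.gauss(0.0, 1.0) for _ in range(d)] for d in dims]

model = FusionDetector(dims)
embedding = model.embed(features)
score = model(features)
print(len(embedding), score)
```

Because the branches and the final classifier sit in one computation graph, training them jointly lets each embedding adapt to what the other two already capture, which is the sense in which the fusion is end-to-end. A real implementation would typically use a deep learning framework and train the weights on labelled data.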
Datasets
ASVspoof 2019 (LA partition), LJSpeech, LibriSpeech (train-clean-100), Cloud2019, VidTIMIT
Model(s)
Fully Connected (FC) neural networks with LeakyReLU activation, Dropout, and BatchNorm1D layers for embedding extraction and final classification.
Author countries
Italy