Comparative Analysis of ASR Methods for Speech Deepfake Detection

Authors: Davide Salvi, Amit Kumar Singh Yadav, Kratika Bhagtani, Viola Negroni, Paolo Bestagini, Edward J. Delp

Published: 2024-11-26 11:51:10+00:00

Comment: Published at Asilomar Conference on Signals, Systems, and Computers 2024

AI Summary

This paper systematically analyzes the relationship between Automatic Speech Recognition (ASR) performance and speech deepfake detection capabilities. The authors adapt pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, as feature extractors for binary speech deepfake detection. They investigate whether improvements in ASR performance, obtained by moving to larger model versions, correlate with improved deepfake detection.

Abstract

Recent techniques for speech deepfake detection often rely on pre-trained self-supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved able to offer a meaningful representation of speech signals that can benefit various tasks, including deepfake detection. In this context, pre-trained models serve as feature extractors and are used to extract embeddings from input speech, which are then fed to a binary speech deepfake detector. The remarkable accuracy achieved through this approach suggests a potential relationship between ASR and speech deepfake detection. However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. We consider two different pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, and adapt them for the speech deepfake detection task. These models have been released in multiple versions, with an increasing number of parameters and enhanced ASR performance. We investigate whether performance improvements in ASR correlate with improvements in speech deepfake detection. Our results provide insights into the relationship between these two tasks and offer valuable guidance for the development of more effective speech deepfake detectors.


Key findings
While larger ASR models generally improve deepfake detection, this trend plateaus beyond a certain model size (e.g., Whisper medium often outperforms Whisper large). ASR performance, in particular a lower Word Error Rate (WER), does not always translate into higher deepfake detection accuracy. Moreover, the detection capabilities of the different model versions are less hierarchically related than expected: smaller models sometimes correctly classify tracks missed by larger ones.
Approach
The authors adapt pre-trained self-supervised ASR models (Whisper and Wav2Vec 2.0) to serve as frozen embedding extractors. The extracted speech embeddings are then fed into a separate, trainable deepfake classifier (consisting of fully connected layers) for binary classification of speech as authentic or synthetic.
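The frozen-extractor-plus-FC-head pipeline described above can be sketched as follows. This is an illustrative toy only: the actual pre-trained encoder (Whisper or Wav2Vec 2.0) is replaced by a fixed random projection as a stand-in, and the frame length, embedding size, synthetic data, and training loop are all assumptions for the sketch, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pre-trained ASR encoder (Whisper / Wav2Vec 2.0
# in the paper): a fixed random projection over signal frames, used here
# only to illustrate the shape of the pipeline. Its weights are never updated.
FRAME_LEN, EMB_DIM = 160, 64
W_frozen = rng.standard_normal((FRAME_LEN, EMB_DIM)) / np.sqrt(FRAME_LEN)

def extract_embedding(waveform: np.ndarray) -> np.ndarray:
    """Frame the signal, project each frame, mean-pool over time."""
    n = len(waveform) // FRAME_LEN
    frames = waveform[: n * FRAME_LEN].reshape(n, FRAME_LEN)
    return np.tanh(frames @ W_frozen).mean(axis=0)  # shape (EMB_DIM,)

# Toy data (hypothetical): "real" speech ~ N(0, 1), "fake" speech with a
# small constant offset so the two classes are separable in embedding space.
real = [rng.standard_normal(1600) for _ in range(40)]
fake = [rng.standard_normal(1600) + 0.5 for _ in range(40)]
X = np.stack([extract_embedding(x) for x in real + fake])
y = np.array([0] * 40 + [1] * 40)  # 0 = real, 1 = fake

# Trainable FC head (a single logistic layer), fit with gradient descent.
# Only these weights are updated -- the "encoder" above stays frozen.
w, b = np.zeros(EMB_DIM), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(fake)
    grad = p - y                            # logistic-loss gradient
    w -= 0.5 * (X.T @ grad) / len(y)
    b -= 0.5 * grad.mean()

train_acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

In the paper's actual setup the classifier is a deeper fully connected network and the embeddings come from the real pre-trained models; the point of the sketch is only the division of labor, with a frozen embedding extractor feeding a small trainable binary head.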
Datasets
ASVspoof 2019 (LA partition), ASVspoof 2021 (DF partition), AISEC “In-the-Wild”, TIMIT-TTS, LJspeech, FakeOrReal
Model(s)
Whisper (tiny, base, small, medium, large versions), Wav2Vec 2.0 (base, large, xls-r versions), Fully Connected (FC) networks as classifiers.
Author countries
Italy, USA