Exploring Green AI for Audio Deepfake Detection

Authors: Subhajit Saha, Md Sahidullah, Swagatam Das

Published: 2024-03-21 10:54:21+00:00

Comment: This manuscript is under review in a conference

AI Summary

This study introduces a novel 'Green AI' framework for audio deepfake detection, focusing on minimizing the carbon footprint by enabling CPU-only training. It leverages off-the-shelf pre-trained self-supervised learning (SSL) models for feature extraction without fine-tuning, combined with classical machine learning algorithms for the downstream detection task. The approach demonstrates competitive performance with significantly fewer trainable parameters compared to high-carbon footprint deep neural network methods.

Abstract

The state-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage is accompanied by a significant carbon footprint, mainly due to the use of high-performance computing with accelerators and long training times. Studies show that an average deep NLP model produces around 626k lbs of CO2, equivalent to five times the lifetime emissions of an average US car. This is certainly a massive threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources. Our proposed framework utilizes off-the-shelf self-supervised learning (SSL) models that are pre-trained and available in public repositories. In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we exploit classical machine learning algorithms such as logistic regression and shallow neural networks on the SSL embeddings extracted with the pre-trained model. Our approach shows competitive results compared to the commonly used high-carbon-footprint approaches. In experiments with the ASVspoof 2019 LA dataset, we achieve a 0.90% equal error rate (EER) with fewer than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance. GitHub: https://github.com/sahasubhajit/Speech-Spoofing-


Key findings
The framework achieved a competitive Equal Error Rate (EER) of 0.90% and an F1 score of 0.95 on the ASVspoof 2019 LA dataset, primarily using an SVM classifier with embeddings from the second transformer layer of wav2vec 2.0. This performance was achieved with fewer than 1K trainable model parameters, demonstrating a successful low-carbon footprint solution trainable on CPU resources. The study also found that embeddings from earlier intermediate layers of the SSL model can be more effective than those from the last layer.
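Since EER is the headline metric, a minimal sketch of how it is typically computed from detection scores may help; this is a standard ROC-based formulation (the specific helper name `compute_eer` and the toy scores are illustrative, not taken from the paper's code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal error rate: the operating point where the false positive
    rate equals the false negative rate (miss rate)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))  # threshold closest to FPR == FNR
    return (fpr[idx] + fnr[idx]) / 2

# Toy check: perfectly separated scores yield an EER of 0.
labels = np.array([0, 0, 0, 1, 1, 1])   # 0 = bona fide, 1 = spoof
scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
print(compute_eer(labels, scores))  # 0.0
```

In practice the scores would be the classifier's spoof probabilities (or decision-function values) on the evaluation set.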
Approach
The proposed framework utilizes pre-trained wav2vec 2.0 BASE self-supervised learning models to extract speech embeddings from raw audio. Instead of fine-tuning the SSL model or employing complex deep neural networks, it uses classical machine learning algorithms such as Logistic Regression, SVM, and shallow MLPs for the deepfake detection task. This design allows for training and inference on standard CPU resources, drastically reducing computational cost and trainable parameters.
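The downstream stage can be sketched as a standard scikit-learn pipeline. This is a minimal illustration, not the authors' implementation: the random vectors stand in for utterance-level embeddings that would, per the paper, come from an intermediate transformer layer of the frozen wav2vec 2.0 BASE model (768-dimensional outputs), and the class separation is synthetic:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
dim = 768  # wav2vec 2.0 BASE hidden size

# Stand-in utterance-level embeddings (real ones would be pooled over
# time from a frozen SSL model; the 0.5 mean shift is an artificial
# separation for this sketch).
bona_fide = rng.normal(loc=0.0, scale=1.0, size=(50, dim))
spoofed = rng.normal(loc=0.5, scale=1.0, size=(50, dim))
X = np.vstack([bona_fide, spoofed])
y = np.array([0] * 50 + [1] * 50)  # 0 = bona fide, 1 = spoof

# Classical, CPU-friendly downstream classifier: scale then SVM.
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X, y)
print(clf.score(X, y))
```

The same pipeline shape applies to the other classifiers listed below (logistic regression, KNN, shallow MLP); only the final estimator changes, which is what keeps the trainable parameter count so small.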
Datasets
ASVspoof 2019 LA subset, LibriSpeech (for pre-training of the SSL model)
Model(s)
wav2vec 2.0 BASE (for feature extraction), K-nearest neighbors (KNN), Logistic regression, Support vector machine (SVM), Naive Bayes, Decision tree, Multi-layer perceptron (MLP)
Author countries
India