Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

Authors: Faheem Ahmad, Ajan Ahmed, Masudul Imtiaz

Published: 2026-04-15 01:59:43+00:00

Comment: Accepted for Oral Presentation at The 35th IEEE Microelectronics Design and Test Symposium

AI Summary

This paper establishes an interpretable classical machine learning baseline for deepfake audio detection using the Fake-or-Real (FoR) dataset. It extracts prosodic, voice-quality, and spectral features from two-second audio clips and evaluates a range of classical classifiers. An RBF SVM performs best, and feature analysis highlights pitch variability and spectral richness as the key cues separating real from synthetic speech.

Abstract

Deep learning has enabled highly realistic synthetic speech, raising concerns about fraud, impersonation, and disinformation. Despite rapid progress in neural detectors, transparent baselines are needed to reveal which acoustic cues reliably separate real from synthetic speech. This paper presents an interpretable classical machine learning baseline for deepfake audio detection using the Fake-or-Real (FoR) dataset. We extract prosodic, voice-quality, and spectral features from two-second clips at 44.1 kHz (high-fidelity) and 16 kHz (telephone-quality) sampling rates. Statistical analysis (ANOVA, correlation heatmaps) identifies features that differ significantly between real and fake speech. We then train multiple classifiers -- Logistic Regression, LDA, QDA, Gaussian Naive Bayes, SVMs, and GMMs -- and evaluate performance using accuracy, ROC-AUC, EER, and DET curves. Pairwise McNemar's tests confirm statistically significant differences between models. The best model, an RBF SVM, achieves ~93% test accuracy and ~7% EER on both sampling rates, while linear models reach ~75% accuracy. Feature analysis reveals that pitch variability and spectral richness (spectral centroid, bandwidth) are key discriminative cues. These results provide a strong, interpretable baseline for future deepfake audio detectors.


Key findings
The RBF SVM significantly outperformed other models, achieving approximately 93% test accuracy and around 7% Equal Error Rate (EER) on both high-fidelity (44.1 kHz) and telephone-quality (16 kHz) audio. Pitch variability and spectral richness (e.g., spectral centroid, bandwidth) were identified as the most discriminative acoustic cues. Performance did not degrade on re-recorded, lower-bandwidth audio, indicating cue persistence under channel distortions.
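Since EER is the headline metric, here is a minimal sketch of how it can be computed from classifier scores with scikit-learn; the paper's own implementation is not shown, so the function and variable names below are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false positive rate
    equals the false negative rate (1 - TPR) on the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))  # closest crossing point
    return (fpr[idx] + fnr[idx]) / 2.0     # average the two rates
```

For an RBF SVM, `scores` would typically come from `SVC.decision_function` on the held-out test set.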
Approach
The authors extract hand-engineered acoustic features from two-second audio clips: prosodic descriptors (e.g., F0 mean, standard deviation, and range), voice-quality measures (jitter, shimmer), and spectral features (e.g., RMS energy, spectral centroid, spectral bandwidth). After ANOVA-based feature selection, these features are fed into classical classifiers such as SVMs, Logistic Regression, LDA, QDA, Gaussian Naive Bayes, and GMMs for deepfake detection.
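As a rough illustration of this feature-extraction step, the sketch below uses librosa to compute F0 statistics and the spectral descriptors named above. Jitter and shimmer are usually computed with Praat-style tooling (e.g., parselmouth) and are omitted here, and all parameter values are assumptions rather than the authors' settings:

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, duration=2.0):
    """Small prosodic + spectral feature vector for one clip.
    Jitter/shimmer (voice quality) are omitted; see note above."""
    y, _ = librosa.load(path, sr=sr, duration=duration)

    # Prosodic: F0 statistics from the pYIN tracker (NaN = unvoiced).
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    f0_stats = ([f0.mean(), f0.std(), f0.max() - f0.min()]
                if f0.size else [0.0, 0.0, 0.0])

    # Spectral: frame-level centroid, bandwidth, and RMS energy,
    # averaged over the clip.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    rms = librosa.feature.rms(y=y).mean()

    return np.array(f0_stats + [centroid, bandwidth, rms])
```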
Datasets
Fake-or-Real (FoR) dataset, specifically the for-2sec (44.1 kHz) and for-rerec (16 kHz) variants. The FoR dataset sources real speech from CMU Arctic, LJSpeech, and VoxForge.
Model(s)
Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Gaussian Naive Bayes (GNB), Support Vector Machines (SVMs) with linear and radial basis function (RBF) kernels, and Gaussian Mixture Models (GMMs).
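To make the evaluation pipeline concrete, here is a hedged sketch that trains two of the listed models after ANOVA (f_classif) feature selection and compares them with McNemar's test, as the abstract describes. The split, feature count k, and hyperparameters are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(X_train, y_train, X_test, y_test, k=10):
    """Fit a linear and an RBF-kernel model with ANOVA feature
    selection, then test whether their error patterns differ
    significantly (McNemar's exact test)."""
    models = {
        "logreg": LogisticRegression(max_iter=1000),
        "rbf_svm": SVC(kernel="rbf"),
    }
    preds = {}
    for name, clf in models.items():
        pipe = make_pipeline(StandardScaler(),
                             SelectKBest(f_classif, k=k), clf)
        pipe.fit(X_train, y_train)
        preds[name] = pipe.predict(X_test)
        print(name, "accuracy:", np.mean(preds[name] == y_test))

    # 2x2 table of paired correctness; off-diagonal disagreements
    # drive the test statistic.
    a = preds["logreg"] == y_test
    b = preds["rbf_svm"] == y_test
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    print("McNemar p-value:", mcnemar(table, exact=True).pvalue)
```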
Author countries
USA