Easy, Interpretable, Effective: openSMILE for voice deepfake detection

Authors: Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller

Published: 2024-08-28 13:14:18+00:00

AI Summary

This paper demonstrates that voice deepfake attacks in the ASVspoof5 dataset can be accurately detected using a small subset of simple, interpretable openSMILE features. A threshold classifier using these features achieves EERs as low as 0.8% for specific attacks, with an overall EER of 15.7 ± 6.0%. The study also reveals that feature generalization is primarily effective between attacks from similar Text-to-Speech architectures, suggesting unique TTS system 'fingerprints' are being identified.

Abstract

In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset -- a de facto standard in the field of voice authenticity and deepfake detection -- can be identified with surprising accuracy using a small subset of very simplistic features. These are derived from the openSMILE library, and are scalar-valued, easy to compute, and human interpretable. For example, attack A10's unvoiced segments have a mean length of 0.09 ± 0.02, while bona fide instances have a mean length of 0.18 ± 0.07. Using this feature alone, a threshold classifier achieves an Equal Error Rate (EER) of 10.3% for attack A10. Similarly, across all attacks, we achieve up to 0.8% EER, with an overall EER of 15.7 ± 6.0%. We explore the generalization capabilities of these features and find that some of them transfer effectively between attacks, primarily when the attacks originate from similar Text-to-Speech (TTS) architectures. This finding may indicate that voice anti-spoofing is, in part, a problem of identifying and remembering signatures or fingerprints of individual TTS systems. This helps to better understand anti-spoofing models and the challenges they face in real-world applications.


Key findings
Single, scalar-valued openSMILE features can achieve surprisingly accurate in-domain detection, with EERs as low as 0.8% for attack A14 and an average of 15.7 ± 6.0%. Generalization of these features between attacks is effective when the underlying Text-to-Speech (TTS) architectures are similar, but challenging across different architectures. When all features are combined via logistic regression, Wav2Vec2 generally outperforms openSMILE in both in-domain and out-of-domain settings, though individual openSMILE features are more interpretable and more predictive on their own.
Approach
The authors utilize the openSMILE library to extract a small subset of scalar-valued, human-interpretable features, primarily from the eGeMAPSv2 set. These features are then used with simple threshold classifiers or logistic regression models to distinguish between bona fide and spoofed audio, evaluating performance in both in-domain and out-of-domain scenarios.
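The single-feature threshold approach described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes lower feature values indicate spoofed audio (as with the unvoiced-segment length reported for attack A10), and it uses synthetic Gaussian scores loosely matching the statistics quoted in the abstract in place of actual openSMILE feature extraction.

```python
import numpy as np

def threshold_eer(bona: np.ndarray, spoof: np.ndarray) -> float:
    """Equal Error Rate (EER) of a single-feature threshold classifier.

    Assumes lower feature values indicate spoofed audio. Sweeps every
    observed value as a candidate threshold and returns the error rate
    at the point where the false-rejection rate (bona fide flagged as
    spoof) and false-acceptance rate (spoof accepted as bona fide) are
    closest.
    """
    candidates = np.sort(np.concatenate([bona, spoof]))
    best_gap, eer = np.inf, 0.5
    for t in candidates:
        fnr = float(np.mean(bona < t))    # bona fide rejected as spoof
        fpr = float(np.mean(spoof >= t))  # spoof accepted as bona fide
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer

# Synthetic feature values loosely mimicking the abstract's numbers:
# bona fide mean unvoiced-segment length 0.18 ± 0.07, attack A10 0.09 ± 0.02.
rng = np.random.default_rng(0)
bona = rng.normal(0.18, 0.07, 2000)
spoof = rng.normal(0.09, 0.02, 2000)
eer = threshold_eer(bona, spoof)
print(f"EER: {eer:.1%}")
```

With real data, the feature values would come from the eGeMAPSv2 functionals rather than a random generator; the thresholding and EER computation are unchanged.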
Datasets
ASVspoof5
Model(s)
Threshold classifier, Linear classification model, Logistic regression classifier (applied to openSMILE features), Wav2Vec2 (for comparison)
Author countries
Romania, Germany