SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

Published: 2024-07-26 05:23:41+00:00

AI Summary

This paper introduces SLIM (Style-LInguistics Mismatch), a novel model for generalized audio deepfake detection that addresses generalization and interpretability challenges. SLIM learns the style-linguistics dependency from only real speech samples via self-supervised pretraining. It then uses these learned dependency features, complemented by standard acoustic features, to classify real versus fake speech, yielding superior out-of-domain performance and providing explainable decisions by quantifying the mismatch.

Abstract

Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized by generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly exploits the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used together with standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. With the feature encoders frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.


Key findings
SLIM significantly outperforms benchmark methods on out-of-domain datasets (In-the-wild and MLAAD-EN), achieving EERs of 12.9% and 13.5%, respectively, while remaining competitive on in-domain data. The results show that deepfake samples exhibit a higher style-linguistics mismatch than real speech, which SLIM leverages for both detection and interpretability. The learned dependency features complement the original subspace representations and facilitate explanation of model decisions.
Approach
SLIM is a two-stage framework. Stage 1 employs self-supervised pretraining on only real speech samples to learn the style-linguistics dependency, minimizing the cross-subspace distance between compressed style and linguistics embeddings while reducing intra-subspace redundancy within each. Stage 2 then uses these learned dependency features, combined with the original style and linguistics embeddings from frozen SSL encoders, to train a lightweight classification head for binary real/fake detection.
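The Stage 1 objective described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact loss: it assumes a cosine-distance term for cross-subspace alignment and a Barlow-Twins-style off-diagonal correlation penalty for intra-subspace redundancy; the weighting `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def slim_stage1_loss(style_z, ling_z, lam=0.005):
    """Illustrative Stage-1 objective (assumed form, not the paper's exact loss).

    style_z, ling_z: (N, D) compressed style / linguistics embeddings for
    N real utterances. Pulls paired embeddings together across subspaces
    while decorrelating feature dimensions within each subspace.
    """
    # Cross-subspace distance: mean cosine distance between paired embeddings
    num = (style_z * ling_z).sum(axis=1)
    den = np.linalg.norm(style_z, axis=1) * np.linalg.norm(ling_z, axis=1) + 1e-8
    dist = 1.0 - (num / den).mean()

    def redundancy(z):
        # Standardize each dimension, then penalize off-diagonal correlations
        z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-6)
        c = z.T @ z / z.shape[0]                  # (D, D) correlation matrix
        off_diag = c - np.diag(np.diag(c))
        return (off_diag ** 2).sum()

    return dist + lam * (redundancy(style_z) + redundancy(ling_z))
```

Trained only on real speech, a loss of this shape makes matched style/linguistics pairs score low, so a fake sample with mismatched style and content yields a visibly larger cross-subspace distance at inference time.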
Datasets
Common Voice, RAVDESS, ASVspoof2019 LA (training, development, test sets), ASVspoof2021 DF (test set), In-the-wild, MLAAD v3 (English subset).
Model(s)
SLIM (Style-LInguistics Mismatch Model); Wav2vec-XLSR (one encoder fine-tuned for speech emotion recognition to extract style features, another fine-tuned for automatic speech recognition to extract linguistic features); Attentive Statistics Pooling (ASP); Multi-Layer Perceptron (MLP); bottleneck layers.
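Attentive Statistics Pooling, listed among the model components above, aggregates frame-level features into a single utterance-level vector by concatenating an attention-weighted mean and standard deviation. A minimal sketch of the standard ASP operation (parameter shapes `w`, `b`, `v` are assumptions for illustration, not the paper's configuration):

```python
import numpy as np

def attentive_stats_pooling(frames, w, b, v):
    """Minimal Attentive Statistics Pooling (ASP) sketch.

    frames: (T, D) frame-level features for one utterance.
    w: (D, H), b: (H,), v: (H,) attention parameters (hypothetical shapes).
    Returns a (2*D,) utterance-level embedding: [weighted mean; weighted std].
    """
    h = np.tanh(frames @ w + b)                   # (T, H) attention hidden layer
    scores = h @ v                                # (T,) per-frame scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # softmax attention weights
    mu = (alpha[:, None] * frames).sum(axis=0)    # weighted mean over frames
    var = (alpha[:, None] * (frames - mu) ** 2).sum(axis=0)
    sigma = np.sqrt(np.clip(var, 1e-8, None))     # weighted std, numerically safe
    return np.concatenate([mu, sigma])
```

The weighted standard deviation lets the classifier see not just the average frame content but how much it varies across the utterance, which is why ASP is common in speaker and spoofing front-ends.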
Author countries
UNKNOWN