Generalizable Audio Spoofing Detection using Non-Semantic Representations

Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller

Published: 2025-08-29 18:37:57+00:00

AI Summary

This research introduces a novel audio deepfake detection method using non-semantic universal audio representations from TRILL and TRILLsson models. The approach achieves comparable in-domain performance to state-of-the-art methods but significantly surpasses them in out-of-domain generalization, especially on real-world datasets.

Abstract

Rapid advancements in generative modeling have made synthetic audio generation easy, leaving speech-based services vulnerable to spoofing attacks. Consequently, robust countermeasures are needed more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability, and they fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection that leverages non-semantic universal audio representations. Extensive experiments were performed to find suitable non-semantic features using the TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.


Key findings

The proposed method using non-semantic features shows comparable in-domain performance to state-of-the-art methods. Importantly, it significantly outperforms these methods on out-of-domain datasets, demonstrating superior generalization capabilities, particularly on real-world, noisy data. Ablation studies confirm the benefit of non-semantic features over semantic features for generalization.

Approach

The method extracts non-semantic audio features using pre-trained TRILL and TRILLsson models. These features are processed through a convolutional block, LSTM layers, multi-head attention pooling, and an MLP to classify audio as genuine or spoofed. The model is trained using cross-entropy loss with class weights.
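
The training objective mentioned above, cross-entropy with class weights, can be illustrated with a minimal standard-library sketch. The summary does not say how the class weights are chosen, so the inverse-frequency scheme below is an assumption, as are the function names, the toy label counts, and the choice to normalize by the sum of the sample weights. It is not the authors' implementation.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: weight = N / (K * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    Each sample's loss is scaled by the weight of its true class, and the
    total is divided by the sum of those weights.
    """
    total, weight_sum = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = weights[y]
        total += -w * log_softmax(logits)[y]
        weight_sum += w
    return total / weight_sum


# Hypothetical toy data: class 0 = genuine (rare), class 1 = spoofed (common).
train_labels = [0] * 10 + [1] * 90
weights = inverse_frequency_weights(train_labels)

# Classifier outputs (logits) for three utterances and their true labels.
logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
targets = [0, 1, 0]
loss = weighted_cross_entropy(logits, targets, weights)
```

With these toy counts the rare class receives nine times the weight of the common one, so an error on a genuine utterance contributes more to the loss than an error on a spoofed one. That is the usual motivation for class weighting on imbalanced spoofing corpora.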

Datasets

ASVspoof 2019 Logical Access (LA19), ASVspoof 2021 Logical Access (LA21), ASVspoof 2021 Deepfake (DF21), In the Wild (ItW)

Model(s)

TRILL, TRILLsson (variants 1-4), 1D Convolutional Neural Network, LSTM, Multi-Head Attention, Multilayer Perceptron (MLP)

Author countries

Germany