Generalizable Audio Spoofing Detection using Non-Semantic Representations

Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller

Published: 2025-08-29 18:37:57+00:00

Journal Ref: Proc. Interspeech 2025, 4553-4557

AI Summary

This study proposes a novel method for generalizable audio spoofing detection by leveraging non-semantic universal audio representations extracted using TRILL and TRILLsson models. The approach demonstrates comparable performance on in-domain test sets while significantly outperforming state-of-the-art methods on out-of-domain and public-domain test sets, highlighting its superior generalization capabilities.

Abstract

Rapid advancements in generative modeling have made synthetic audio generation easy, leaving speech-based services vulnerable to spoofing attacks. Consequently, robust countermeasures are needed more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability, and they fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection that leverages non-semantic universal audio representations. Extensive experiments were performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.


Key findings
The proposed method achieved an Equal Error Rate (EER) comparable to state-of-the-art models on the in-domain test set (LA19). It significantly outperformed existing methods on the out-of-domain datasets (LA21, DF21) and on real-world public-domain data (ItW), where the MT1 model achieved the best EER of 20.08%. An ablation study confirmed that non-semantic features generalize better than semantic features in spoofing detection.
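
Because every result above is reported as an EER, a short illustration of the metric may help. The sketch below is not the paper's evaluation code. It assumes that a higher score means "more likely bonafide", and the function name `compute_eer` and the toy scores are hypothetical. It sweeps every observed score as a threshold and returns the operating point where the false acceptance and false rejection rates are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bonafide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bonafide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

On real score files, evaluation toolkits usually interpolate between thresholds; this sketch simply picks the nearest observed one.
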
Approach
The method splits the input audio into chunks and extracts non-semantic representations from them using pre-trained TRILL or TRILLsson models. These features are then fed into a detector backend consisting of a convolutional block, an optional delta step, LSTM layers, Multi-Head Attention (MHA) pooling, and an MLP block that classifies the audio as bonafide or fake.
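
The chunking step can be illustrated with a minimal sketch. The summary does not state the chunk length, the hop size, or how a short final chunk is handled, so `chunk_len`, `hop_len`, and the zero padding below are assumptions, and `chunk_waveform` is a hypothetical helper rather than part of the TRILL or TRILLsson tooling. Each resulting chunk would then be passed to the pre-trained feature extractor.

```python
def chunk_waveform(samples, chunk_len, hop_len, pad_value=0.0):
    """Split a 1-D sequence of samples into fixed-length chunks.

    Assumption: the last chunk is right-padded with pad_value so that
    every chunk has exactly chunk_len samples.
    """
    if chunk_len <= 0 or hop_len <= 0:
        raise ValueError("chunk_len and hop_len must be positive")
    chunks = []
    start = 0
    while start < len(samples):
        chunk = list(samples[start:start + chunk_len])
        # Pad the tail chunk up to the fixed length.
        chunk.extend([pad_value] * (chunk_len - len(chunk)))
        chunks.append(chunk)
        # Stop once the current chunk reaches the end of the signal.
        if start + chunk_len >= len(samples):
            break
        start += hop_len
    return chunks


# Ten toy samples, non-overlapping chunks of four samples each.
chunks = chunk_waveform(list(range(10)), chunk_len=4, hop_len=4)
print(len(chunks))   # 3
print(chunks[-1])    # [8, 9, 0.0, 0.0]
```

Setting `hop_len` smaller than `chunk_len` yields overlapping chunks; whether the paper uses overlap is not stated here.
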
Datasets
ASVspoof 2019 Logical Access (LA19), ASVspoof 2021 LA (LA21), ASVspoof 2021 DF (DF21), In the Wild (ItW)
Model(s)
TRILL and TRILLsson (variants 1, 2, 3, 4) for feature extraction, combined with a custom detection backend comprising a convolutional block, LSTM layers, Multi-Head Attention (MHA) pooling, and an MLP block.
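
To give a feel for what attention pooling does in such a backend, the sketch below reduces a sequence of frame vectors to a single utterance-level vector. It is a single-head simplification, not the multi-head layer used in the paper. The function `attention_pool` and the scoring vector (which would be a learned parameter in a real model) are hypothetical.

```python
import math


def attention_pool(frames, score_vector):
    """Pool T frame vectors of size D into one vector of size D.

    Single-head simplification: each frame gets a scalar score (dot
    product with score_vector), scores are softmax-normalized, and the
    output is the weighted sum of the frames.
    """
    scores = [sum(x * w for x, w in zip(frame, score_vector)) for frame in frames]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# With an all-zero scoring vector every frame gets equal weight,
# so the pooling reduces to a plain average.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(pooled)   # [0.5, 0.5]
print(weights)  # [0.5, 0.5]
```

A multi-head version would run several such scoring vectors in parallel and concatenate or project the pooled outputs before the MLP block.
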
Author countries
Germany