Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

Authors: Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen

Published: 2025-09-29 12:14:58+00:00

AI Summary

This work proposes a novel zero-shot framework for open-set speech deepfake source tracing, adapting the SSL-AASIST system with AAM loss to extract better attack embeddings. It systematically compares zero-shot (cosine, Siamese) and few-shot (MLP, Siamese) backend scoring methods for attributing synthesized speech to its generative source. Experiments on the STOPA dataset suggest that few-shot methods perform better in closed-set trials, while zero-shot cosine scoring generalizes best in the harder open-set scenario.

Abstract

We propose a novel zero-shot source tracing framework inspired by advances in speaker verification. Specifically, we adapt the SSL-AASIST system for attack classification, ensuring that the attacks used for training are disjoint from those used to form fingerprint-trial pairs. For backend scoring in attack verification, we explore both zero-shot approaches (cosine similarity and Siamese) and few-shot approaches (MLP and Siamese). Experiments on our recently introduced STOPA dataset suggest that few-shot learning provides advantages in the closed-set scenario, while zero-shot approaches perform better in the open-set scenario. In closed-set trials, few-shot Siamese and MLP achieve equal error rates (EER) of 18.44% and 15.11%, compared to 27.14% for zero-shot cosine scoring. Conversely, in open-set trials, zero-shot cosine scoring reaches 21.70%, outperforming few-shot Siamese and MLP at 27.40% and 22.65%, respectively.
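
The EER figures quoted above are obtained by thresholding verification scores. As a minimal sketch of how such a number can be computed (this is not the authors' evaluation code), the function below sweeps a threshold over the observed scores and returns the point where the false-acceptance and false-rejection rates are closest. The name `compute_eer` and the toy scores are illustrative assumptions.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Return (EER as a fraction, threshold) for two sets of trial scores.

    target_scores: scores of trials where fingerprint and trial share an attack.
    nontarget_scores: scores of trials where they come from different attacks.
    A trial is accepted when its score is >= the threshold.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above it.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical, for illustration only).
toy_target = [0.9, 0.8, 0.7, 0.3]
toy_nontarget = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(toy_target, toy_nontarget)
print(f"EER = {eer:.2%}")  # EER is 0.25 for these toy scores
```

A lower EER means the two score distributions overlap less, which is why the abstract reports it separately for closed-set and open-set trials.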


Key findings
Few-shot methods perform best in closed-set tracing: the MLP classifier reaches an EER of 15.11% and the few-shot Siamese network 18.44%, against 27.14% for zero-shot cosine scoring. In the harder open-set scenario the ranking reverses: zero-shot cosine scoring is the most robust at 21.70% EER, ahead of few-shot MLP (22.65%) and few-shot Siamese (27.40%). Incorporating the self-supervised front end (SSL-AASIST) and AAM loss consistently improves the generalization of the attack embeddings.
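
For readers unfamiliar with the AAM loss named in the key findings, the sketch below shows the standard additive angular margin softmax in NumPy: the angle between an embedding and its own class weight is enlarged by a margin before the scaled softmax. It illustrates the general technique only, not the authors' training code; the function name and the `scale` and `margin` values are assumptions, and the handling of angles beyond pi that practical implementations include is omitted.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.2):
    """Mean additive-angular-margin softmax loss for a batch.

    embeddings: (batch, dim) array; class_weights: (classes, dim) array;
    labels: integer class index per example. scale and margin are
    illustrative defaults, not values taken from the paper.
    """
    emb = np.asarray(embeddings, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Length-normalize so that dot products are cosines of angles.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)          # (batch, classes)
    rows = np.arange(len(labels))
    logits = scale * cos
    # Add the margin to the target-class angle only.
    target_angle = np.arccos(cos[rows, labels])
    logits[rows, labels] = scale * np.cos(target_angle + margin)
    # Numerically stable log-softmax followed by cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())

# Toy batch: two embeddings, two attack classes (hypothetical values).
toy_emb = np.array([[1.0, 0.1], [0.1, 1.0]])
toy_w = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_labels = [0, 1]
print(aam_softmax_loss(toy_emb, toy_w, toy_labels))
```

Because the margin lowers the target-class logit, the same batch incurs a higher loss than with plain softmax, which pushes embeddings of the same attack closer together in angle.
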
Approach
The framework uses an SSL-AASIST architecture trained with AAM-softmax loss to generate discriminative attack embeddings (fingerprints) from raw waveforms. The attacks used for training are disjoint from those used to form fingerprint-trial pairs. Backend scoring follows a verification approach in which trial embeddings are compared against enrolled attack fingerprints using either zero-shot methods (cosine similarity or a Siamese network) or few-shot classifiers (an MLP or a Siamese network trained on a small subset of fingerprint data).
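
The zero-shot cosine backend described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are taken as given (here, toy vectors standing in for SSL-AASIST outputs), and averaging length-normalized enrollment embeddings into one fingerprint is a convention borrowed from speaker verification rather than a detail confirmed by the source. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def enroll_fingerprint(enrollment_embeddings):
    """Assumption: a fingerprint is the normalized mean of the
    length-normalized embeddings of utterances from one attack."""
    return l2_normalize(l2_normalize(enrollment_embeddings).mean(axis=0))

def cosine_score(fingerprint, trial_embedding):
    """Cosine similarity between an attack fingerprint and a trial embedding."""
    return float(np.dot(l2_normalize(fingerprint), l2_normalize(trial_embedding)))

# Toy 3-dimensional embeddings (hypothetical values).
enroll_attack_a = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
fingerprint_a = enroll_fingerprint(enroll_attack_a)

trial_same_attack = np.array([1.0, 0.0, 0.05])
trial_other_attack = np.array([0.0, 1.0, 0.0])

# A higher score means the trial is more likely to come from attack A.
print(cosine_score(fingerprint_a, trial_same_attack))
print(cosine_score(fingerprint_a, trial_other_attack))
```

No parameters are fitted on the fingerprint data here, which is what makes this backend zero-shot: a new attack only needs a few enrollment embeddings to become traceable.
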
Datasets
STOPA, ASVspoof 2019 LA
Model(s)
SSL-AASIST, AASIST, MLP, Siamese Network
Author countries
Finland