Explaining Speaker and Spoof Embeddings via Probing

Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen

Published: 2024-12-24 05:56:49+00:00

Comment: To appear in IEEE ICASSP 2025

AI Summary

This study investigates the explainability of 'spoof embeddings' produced by deep neural network-based audio spoofing detection systems, contrasting them with speaker embeddings. It uses probing analysis with simple neural classifiers to determine how well these embeddings capture speaker-related (metadata and acoustic) information. The research demonstrates that spoof embeddings preserve certain key traits, such as gender, speaking rate, F0, and duration, and that the detector appears to leverage this information to keep spoof detection robust and gender-invariant.

Abstract

This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.


Key findings
Spoof embeddings moderately preserve speaker-related traits such as gender, speaking rate, F0, and duration, whereas ASV embeddings retain richer speaker-related information. Probing analysis indicates that spoofing detectors use preserved traits, such as gender, to achieve gender-invariant spoof detection. Speed-perturbation experiments further confirm that CM models leverage speaking rate and duration to remain robust to such variations.
Approach
The authors employ a probing analysis approach where simple Multi-Layer Perceptron (MLP) classifiers are trained on either speaker or spoof embeddings. These MLPs are tasked with predicting various speaker-related attributes, categorized as metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). The performance of these probing classifiers indicates the extent to which the embeddings preserve specific information.
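As a rough illustration of this probing setup, the sketch below trains a small one-hidden-layer MLP on frozen embeddings to predict a binary speaker attribute (e.g., gender) and reports held-out accuracy. This is a minimal NumPy reconstruction under stated assumptions, not the authors' implementation: the embeddings and labels here are synthetic stand-ins (in the paper they would come from ECAPA-TDNN or AASIST with ASVspoof 2019 LA metadata), and the probe architecture and training details are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_embeddings(n=400, dim=192):
    """Synthetic stand-in 'embeddings' in which a binary attribute (e.g.
    gender) is encoded along one random direction; real probing would use
    frozen ASV or CM embeddings with metadata labels instead."""
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    X = rng.normal(size=(n, dim)) + 1.5 * np.outer(y - 0.5, direction)
    return X, y

def train_probe(X, y, hidden=64, lr=0.05, epochs=300):
    """One-hidden-layer MLP probe (ReLU + sigmoid) trained with
    full-batch gradient descent on binary cross-entropy."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, size=(d, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.1, size=hidden); b2 = 0.0
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)            # hidden activations
        p = 1 / (1 + np.exp(-(h @ w2 + b2)))      # predicted probability
        g = (p - y) / n                           # dLoss/dlogit for BCE
        gh = np.outer(g, w2) * (h > 0)            # backprop through ReLU
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
        w2 -= lr * (h.T @ g);  b2 -= lr * g.sum()
    return W1, b1, w2, b2

def probe_accuracy(params, X, y):
    W1, b1, w2, b2 = params
    h = np.maximum(0, X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ w2 + b2)))
    return float(((p > 0.5) == y).mean())

X, y = make_synthetic_embeddings()
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]
params = train_probe(X_tr, y_tr)
acc = probe_accuracy(params, X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates that the attribute is recoverable from the embedding (i.e., preserved), while near-chance accuracy suggests the embedding has discarded it; this is the comparison the paper draws between speaker and spoof embeddings across traits.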
Datasets
ASVspoof 2019 LA, CSTR VCTK corpus (VCTK)
Model(s)
ECAPA-TDNN (for ASV embeddings), AASIST (for CM/spoof embeddings), Multi-Layer Perceptron (MLP) for probing classifiers
Author countries
Japan, India, Finland