Explaining Speaker and Spoof Embeddings via Probing
Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen
Published: 2024-12-24 05:56:49+00:00
Comment: To appear in IEEE ICASSP 2025
AI Summary
This study investigates the explainability of 'spoof embeddings' from deep neural network-based audio spoofing detection systems, contrasting them with speaker embeddings. It uses probing analysis with simple neural classifiers to determine how well these embeddings capture speaker-related (metadata and acoustic) information. The results show that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration, suggesting that the detector retains this information to keep its decisions robust across such variation.
Abstract
This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.
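The probing setup described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses synthetic random vectors as a stand-in for speaker/spoof embeddings (the paper extracts them from pretrained systems on ASVspoof 2019 LA), a synthetic binary attribute such as gender encoded along one embedding dimension, and a logistic-regression probe, the simplest form of the "simple neural classifiers" the authors train. Above-chance probe accuracy is the signal that the attribute is recoverable from the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: a 256-dim Gaussian cloud in which a
# binary attribute (e.g. gender) is partially encoded along one axis.
# Real speaker/spoof embeddings would come from pretrained systems.
n_samples, dim = 2000, 256
X = rng.normal(size=(n_samples, dim))
labels = (X[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(float)

def train_probe(X, y, epochs=500, lr=1.0):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = (p - y) / len(y)                 # gradient of the BCE loss
        w -= lr * X.T @ grad
        b -= lr * grad.sum()
    return w, b

# Train on 80% of the data, measure probe accuracy on the held-out 20%.
split = int(0.8 * n_samples)
w, b = train_probe(X[:split], labels[:split])
pred = (X[split:] @ w + b) > 0
acc = (pred == labels[split:].astype(bool)).mean()
print(f"probe accuracy: {acc:.2f}")  # well above the 0.5 chance level
```

If the embedding did not encode the attribute at all, the probe would hover near chance (0.5); in the paper's experiments, high probe accuracy on traits such as gender, speaking rate, F0, and duration is what indicates the spoof embeddings preserve them.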