If It's Good Enough for You, It's Good Enough for Me: Transferability of Audio Sufficiencies across Models

Authors: David A. Kelly, Hana Chockler

Published: 2026-04-03 10:08:53+00:00

AI Summary

This paper introduces transferability analysis to investigate the information processing characteristics of different audio classification models: it asks whether a minimal sufficient signal for a classification on one model receives the same classification from other models. The study applies this analysis to music genre, emotion recognition, and deepfake detection tasks, revealing task-dependent transferability rates and identifying 'flat-earther' models with distinct transferability behavior.

Abstract

In order to gain fresh insights about the information processing characteristics of different audio classification models, we propose transferability analysis. Given a minimal, sufficient signal for a classification on a model $f$, transferability analysis asks whether other models accept this minimal signal as having the same classification as it did on $f$. We define what it means for a sufficient signal to be transferable and perform a large study over $3$ different classification tasks: music genre, emotion recognition and deepfake detection. We find that transferability rates vary depending on the task, with sufficient signals for music genre being transferable $\approx 26\%$ of the time. The other tasks show much higher variance in transferability and reveal that some models, in particular on deepfake detection, have different transferability behavior. We call these 'flat-earther' models. We investigate deepfake audio in more depth, and show that transferability analysis also allows us to discover information-theoretic differences between the models which are not captured by the more familiar metrics of accuracy and precision.
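The core check is simple to state in code. The sketch below is illustrative only: the `classify` callables standing in for the paper's fine-tuned models, and the signal representation, are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the transferability check described in the abstract.
# A sufficient signal extracted on a source model f "transfers" to a target
# model g if g assigns it the same label that f did. All names here are
# hypothetical stand-ins, not the paper's pipeline.
from typing import Callable, Sequence

def transfer_rate(
    sufficient_signals: Sequence[tuple],   # pairs of (signal, label assigned by f)
    target_models: Sequence[Callable],     # each maps a signal to a label
) -> float:
    """Fraction of (signal, target model) pairs where the minimal
    sufficient signal keeps its original classification."""
    trials = hits = 0
    for signal, label_on_f in sufficient_signals:
        for g in target_models:
            trials += 1
            if g(signal) == label_on_f:
                hits += 1
    return hits / trials if trials else 0.0
```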


Key findings
Transferability rates vary significantly by task: sufficient signals for music genre are transferable ~26% of the time, while deepfake detection shows much higher variance. The study identifies 'flat-earther' models (e.g., VC5, SP1) whose sufficient signals are rarely accepted by other models, and which in turn rarely accept other models' sufficient signals, despite comparable accuracy. In deepfake detection, 'real' and 'fake' signals exhibit distinct information-theoretic characteristics (spectral entropy, power spectral density), which different models leverage differently.
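The two statistics named above are standard; a minimal sketch of computing them with SciPy follows, assuming 16 kHz mono audio as a NumPy array (the paper's exact estimator settings are not given here).

```python
# Illustrative computation of power spectral density (Welch's method) and
# the spectral entropy derived from it; parameter choices are assumptions.
import numpy as np
from scipy.signal import welch

def spectral_entropy(audio: np.ndarray, sr: int = 16_000) -> float:
    """Shannon entropy (bits) of the normalized power spectral density."""
    freqs, psd = welch(audio, fs=sr, nperseg=1024)
    p = psd / psd.sum()          # normalize the PSD into a distribution
    p = p[p > 0]                 # drop zero bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())
```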
Approach
The authors propose transferability analysis, which uses 'freqrex' to identify minimal sufficient signals (subsets of frequencies required for a classification), with sufficiency and completeness defined via actual causality. They then test whether these signals retain their original classification and confidence levels on other audio classification models.
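'freqrex' itself is not shown in this summary; the toy sketch below only illustrates the underlying notion of a minimal sufficient frequency subset, via a greedy search that drops frequency bands as long as the source model's classification survives. The `classify` callable and the band granularity are assumptions.

```python
# Toy illustration of extracting a (locally) minimal sufficient signal in the
# frequency domain: greedily remove frequency bands; keep a band only if
# removing it changes the source model's label. Not the freqrex algorithm.
import numpy as np

def greedy_sufficient_signal(audio: np.ndarray, classify, n_bands: int = 64):
    spectrum = np.fft.rfft(audio)
    bands = np.array_split(np.arange(spectrum.size), n_bands)
    keep = [True] * n_bands
    target = classify(audio)                 # label the full signal receives

    def rebuild() -> np.ndarray:
        masked = np.zeros_like(spectrum)
        for flag, idx in zip(keep, bands):
            if flag:
                masked[idx] = spectrum[idx]
        return np.fft.irfft(masked, n=audio.size)

    for i in range(n_bands):                 # single greedy pass over bands
        keep[i] = False
        if classify(rebuild()) != target:    # band was necessary: restore it
            keep[i] = True
    return rebuild()
```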
Datasets
RAVDESS (for voice emotion), GTZAN (for music genre), ASVspoof2019 (for deepfake detection), and "In The Wild" (ITW) (for deepfake detection).
Model(s)
A total of 13 fine-tuned audio deep learning models are used across the tasks, built on various large baseline audio model backbones. Backbone types mentioned include Wav2Vec2, HuBERT, DistHubt, and Whisp (for voice emotion and spoof detection), and DistHubt, Whisp, and Trans (for music genre).
Author countries
UK