How Does Instrumental Music Help SingFake Detection?

Authors: Xuanjun Chen, Chia-Yu Hu, I-Ming Lin, Yi-Cheng Lin, I-Hsiang Chiu, You Zhang, Sung-Feng Huang, Yi-Hsuan Yang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2025-09-18 07:08:33+00:00

AI Summary

This paper investigates how instrumental music affects singing voice deepfake (SingFake) detection. It finds that instrumental accompaniment acts primarily as data augmentation rather than providing intrinsic cues (e.g., rhythm or harmony), and that fine-tuning increases reliance on shallow speaker features while reducing sensitivity to content, paralinguistic, and semantic information.

Abstract

Although many models exist to detect singing voice deepfakes (SingFake), how these models operate, particularly with instrumental accompaniment, is unclear. We investigate how instrumental music affects SingFake detection from two perspectives. To investigate the behavioral effect, we test different backbones, unpaired instrumental tracks, and frequency subbands. To analyze the representational effect, we probe how fine-tuning alters encoders' speech and music capabilities. Our results show that instrumental accompaniment acts mainly as data augmentation rather than providing intrinsic cues (e.g., rhythm or harmony). Furthermore, fine-tuning increases reliance on shallow speaker features while reducing sensitivity to content, paralinguistic, and semantic information. These insights clarify how models exploit vocal versus instrumental cues and can inform the design of more interpretable and robust SingFake detection systems.


Key findings
Instrumental music functions mainly as data augmentation and does not contribute significant intrinsic cues for SingFake detection. Detection models rely heavily on low-frequency vocal information. Fine-tuning increases sensitivity to shallow speaker-specific features while diminishing sensitivity to deeper content, paralinguistic, and semantic information.
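The data-augmentation reading of the first finding can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's pipeline: it mixes a vocal clip with an unpaired instrumental track at a random signal-to-noise ratio, the way noise or music corpora such as MUSAN are commonly used for augmentation. The file names and SNR range are hypothetical, and mono audio is assumed.

```python
import numpy as np
import soundfile as sf  # any WAV I/O library would do

def mix_at_snr(vocal, instrumental, snr_db):
    """Mix an (unpaired) instrumental track under a vocal clip at a target SNR in dB."""
    # Tile or trim the instrumental so both signals have the same length.
    if len(instrumental) < len(vocal):
        reps = int(np.ceil(len(vocal) / len(instrumental)))
        instrumental = np.tile(instrumental, reps)
    instrumental = instrumental[: len(vocal)]

    # Scale the instrumental so that 10*log10(P_vocal / P_instrumental) == snr_db.
    vocal_power = np.mean(vocal ** 2) + 1e-10
    inst_power = np.mean(instrumental ** 2) + 1e-10
    scale = np.sqrt(vocal_power / (inst_power * 10.0 ** (snr_db / 10.0)))
    return vocal + scale * instrumental

# Hypothetical file names: the vocal would come from a singing clip and the
# instrumental from an unrelated song, so the accompaniment carries no
# deepfake-specific cue of its own.
vocal, sr = sf.read("vocal_clip.wav")
instrumental, _ = sf.read("unpaired_instrumental.wav")
augmented = mix_at_snr(vocal, instrumental, snr_db=np.random.uniform(5, 20))
sf.write("augmented_clip.wav", augmented, sr)
```

Training a detector on such mixtures exposes it to accompaniment without handing it any forgery-specific signal, which is consistent with accompaniment behaving like augmentation rather than an intrinsic cue.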
Approach
The researchers combined behavioral and representational analyses to study the impact of instrumental music on SingFake detection. The behavioral analysis tested different model backbones on varied inputs (vocal-only, vocal with paired or unpaired instrumental accompaniment) and on restricted frequency subbands. The representational analysis probed the encoders before and after fine-tuning using speech and music representation benchmarks; sketches of both analyses follow below.
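For the subband part of the behavioral analysis, the sketch below band-limits a clip before it is fed to a detector, so only one frequency range is available at test time. It is a minimal example assuming SciPy's Butterworth filters; the cutoff frequencies and file name are placeholders, not the paper's exact configuration.

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def band_limit(audio, sr, low_hz, high_hz):
    """Keep only the [low_hz, high_hz] subband of a clip (zero-phase Butterworth)."""
    nyq = sr / 2.0
    if low_hz <= 0:                                    # low-pass only
        sos = butter(8, high_hz / nyq, btype="low", output="sos")
    elif high_hz >= nyq:                               # high-pass only
        sos = butter(8, low_hz / nyq, btype="high", output="sos")
    else:                                              # band-pass
        sos = butter(8, [low_hz / nyq, high_hz / nyq], btype="band", output="sos")
    return sosfiltfilt(sos, audio)

audio, sr = sf.read("test_clip.wav")             # hypothetical test clip (mono assumed)
low_band = band_limit(audio, sr, 0, 4000)        # feed each version to the detector
high_band = band_limit(audio, sr, 4000, sr / 2)  # and compare detection accuracy
```

For the representational side, a minimal layer-wise probing sketch on a frozen Wav2Vec2-style encoder is shown next; the checkpoint name, pooling, and probe head are illustrative assumptions, and the MERT case is analogous. Running the same probes on the encoder before and after SingFake fine-tuning reveals which kinds of information are gained or lost.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Assumed checkpoint; swap in the fine-tuned encoder to compare before/after.
ckpt = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt, output_hidden_states=True).eval()

waveform = torch.randn(16000 * 4)  # stand-in for a 4-second, 16 kHz clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = encoder(**inputs).hidden_states  # one tensor per encoder layer

# Mean-pool each layer over time; a small linear probe per layer can then be
# trained on a benchmark task (speaker, content, paralinguistics, semantics)
# to measure what information each layer retains.
layer_embeddings = [h.mean(dim=1) for h in hidden_states]
probe = torch.nn.Linear(layer_embeddings[0].shape[-1], 2)  # e.g., a binary probe head
logits = probe(layer_embeddings[-1])
```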
Datasets
SingFake dataset [3], MUSAN dataset [28]
Model(s)
Spec-ResNet [22], AASIST [23], W2V2-AASIST [24], SingGraph [12], Wav2Vec2 [25], MERT [26]
Author countries
Taiwan, USA