Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models

Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

Published: 2025-06-03 20:16:41+00:00

Comment: Accepted to INTERSPEECH 2025

AI Summary

This paper introduces the novel task of Singing Voice Deepfake Source Attribution (SVDSA) and hypothesizes that Multimodal Foundation Models (MMFMs) are the most effective for it due to their cross-modality pre-training. It proposes COFFE, a novel framework that employs Chernoff Distance as a loss function for effective fusion of foundation models. In experiments, COFFE with MMFMs achieves the best performance for SVDSA, establishing the first benchmark for the task.

Abstract

In this work, we introduce the task of singing voice deepfake source attribution (SVDSA). We hypothesize that multimodal foundation models (MMFMs), such as ImageBind and LanguageBind, will be most effective for SVDSA, as their cross-modality pre-training better equips them to capture subtle source-specific characteristics, such as the unique timbre, pitch manipulation, or synthesis artifacts of each singing voice deepfake source. Our experiments with MMFMs, speech foundation models, and music foundation models verify the hypothesis that MMFMs are the most effective for SVDSA. Furthermore, inspired by related research, we also explore fusion of foundation models (FMs) for improved SVDSA. To this end, we propose COFFE, a novel framework which employs Chernoff Distance as a novel loss function for effective fusion of FMs. Through COFFE with the symphony of MMFMs, we attain the topmost performance in comparison to all individual FMs and baseline fusion methods.
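For context (this is the standard textbook definition, not quoted from the paper), the Chernoff distance between two distributions P and Q with densities p and q is

```latex
C(P, Q) = \max_{0 < \alpha < 1} \left[ -\ln \int p(x)^{\alpha} \, q(x)^{1-\alpha} \, dx \right]
```

In practice a fixed $\alpha \in (0, 1)$ is often used (the Chernoff $\alpha$-divergence); $\alpha = 1/2$ recovers the Bhattacharyya distance. How COFFE instantiates this over FM representations is detailed in the paper.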


Key findings
Multimodal Foundation Models (MMFMs) significantly outperform unimodal speech and music foundation models for SVDSA, validating the hypothesis that cross-modality pre-training is crucial for capturing source-specific characteristics. The proposed COFFE framework, which uses Chernoff Distance for feature alignment, achieved the highest performance when fusing the MMFMs LanguageBind and ImageBind (91.16% accuracy, 3.63% EER). This work establishes the first benchmark for the SVDSA task.
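For reference, EER is the operating point where the false positive rate equals the false negative rate. A common way to compute it from binary labels and scores (e.g., per source in a one-vs-rest setup, then averaged) is sketched below; the aggregation across the multiclass SVDSA labels is an assumption, not taken from the paper.

```python
# Minimal sketch: equal error rate (EER) from binary labels and scores.
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 for target class, 0 otherwise; scores: higher = more target-like."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FPR and FNR cross
    return float((fpr[idx] + fnr[idx]) / 2)
```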
Approach
The authors introduce Singing Voice Deepfake Source Attribution (SVDSA) and hypothesize that Multimodal Foundation Models (MMFMs) like ImageBind and LanguageBind are superior for this task due to their cross-modality pre-training. They propose COFFE, a novel framework for fusing foundation models, which employs Chernoff Distance as a loss function to align their feature representations for improved attribution.
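To make the alignment objective concrete, below is a minimal sketch (assuming PyTorch) of a Chernoff-distance loss between two batches of FM embeddings, using the closed form for diagonal Gaussians fitted to each batch. The function name, the diagonal-Gaussian approximation, and all hyperparameters are illustrative assumptions; the paper's exact formulation may differ.

```python
# Minimal sketch (not the paper's code): Chernoff alpha-divergence between
# diagonal-Gaussian fits of two embedding batches, usable as an alignment loss.
import torch

def chernoff_distance(z1: torch.Tensor, z2: torch.Tensor,
                      alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings from two foundation models.
    Fits N(mu, diag(var)) to each batch and returns the closed-form
    Chernoff alpha-divergence; alpha=0.5 gives the Bhattacharyya distance."""
    mu1, var1 = z1.mean(dim=0), z1.var(dim=0) + eps
    mu2, var2 = z2.mean(dim=0), z2.var(dim=0) + eps
    var_a = (1 - alpha) * var1 + alpha * var2          # blended variance
    quad = ((mu1 - mu2) ** 2 / var_a).sum()            # mean-separation term
    log_det = (var_a.log().sum()
               - (1 - alpha) * var1.log().sum()
               - alpha * var2.log().sum())             # covariance mismatch term
    return 0.5 * alpha * (1 - alpha) * quad + 0.5 * log_det
```

Minimizing this term pulls the two FMs' embedding distributions toward each other while the fused classifier is trained.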
Datasets
CtrSVDD [24] (synthetic samples only, in Chinese and Japanese)
Model(s)
WavLM, UniSpeech-SAT, Wav2Vec2, XLS-R, Whisper, MMS, x-vector (Speech FMs); MERT variants (MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0) and music2vec-v1 (Music FMs); ImageBind and LanguageBind (Multimodal FMs); the COFFE framework leveraging Chernoff Distance and a downstream network (CNN or FCN).
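Putting the pieces together, a plausible COFFE-style downstream network (here the FCN option; the paper also mentions a CNN head) could look like the sketch below. Embedding dimensions, the projection size, the number of deepfake sources, and the loss weight are all illustrative assumptions, and `chernoff_distance` refers to the earlier sketch.

```python
# Hedged sketch of a COFFE-style fusion head: project two FM embeddings,
# align them with the Chernoff loss above, and classify the deepfake source.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, dim_a=1024, dim_b=768, proj=256, n_sources=10):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, proj)   # e.g. ImageBind audio features
        self.proj_b = nn.Linear(dim_b, proj)   # e.g. LanguageBind audio features
        self.classifier = nn.Sequential(
            nn.Linear(2 * proj, 128), nn.ReLU(), nn.Linear(128, n_sources))

    def forward(self, feat_a, feat_b):
        za, zb = self.proj_a(feat_a), self.proj_b(feat_b)
        return self.classifier(torch.cat([za, zb], dim=-1)), za, zb

# Illustrative training objective: cross-entropy on source labels plus the
# Chernoff alignment term (the weight 0.1 is an assumption).
model = FusionHead()
feat_a, feat_b = torch.randn(8, 1024), torch.randn(8, 768)
labels = torch.randint(0, 10, (8,))
logits, za, zb = model(feat_a, feat_b)
loss = F.cross_entropy(logits, labels) + 0.1 * chernoff_distance(za, zb)
```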
Author countries
India, Estonia