Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

Authors: Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2026-04-06 16:42:59+00:00

Comment: Submitted to INTERSPEECH 2026

AI Summary

This study presents the first systematic analysis of high-resolution (44.1 kHz) audio for Singing Voice Deepfake (SingFake) Detection (SVDD), addressing the inadequacy of conventional 16 kHz detectors. It proposes Sing-HiResNet, a joint fullband-subband modeling framework that concurrently captures global spectral context and fine-grained frequency-specific synthesis artifacts. The framework significantly outperforms 16 kHz-sampled models, highlighting the critical role of high-resolution audio and strategic subband integration for robust in-the-wild detection.

Abstract

Rapid advances in singing voice synthesis have heightened the risk of unauthorized vocal imitation, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing exhibits complex pitch patterns, a wide dynamic range, and rich timbral variation. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband branch captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
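The abstract's core premise rests on sampling theory: a 16 kHz-sampled detector can only observe content below the 8 kHz Nyquist limit, while 44.1 kHz audio preserves content up to 22.05 kHz. A minimal numpy sketch (the 12 kHz test tone is an illustrative choice, not from the paper) makes this concrete:

```python
import numpy as np

SR_HI, SR_LO = 44100, 16000
nyquist_hi = SR_HI / 2  # 22050 Hz: upper limit of 44.1 kHz audio
nyquist_lo = SR_LO / 2  # 8000 Hz: upper limit of 16 kHz-sampled detectors

# A 12 kHz component (e.g., a high-frequency synthesis artifact) is
# representable at 44.1 kHz but lost entirely after 16 kHz resampling.
f = 12000.0
t = np.arange(SR_HI) / SR_HI          # exactly 1 second of audio
x = np.sin(2 * np.pi * f * t)

spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / SR_HI)
peak = freqs[np.argmax(spec)]         # spectral peak sits at 12000 Hz

print(f < nyquist_hi, f < nyquist_lo)  # True False
```

Any discriminative cue living in that 8-22.05 kHz region is simply invisible to a conventional 16 kHz pipeline, which is the gap the paper targets.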


Key findings
High-resolution (44.1 kHz) audio and joint fullband-subband modeling significantly improve SingFake detection, outperforming 16 kHz models and large-scale SSL-based approaches. The Sing-HiResNet framework achieved state-of-the-art EERs of 1.58% on Test A and 7.45% on Test B of the WildSVDD dataset, representing substantial reductions over baselines. Grad-CAM visualizations confirmed that cross-expert distillation successfully transfers frequency-localized expertise, enhancing the student model's ability to focus on discriminative spectral cues.
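The reported results are Equal Error Rates (EER), the standard SVDD/anti-spoofing metric: the operating point where the false-acceptance and false-rejection rates coincide. A brief self-contained sketch of how EER is computed from detection scores (illustrative only, not the paper's evaluation code):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: find the threshold where the false-accept
    rate (deepfakes accepted) meets the false-reject rate (bonafide
    rejected), and return their average there."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)   # 1 = bonafide, 0 = deepfake
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # deepfakes let through
        frrs.append(np.mean(~accept[labels == 1]))  # bonafide turned away
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))
    return (fars[i] + frrs[i]) / 2.0

# Perfectly separable toy scores -> EER of 0.0
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0]
print(compute_eer(scores, labels))  # 0.0
```

Under this metric, lower is better: the 1.58% (Test A) and 7.45% (Test B) figures mean only a small fraction of trials are misclassified at the equal-error operating point.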
Approach
The proposed Sing-HiResNet framework operates in two phases: first, fullband and subband expert models are trained; second, their outputs are combined via joint fullband-subband fusion strategies. A fullband ResNet18 model captures global spectral context, while multiple ResNet18-based subband expert models isolate localized artifacts within specific frequency ranges of 44.1 kHz audio. The expert outputs are then integrated using one of several fusion strategies: decision-level aggregation, feature-level concatenation, cross-expert interaction (via multi-head self-attention, MHSA), or cross-expert distillation.
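The simplest of the listed fusion strategies, decision-level aggregation, can be sketched in a few lines of numpy. Everything here is an illustrative stand-in under stated assumptions: the "experts" are toy linear scorers on band energy rather than the paper's ResNet18 models, and the four equal-width subbands are a hypothetical split, not the paper's band layout:

```python
import numpy as np

rng = np.random.default_rng(0)

def band_edges(n_bins=513, n_bands=4):
    """Split the 0..Nyquist rFFT bins into contiguous subbands
    (equal-width split; a hypothetical choice for illustration)."""
    idx = np.linspace(0, n_bins, n_bands + 1).astype(int)
    return list(zip(idx[:-1], idx[1:]))

def expert_score(spec_band, w):
    """Stand-in for a subband expert: sigmoid of weighted mean
    log-energy of the band (NOT the paper's ResNet18 expert)."""
    feat = np.log1p(spec_band).mean()
    return 1.0 / (1.0 + np.exp(-w * feat))

# Fake magnitude spectrogram of a 44.1 kHz clip: (freq_bins, frames)
spec = np.abs(rng.standard_normal((513, 100)))

full_score = expert_score(spec, w=0.8)              # fullband expert
sub_scores = [expert_score(spec[lo:hi], w=1.0)      # one expert per band
              for lo, hi in band_edges()]

# Decision-level aggregation: average the fullband and subband scores.
final = np.mean([full_score] + sub_scores)
```

Feature-level concatenation would instead stack the experts' embeddings before a shared classifier, and cross-expert interaction/distillation let the experts exchange information during training rather than only at the output.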
Datasets
WildSVDD dataset
Model(s)
ResNet18
Author countries
Taiwan