Not All Deepfakes Are Created Equal: Triaging Audio Forgeries for Robust Deepfake Singer Identification

Authors: Davide Salvi, Hendrik Vincent Koops, Elio Quinton

Published: 2025-10-20 12:16:52+00:00

AI Summary

The paper addresses the challenge of identifying singers in highly realistic vocal deepfakes by introducing a two-stage pipeline that triages audio forgeries by quality. A discriminator first filters out low-quality deepfakes that fail to reproduce vocal likeness, so that identification effort is focused on the most harmful, high-quality fakes. Experiments demonstrate that this triage approach significantly improves singer identification performance across both authentic and synthetic content compared to traditional baselines.

Abstract

The proliferation of highly realistic singing voice deepfakes presents a significant challenge to protecting artist likeness and content authenticity. Automatic singer identification in vocal deepfakes is a promising avenue for artists and rights holders to defend against unauthorized use of their voice, but remains an open research problem. Based on the premise that the most harmful deepfakes are those of the highest quality, we introduce a two-stage pipeline to identify a singer's vocal likeness. It first employs a discriminator model to filter out low-quality forgeries that fail to accurately reproduce vocal likeness. A subsequent model, trained exclusively on authentic recordings, identifies the singer in the remaining high-quality deepfakes and authentic audio. Experiments show that this system consistently outperforms existing baselines on both authentic and synthetic content.


Key findings
The two-stage pipeline (D ∘ S) significantly improved singer identification performance compared to using the identification model (S) alone, achieving an average AUC of 90.23% versus 81.76%. The results suggest that the performance degradation of singer identification models is primarily driven by low-quality deepfakes in which the vocal likeness is unidentifiable, supporting the value of the initial quality-based discriminator step.
Approach
The proposed D ∘ S pipeline first employs a Light Convolutional Neural Network (LCNN) discriminator (D) to classify input audio as either a low-quality deepfake (discarded) or authentic/high-quality audio (passed on). The remaining tracks are then processed by a singer identification model (S), an ECAPA-TDNN architecture trained exclusively on authentic recordings, which identifies the singer via nearest neighbor search on the extracted embeddings, as illustrated in the sketch below.
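
The following is a minimal Python sketch of this D ∘ S triage logic under stated assumptions: discriminator_score and extract_embedding are hypothetical stand-ins for the paper's trained LCNN and ECAPA-TDNN models, the quality threshold is an assumed operating point, and the nearest-neighbor step uses cosine similarity against a gallery of embeddings from authentic reference recordings. It is illustrative only, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def discriminator_score(audio):
        """Dummy stand-in for the LCNN discriminator D.
        Assumed to return a quality score in [0, 1]; low values indicate a
        low-quality deepfake that fails to reproduce the vocal likeness."""
        return float(rng.uniform())

    def extract_embedding(audio):
        """Dummy stand-in for the ECAPA-TDNN encoder S.
        Assumed to return a fixed-size singer embedding for a vocal track."""
        return rng.standard_normal(192)

    QUALITY_THRESHOLD = 0.5  # assumed threshold for D, tuned on validation data

    def identify_singer(audio, gallery):
        """Two-stage D ∘ S pipeline: discard low-quality forgeries, otherwise
        identify the singer by nearest neighbor over reference embeddings.

        gallery: dict mapping singer name -> list of embeddings extracted
                 from authentic recordings only.
        Returns the predicted singer, or None if the track is triaged out.
        """
        # Stage 1 (D): filter out low-quality forgeries.
        if discriminator_score(audio) < QUALITY_THRESHOLD:
            return None  # vocal likeness not reproduced; no identity to match

        # Stage 2 (S): embed the track and find the closest reference singer.
        query = extract_embedding(audio)
        query = query / np.linalg.norm(query)

        best_singer, best_sim = None, -np.inf
        for singer, refs in gallery.items():
            for ref in refs:
                ref = ref / np.linalg.norm(ref)
                sim = float(query @ ref)  # cosine similarity
                if sim > best_sim:
                    best_singer, best_sim = singer, sim
        return best_singer

    # Example usage with dummy reference embeddings per singer.
    gallery = {
        "singer_a": [extract_embedding(None) for _ in range(3)],
        "singer_b": [extract_embedding(None) for _ in range(3)],
    }
    print(identify_singer(audio=None, gallery=gallery))

In the paper, S is trained only on authentic recordings and D handles the quality-based triage; the threshold and cosine-similarity matching above are placeholder design choices for illustration.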
Datasets
PRIVATE (proprietary), ARTIST20, CTRSVDD, WILDSVDD
Model(s)
Light Convolutional Neural Network (LCNN), ECAPA-TDNN
Author countries
United Kingdom