Not All Deepfakes Are Created Equal: Triaging Audio Forgeries for Robust Deepfake Singer Identification

Authors: Davide Salvi, Hendrik Vincent Koops, Elio Quinton

Published: 2025-10-20 12:16:52+00:00

Comment: Accepted for presentation at the NeurIPS 2025 Workshop on Generative and Protective AI for Content Creation (non-archival)

AI Summary

This paper introduces a novel two-stage pipeline for robust singer identification in singing voice deepfakes, prioritizing high-quality forgeries. It first employs a discriminator to filter out low-quality deepfakes that fail to accurately reproduce vocal likeness. A subsequent singer identification model, trained exclusively on authentic recordings, then identifies the artist in the remaining high-quality deepfakes and authentic audio, outperforming existing baselines.

Abstract

The proliferation of highly realistic singing voice deepfakes presents a significant challenge to protecting artist likeness and content authenticity. Automatic singer identification in vocal deepfakes is a promising avenue for artists and rights holders to defend against unauthorized use of their voice, but remains an open research problem. Based on the premise that the most harmful deepfakes are those of the highest quality, we introduce a two-stage pipeline to identify a singer's vocal likeness. It first employs a discriminator model to filter out low-quality forgeries that fail to accurately reproduce vocal likeness. A subsequent model, trained exclusively on authentic recordings, identifies the singer in the remaining high-quality deepfakes and authentic audio. Experiments show that this system consistently outperforms existing baselines on both authentic and synthetic content.


Key findings
The proposed two-stage pipeline consistently outperforms existing baselines on both authentic and synthetic content. The discriminator effectively filters out low-quality deepfakes, which makes the subsequent identification task more tractable and meaningful: the singer identification model only ever operates on realistic vocal likenesses. ECAPA-TDNN proved to be a strong performer for singer identification in this context.
Approach
The method is a two-stage pipeline. In the first stage, a Light Convolutional Neural Network (LCNN) acts as a discriminator that filters out low-quality deepfakes. In the second stage, an ECAPA-TDNN model, trained solely on authentic recordings, performs singer identification on the high-quality deepfakes and authentic audio passed through by the discriminator.
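The control flow of such a pipeline can be sketched as follows. This is an illustrative Python sketch only, not the authors' implementation: `discriminator_score`, `identify_singer`, and the clip fields are hypothetical stand-ins for the LCNN discriminator and the ECAPA-TDNN singer-ID model described above, and the 0.5 threshold is an assumed operating point.

```python
# Illustrative sketch of a two-stage triage-then-identify pipeline.
# All names and data fields below are hypothetical placeholders.

def discriminator_score(clip):
    """Stand-in for the LCNN discriminator: returns a realism score in [0, 1]."""
    return clip["quality"]

def identify_singer(clip):
    """Stand-in for the ECAPA-TDNN singer-ID model trained on authentic audio."""
    return clip["predicted_artist"]

def two_stage_pipeline(clips, threshold=0.5):
    """Stage 1 rejects low-quality forgeries; stage 2 labels the rest."""
    results = []
    for clip in clips:
        if discriminator_score(clip) < threshold:
            results.append((clip["id"], "rejected: low-quality forgery"))
        else:
            results.append((clip["id"], identify_singer(clip)))
    return results
```

For example, a clip with a realism score of 0.9 would be passed on to the singer-ID model, while one scoring 0.2 would be rejected at stage 1 and never reach identification.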
Datasets
PRIVATE (proprietary), ARTIST20, CTRSVDD, WILDSVDD
Model(s)
Light Convolutional Neural Network (LCNN) for the discriminator, ECAPA-TDNN for singer identification.
Author countries
United Kingdom