Audio Deepfake Attribution: An Initial Dataset and Investigation

Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Jie Chen

Published: 2022-08-21 05:15:40+00:00

Comment: 13 pages, 5 figures. arXiv admin note: text overlap with arXiv:2208.10489v3

AI Summary

This paper introduces Audio Deepfake Attribution (ADA), a novel task for identifying the source generation tools of deepfake audio, moving beyond binary detection. It presents the first dataset for this purpose, also named ADA, and proposes the Class-Representation Multi-Center Learning (CRML) method to tackle the challenge of open-set attribution, particularly for unknown audio generation tools. The CRML method effectively addresses real-world open-set risks by learning discriminative representations.

Abstract

The rapid progress of deep speech synthesis models has posed significant threats to society such as malicious manipulation of content. This has led to an increase in studies aimed at detecting so-called deepfake audio. However, existing works focus on the binary detection of real audio and fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, binary classification alone is insufficient. It is essential to identify the source of deepfake audio. Therefore, audio deepfake attribution has emerged as a new challenge. To this end, we designed the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA), and conducted a comprehensive investigation on system fingerprints. To address the challenges of attribution of continuously emerging unknown audio generation tools in the real world, we propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA). CRML enhances the global directional variation of representations, ensuring the learning of discriminative representations with strong intra-class similarity and inter-class discrepancy among known classes. Finally, the strong class discrimination capability learned from known classes is extended to both known and unknown classes. Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios. The dataset is publicly available at: https://zenodo.org/records/13318702, and https://zenodo.org/records/13340666.


Key findings
The proposed CRML method significantly outperforms classical baselines for open-set audio deepfake attribution, effectively mitigating open-set risks. Among pipeline models, WavLM-SENet achieved the best attribution performance on clean audio, while wav2vec 2.0 XLS-R-SENet adapted best to compressed environments. Among end-to-end models, RawBMamba performed strongly on both the clean and compressed sets.
Approach
The authors propose the Class-Representation Multi-Center Learning (CRML) method for Open-Set Audio Deepfake Attribution (OSADA). CRML learns discriminative representations by enhancing global directional variation, ensuring strong intra-class similarity and inter-class discrepancy among known classes. This allows the model to effectively distinguish between known and unknown audio generation tools.
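The core open-set mechanism described above, attributing a sample to a known generation tool only when its representation lies close enough to that tool's learned class center, and rejecting it as unknown otherwise, can be illustrated with a generic nearest-center sketch. This is not the authors' CRML implementation; the single-center-per-class design, cosine distance, and rejection threshold are simplifying assumptions for illustration.

```python
import numpy as np

def fit_class_centers(embeddings, labels):
    """Compute one center per known class from training embeddings.
    (CRML uses richer multi-center class representations; one center
    per class is a simplification for this sketch.)"""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def attribute(embedding, centers, threshold):
    """Assign the sample to the nearest known-class center, or reject it
    as 'unknown' when even the nearest center is farther than the
    (illustrative) cosine-distance threshold."""
    best_class, best_dist = None, np.inf
    for c, center in centers.items():
        cos = embedding @ center / (
            np.linalg.norm(embedding) * np.linalg.norm(center)
        )
        dist = 1.0 - cos  # cosine distance in [0, 2]
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class if best_dist <= threshold else "unknown"
```

The sketch shows why strong intra-class similarity and inter-class discrepancy matter: tight, well-separated clusters for known tools make a single distance threshold sufficient to flag samples from unseen generation tools.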
Datasets
Audio Deepfake Attribution (ADA) dataset (clean and compressed subsets), AISHELL-1, AISHELL-3, THCHS-30, Aidatatang 200zh.
Model(s)
X-vector (TDNN), SE-ResNet, ResNet (18-layer), LCNN, RawNet2, RawGAT-ST, AASIST, wav2vec2.0-AASIST, RawFormer, RawBMamba.
Author countries
China