Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Haonan Cheng, Long Ye

Published: 2024-08-13 14:15:15+00:00

AI Summary

This paper addresses open-domain audio deepfake detection for the ASVspoof5 Track 1 challenge by investigating data expansion, data augmentation, and self-supervised learning (SSL) features. The authors introduce Frequency Mask, a data augmentation method that masks specific frequency bands to counter the high-frequency gaps characteristic of the ASVspoof5 dataset. By fusing scores that combine temporal information at various scales with multiple SSL features, their approach achieved a minDCF of 0.0158 and an EER of 0.55% on the ASVspoof5 Track 1 evaluation progress set.

Abstract

ASVspoof5, the fifth edition of the ASVspoof series, is one of the largest global audio security challenges. It aims to advance the development of countermeasures (CMs) that discriminate between bonafide and spoofed speech utterances. In this paper, we focus on the problem of open-domain audio deepfake detection, which corresponds directly to the ASVspoof5 Track 1 open condition. First, we comprehensively investigate various CMs on ASVspoof5, including data expansion, data augmentation, and self-supervised learning (SSL) features. Due to the high-frequency gaps characteristic of the ASVspoof5 dataset, we introduce Frequency Mask, a data augmentation method that masks specific frequency bands to improve CM robustness. Combining various scales of temporal information with multiple SSL features, our experiments achieved a minDCF of 0.0158 and an EER of 0.55% on the ASVspoof5 Track 1 evaluation progress set.


Key findings
The proposed Frequency Mask data augmentation method significantly improved countermeasure robustness. Integrating various temporal scales and multi-viewed SSL features via score fusion achieved state-of-the-art performance on the ASVspoof5 evaluation progress set, with a minDCF of 0.0158 and an EER of 0.55%. However, a significant performance drop on the ASVspoof5 evaluation full set suggests the presence of unseen deepfake generation or codec methods there.
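The Frequency Mask idea (zeroing out selected frequency bands of a spectrogram so the CM cannot over-rely on band-limited artifacts) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the band widths, mask count, and the `frequency_mask` helper name are assumptions.

```python
import numpy as np

def frequency_mask(spec, max_band=20, num_masks=1, rng=None):
    """Zero out randomly chosen frequency bands of a (freq, time) spectrogram.

    Illustrative sketch of band masking in the spirit of the paper's
    Frequency Mask augmentation; parameter choices are assumptions.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()          # leave the caller's spectrogram untouched
    n_freq = spec.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_band + 1))        # band width in bins
        start = int(rng.integers(0, n_freq - width + 1))  # band start bin
        spec[start:start + width, :] = 0.0                # mask the band
    return spec
```

Applied during training, each utterance sees different masked bands, which should discourage the model from keying on any single frequency region.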
Approach
The authors tackle audio deepfake detection by exploring data expansion, data augmentation (including the novel Frequency Mask method), and self-supervised learning (SSL) features. They extract SSL features from pre-trained models at different temporal lengths and combine multiple countermeasure (CM) models, from both temporal-scale and feature-type perspectives, via logits score fusion.
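The logits score fusion step can be sketched as a weighted average of per-utterance scores from the individual CMs. This is a hedged illustration under the assumption of simple weighted averaging; the paper does not specify the exact weights, and the `fuse_logits` helper is hypothetical.

```python
import numpy as np

def fuse_logits(logit_sets, weights=None):
    """Weighted average of per-utterance logit scores from multiple CMs.

    logit_sets: list of 1-D arrays, one per CM (e.g. different temporal
    scales or SSL front-ends), each of length n_utterances.
    Equal weights are assumed when none are given.
    """
    logits = np.stack([np.asarray(s, dtype=float) for s in logit_sets])
    if weights is None:
        weights = np.full(len(logit_sets), 1.0 / len(logit_sets))
    weights = np.asarray(weights, dtype=float)
    # (n_models,) . (n_models, n_utterances) -> (n_utterances,)
    return np.tensordot(weights, logits, axes=1)
```

A fused score per utterance is then thresholded (or scored with minDCF/EER) exactly as a single CM's output would be.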
Datasets
ASVspoof5 (training, development, evaluation progress/full sets), ASVspoof2019LA, MLAAD, Codecfake, MUSAN, RIR, LibriSpeech (for SSL pre-training)
Model(s)
WavLM (wavlm-base), Wav2vec2-large, UniSpeech (UniSpeech-SAT-Base), AASIST (SSL-adapted version)
Author countries
China