Authors: Juan M. Martín-Doñas, Aitor Álvarez
Published: 2022-03-03 08:49:17+00:00
Comment: Accepted by ICASSP 2022
AI Summary
This paper presents Vicomtech's audio deepfake detection system for the 2022 ADD challenge, utilizing a pre-trained Wav2Vec2 model as a feature extractor combined with a downstream classifier. The approach exploits contextualized speech representations from Wav2Vec2's transformer layers and employs data augmentation to enhance robustness in challenging environments. The system demonstrates strong performance in both the ASVspoof 2021 and 2022 ADD challenges across various realistic scenarios.
Abstract
This paper describes our submitted systems to the 2022 ADD challenge within tracks 1 and 2. Our approach is based on the combination of a pre-trained wav2vec2 feature extractor and a downstream classifier to detect spoofed audio. This method exploits the contextualized speech representations at the different transformer layers to fully capture discriminative information. Furthermore, the classification model is adapted to the application scenario using different data augmentation techniques. We evaluate our system for audio synthesis detection in both the ASVspoof 2021 and the 2022 ADD challenges, showing its robustness and good performance in realistic challenging environments such as telephonic and audio codec systems, noisy audio, and partial deepfakes.
Key findings
The proposed Wav2Vec2-based system achieved competitive results, ranking first in ADD 2022 Track 1 and fourth in Track 2. Data augmentation techniques, particularly low-pass FIR filters and the use of adaptation data, significantly improved the model's robustness and discrimination capabilities across challenging scenarios like telephonic systems, audio codecs, noisy environments, and partial deepfakes. The approach effectively utilizes information from different Wav2Vec2 transformer layers, adapting their weights based on the specific detection scenario.
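The low-pass FIR filtering mentioned above band-limits training audio so the model also sees telephone-like spectra. A minimal sketch of this kind of augmentation, using a windowed-sinc FIR design (the tap count and cutoff here are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def lowpass_fir_taps(num_taps=101, cutoff_hz=4000, sr=16000):
    """Windowed-sinc low-pass FIR design (Hamming window).
    Values are illustrative, not the paper's configuration."""
    fc = cutoff_hz / sr                       # normalized cutoff (cycles/sample)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * fc * np.sinc(2 * fc * n)          # ideal low-pass impulse response
    h *= np.hamming(num_taps)                 # taper to reduce stopband ripple
    return h / h.sum()                        # normalize to unit DC gain

def lowpass_augment(wave, **kwargs):
    """Band-limit a waveform, e.g. to mimic narrowband telephone channels."""
    return np.convolve(wave, lowpass_fir_taps(**kwargs), mode="same")
```

In practice one would randomize the cutoff per utterance so the classifier cannot latch onto a single fixed bandwidth.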
Approach
The system employs a pre-trained Wav2Vec2 (W2V2) model as a feature extractor, leveraging contextualized speech representations from its different transformer layers. These extracted features are then fed into a downstream classification model consisting of feed-forward layers, attentive statistical pooling, and a cosine layer for scoring. The classification model is further adapted to specific application scenarios through various data augmentation techniques, including low-pass FIR filters and the generation of new partially fake audio samples.
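The per-layer exploitation of W2V2 representations is commonly realized as a learned, softmax-normalized weighted sum over the transformer layers' outputs. A minimal numpy sketch of that combination step (the weights would be trainable parameters in the real system; here they are just inputs):

```python
import numpy as np

def combine_layers(layer_outputs, weights):
    """Softmax-weighted combination of W2V2 transformer-layer outputs.
    layer_outputs: array of shape (num_layers, time, dim).
    weights: unnormalized scalar per layer (learned during training;
    passed in here for illustration).
    Returns a (time, dim) feature sequence for the downstream classifier."""
    w = np.exp(weights - weights.max())
    w /= w.sum()                                   # softmax over layers
    return np.tensordot(w, layer_outputs, axes=1)  # contract layer axis
```

Learning these weights per scenario is what lets the system emphasize whichever layers carry the most discriminative information for a given detection task.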
Datasets
ADD 2022 challenge database (train and dev sets based on AISHELL-3), ASVspoof 2021 challenge database (logical access and speech deepfake partitions, utilizing ASVspoof 2019 LA train/dev sets and VCTK corpus).
Model(s)
Wav2Vec2 (W2V2) large models (XLSR-53 and XLS-R, pre-trained on 53 and 128 languages, respectively) as feature extractors, and a downstream classifier composed of feed-forward (FF) layers with ReLU and dropout, attentive statistical pooling, a linear layer, and a cosine layer, trained with a one-class softmax loss function.
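The pooling and scoring stages of the classifier can be sketched compactly: attentive statistical pooling turns a variable-length frame sequence into a fixed-size embedding (attention-weighted mean and standard deviation), and the cosine layer scores that embedding against a learned class direction. A hedged numpy illustration, with the attention scores and class weight treated as given inputs rather than the trained parameters:

```python
import numpy as np

def attentive_stat_pooling(frames, att_scores):
    """Attentive statistical pooling: attention-weighted mean and std
    over time, concatenated into one fixed-size utterance embedding.
    frames: (time, dim); att_scores: (time,) unnormalized (learned by an
    attention network in the real system)."""
    a = np.exp(att_scores - att_scores.max())
    a /= a.sum()                                   # softmax over frames
    mean = (a[:, None] * frames).sum(axis=0)
    var = (a[:, None] * (frames - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(np.maximum(var, 1e-9))])

def cosine_score(embedding, class_weight):
    """Cosine layer: similarity between the utterance embedding and a
    learned class weight vector, as used by one-class softmax scoring."""
    return embedding @ class_weight / (
        np.linalg.norm(embedding) * np.linalg.norm(class_weight))
```

The one-class softmax loss then pushes bona fide embeddings toward the class direction and spoofed ones away from it, so the cosine score itself can serve as the detection score.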
Author countries
Spain