Collaborative Watermarking for Adversarial Speech Synthesis

Authors: Lauri Juvela, Xin Wang

Published: 2023-09-26 19:43:14+00:00

Comment: Accepted to ICASSP 2024

AI Summary

This paper proposes a collaborative training scheme for synthetic speech watermarking, in which a HiFi-GAN neural vocoder is trained jointly with the ASVspoof 2021 baseline countermeasure models. This approach consistently improves detection performance over conventional classifier training. Furthermore, collaborative training, especially when paired with augmentation strategies, improves robustness against noise and time-stretching with minimal adverse effect on perceptual quality.

Abstract

Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on perceptual quality of vocoded speech.


Key findings
Collaborative training consistently improved detection performance over conventional passive countermeasure training across various test conditions. While the LFCC-LCNN detector struggled with additive noise, collaborative training with RawNet2 and data augmentation remained robust to it. Listening tests confirmed that the proposed collaborative training method had no statistically significant adverse effect on the perceptual quality of the vocoded speech.
Approach
The paper proposes a collaborative training scheme where a generative model (HiFi-GAN) is trained jointly with a watermark detector (ASVspoof baselines). The generator actively embeds a detectable watermark into the synthetic speech by allowing gradients from the detector to influence its training, aiming to make the generated speech easier for the detector to identify without impacting human perception.
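The idea above can be sketched with toy scalar losses. This is a minimal illustration, not the paper's implementation: the weight `lambda_wm` and the toy loss values are assumptions introduced here. The key contrast with adversarial GAN training is that the generator's extra term pushes its own output *towards* the "synthetic" label, so the detector's job gets easier rather than harder.

```python
# Minimal sketch of a collaborative watermarking objective (assumed form;
# `lambda_wm` and the toy numbers are illustrative, not from the paper).
# Generator (vocoder): minimise its usual GAN loss PLUS a watermark loss
# that rewards the detector labelling its output as synthetic.
# Detector: trained as a conventional real-vs-synthetic classifier.

import math

def bce(p, target):
    """Binary cross-entropy for one predicted probability p and target in {0, 1}."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def detector_loss(p_real, p_fake):
    # Conventional countermeasure training: real -> 1, synthetic -> 0.
    return bce(p_real, 1) + bce(p_fake, 0)

def generator_loss(gan_loss, p_fake, lambda_wm=1.0):
    # Collaborative term: unlike an adversarial discriminator loss, the
    # generator here *helps* detection by driving its own output towards
    # the "synthetic" label (0), i.e. it embeds a detectable mark.
    return gan_loss + lambda_wm * bce(p_fake, 0)

# Toy step: the detector currently scores the synthetic sample at p = 0.4.
print(round(generator_loss(gan_loss=0.25, p_fake=0.4), 4))  # → 0.7608
```

In the actual system both losses would be computed on waveforms and backpropagated through the detector into the HiFi-GAN generator; the scalar version only shows how the two objectives combine.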
Datasets
Voice Cloning Toolkit (VCTK) corpus, MUSAN database (noise subset)
Model(s)
HiFi-GAN (generative model), LFCC-LCNN (watermark detector), RawNet2 (watermark detector)
Author countries
Finland, Japan