Collaborative Watermarking for Adversarial Speech Synthesis

Authors: Lauri Juvela, Xin Wang

Published: 2023-09-26 19:43:14+00:00

Comment: Accepted to ICASSP 2024

AI Summary

This paper proposes a collaborative training scheme for synthetic speech watermarking, in which a HiFi-GAN neural vocoder is trained jointly with the ASVspoof 2021 baseline countermeasure models. This approach consistently improves detection performance over conventional classifier training. Furthermore, collaborative training, especially when paired with augmentation strategies, improves robustness against noise and time-stretching with minimal adverse effect on perceptual quality.

Abstract

Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on perceptual quality of vocoded speech.


Key findings
Collaborative training consistently improved detection performance over conventional passive countermeasure training across various test conditions. While the LFCC-LCNN detector struggled with additive noise, collaborative training with RawNet2 and data augmentation remained robust to it. Listening tests confirmed that the proposed collaborative training method had no statistically significant adverse effect on the perceptual quality of the vocoded speech.
Approach
The paper proposes a collaborative training scheme where a generative model (HiFi-GAN) is trained jointly with a watermark detector (ASVspoof baselines). The generator actively embeds a detectable watermark into the synthetic speech by allowing gradients from the detector to influence its training, aiming to make the generated speech easier for the detector to identify without impacting human perception.
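The idea above can be sketched with toy scalar losses. This is a minimal illustration, not the paper's implementation: the weight `lambda_wm` and the toy loss values are assumptions introduced here. The key contrast with adversarial GAN training is that the generator's extra term pushes its own output *towards* the "synthetic" label, so the detector's job gets easier rather than harder.

```python
# Minimal sketch of a collaborative watermarking objective (assumed form;
# `lambda_wm` and the toy numbers are illustrative, not from the paper).
# Generator (vocoder): minimise its usual GAN loss PLUS a watermark loss
# that rewards the detector labelling its output as synthetic.
# Detector: trained as a conventional real-vs-synthetic classifier.

import math

def bce(p, target):
    """Binary cross-entropy for one predicted probability p and target in {0, 1}."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def detector_loss(p_real, p_fake):
    # Conventional countermeasure training: real -> 1, synthetic -> 0.
    return bce(p_real, 1) + bce(p_fake, 0)

def generator_loss(gan_loss, p_fake, lambda_wm=1.0):
    # Collaborative term: unlike an adversarial discriminator loss, the
    # generator here *helps* detection by driving its own output towards
    # the "synthetic" label (0), i.e. it embeds a detectable mark.
    return gan_loss + lambda_wm * bce(p_fake, 0)

# Toy step: the detector currently scores the synthetic sample at p = 0.4.
print(round(generator_loss(gan_loss=0.25, p_fake=0.4), 4))  # → 0.7608
```

In the actual system both losses would be computed on waveforms and backpropagated through the detector into the HiFi-GAN generator; the scalar version only shows how the two objectives combine.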
Datasets
Voice Cloning Toolkit (VCTK) corpus, MUSAN database (noise subset)
Model(s)
HiFi-GAN (generative model), LFCC-LCNN (watermark detector), RawNet2 (watermark detector)
Author countries
Finland, Japan