StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

Authors: Zhentao Liu, Milos Cernak

Published: 2026-04-13 18:06:43+00:00

Comment: ICASSP 2026

AI Summary

This paper introduces StreamMark, a novel deep learning-based, semi-fragile audio watermarking system designed for proactive deepfake detection. It aims to be robust against benign audio conversions like compression while being fragile to malicious, semantic-altering manipulations such as voice conversion and speech editing. The method utilizes a complex-domain embedding within an Encoder-Distortion-Decoder architecture, trained to differentiate between these two classes of transformations.

Abstract

The rapid advancement of generative AI has made it increasingly challenging to distinguish between deepfake audio and authentic human speech. To overcome the limitations of passive detection methods, we propose StreamMark, a novel deep learning-based, semi-fragile audio watermarking system. StreamMark is designed to be robust against benign audio conversions that preserve semantic meaning (e.g., compression, noise) while remaining fragile to malicious, semantics-altering manipulations (e.g., voice conversion, speech editing). Our method introduces a complex-domain embedding technique within a unique Encoder-Distortion-Decoder architecture, trained explicitly to differentiate between these two classes of transformations. Comprehensive benchmarks demonstrate that StreamMark achieves high imperceptibility (SNR 24.16 dB, PESQ 4.20), is resilient to real-world distortions like Opus encoding, and exhibits principled fragility against a suite of deepfake attacks, with message recovery accuracy dropping to chance levels (~50%), while remaining robust to benign AI-based style transfers (ACC >98%).


Key findings
StreamMark achieved high imperceptibility (SNR 24.16 dB, PESQ 4.20) and remained robust to benign conversions, including real-world Opus encoding (message recovery accuracy, ACC, >99.89%). Against malicious deepfake attacks it exhibited principled fragility, with ACC dropping to chance level (~50%), while staying highly robust (ACC >98%) to benign AI-based style transfers.
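
To make the reported numbers easier to read, the sketch below shows one common way to compute an SNR in dB and a message recovery accuracy (ACC), and why roughly 50% is the chance level for a binary message. It is an illustration only: the helper names `snr_db` and `bit_accuracy`, the toy signals, and the 10,000-bit message length are assumptions, not the paper's evaluation code.

```python
import math
import random


def snr_db(original, watermarked):
    """SNR in dB, treating the watermark residual (watermarked - original) as noise."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((y - x) ** 2 for x, y in zip(original, watermarked))
    if noise_power == 0:
        return math.inf  # identical signals: no embedding noise at all
    return 10.0 * math.log10(signal_power / noise_power)


def bit_accuracy(sent_bits, recovered_bits):
    """Fraction of message bits recovered correctly (ACC as a value in [0, 1])."""
    matches = sum(int(a == b) for a, b in zip(sent_bits, recovered_bits))
    return matches / len(sent_bits)


# Hypothetical toy data: a decoder that has lost the watermark can do no better
# than guessing each bit, so its accuracy concentrates around 0.5.
rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(10_000)]
random_guess = [rng.randint(0, 1) for _ in range(10_000)]
chance_acc = bit_accuracy(message, random_guess)
```

Under this reading, an ACC near 50% after a deepfake attack means the recovered bits carry essentially no information about the embedded message, which is the intended "fragile" outcome.
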
Approach
StreamMark employs a deep learning-based Encoder-Distortion-Decoder architecture that embeds a watermark message in the complex domain of the short-time Fourier transform (STFT). Its semi-fragile behavior comes from a training objective that, via a dual-path distortion layer, optimizes for robustness to benign transformations (e.g., noise, compression) while maximizing fragility to malicious, semantics-altering deepfake attacks (e.g., text-to-speech (TTS), voice conversion (VC)).
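
The summary does not give the loss itself, so the following is only a minimal sketch of what a dual-path semi-fragile objective could look like. It assumes the decoder outputs a per-bit probability, uses binary cross-entropy on the benign path, and on the malicious path pulls each probability toward 0.5 (chance). The function names, the `fragility_weight` parameter, and the choice of loss terms are all hypothetical and may differ from what the paper uses.

```python
import math


def bce(p, target, eps=1e-7):
    """Binary cross-entropy of predicted probability p against a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def semi_fragile_loss(bits, p_benign, p_malicious, fragility_weight=1.0):
    """Assumed objective: recover bits after benign distortion, forget them after attack.

    bits        -- embedded message bits (0 or 1)
    p_benign    -- decoder probabilities after the benign distortion path
    p_malicious -- decoder probabilities after the malicious distortion path
    """
    n = len(bits)
    # Robustness term: decoded bits should match the message on the benign path.
    robust = sum(bce(p, b) for p, b in zip(p_benign, bits)) / n
    # Fragility term: cross-entropy against a 0.5 target, shifted by log(2) so it
    # is zero exactly when every malicious-path output sits at chance.
    fragile = sum(bce(p, 0.5) for p in p_malicious) / n - math.log(2.0)
    return robust + fragility_weight * fragile


# Toy comparison: the message survives the benign path in both cases, but only
# the first decoder has lost it on the malicious path.
ideal = semi_fragile_loss([1, 0], [0.99, 0.01], [0.5, 0.5])
leaky = semi_fragile_loss([1, 0], [0.99, 0.01], [0.99, 0.01])
```

In this sketch, a decoder that still recovers the message after a malicious transformation (`leaky`) is penalized more than one whose output has collapsed to chance (`ideal`), which mirrors the robust-versus-fragile split described above.
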
Datasets
LibriSpeech dataset (train-clean-100, test-clean), custom Deepfake Benchmark (test set B)
Model(s)
StreamMark (Encoder-Distortion-Decoder architecture, 2D convolutional networks with skip gated blocks)
Author countries
Switzerland