Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Authors: Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

Published: 2026-05-04 04:54:29+00:00

AI Summary

This paper addresses the challenge of detecting and localizing multiple short, independently inpainted speech segments within an utterance, a scenario where existing audio deepfake detectors largely fail. The authors introduce MIST, a large-scale multilingual dataset for multi-region word-level tampering, and propose ISA, an iterative coarse-to-fine framework for localizing these tampered regions. They also define SF1@τ, a novel segment-level F1 metric. In zero-shot settings ISA consistently outperforms non-iterative baselines, but the results show that partial inpainting remains an unsolved problem.

Abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
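
To make the idea of a segment-level F1 based on temporal IoU matching concrete, the following is a minimal Python sketch, not the authors' released toolkit. It assumes greedy one-to-one matching in descending IoU order and scores an utterance with no predicted and no true segments as 1.0; the abstract specifies neither choice, and the names `temporal_iou` and `segment_f1` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1 where a prediction counts as a hit if IoU >= tau.

    Assumption: matches are one-to-one and chosen greedily, best IoU first.
    """
    if not pred and not truth:
        return 1.0  # assumed convention for clean utterances
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    true_positives = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because unmatched predictions lower precision and unmatched true segments lower recall, a metric of this form penalizes both a wrong region count and poorly placed boundaries, which is the joint behavior the abstract describes.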


Key findings
Zero-shot evaluation reveals that existing utterance-level deepfake detectors are largely ineffective against fine-grained partial speech inpainting, assigning near-zero fake probability to MIST utterances. The proposed ISA framework consistently outperforms non-iterative baselines, demonstrating the benefit of its iterative refinement and gap-tolerant merging. Fine-tuning the backbone classifier on MIST data raises SF1@0.5 from 1.2% to 31.4%, which underlines the need for task-specific training data on this problem.
Approach
The proposed Iterative Segment Analysis (ISA) framework is a backbone-agnostic pipeline that localizes tampered regions in three stages. It first performs coarse-grained sliding-window classification, then converts the resulting confidence map into candidate regions via thresholding and gap-tolerant merging, and finally refines the boundaries of these regions at a finer temporal resolution. This allows detection and localization of an unknown number of tampered segments.
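
The three stages can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for any backbone classifier that returns a fake probability for a time span, and the window sizes, hop sizes, threshold, gap tolerance, and the "shrink to the first and last flagged fine window" refinement rule are all assumed for illustration.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]              # (start_sec, end_sec)
ScoreFn = Callable[[float, float], float]  # hypothetical: fake probability for a span


def window_starts(start: float, end: float, win: float, hop: float) -> List[float]:
    """Start times of all full windows of length `win` inside [start, end]."""
    if end - start < win:
        return []
    n = int((end - start - win) / hop + 1e-9) + 1
    return [start + k * hop for k in range(n)]


def propose_regions(windows: List[Segment], scores: List[float],
                    threshold: float, max_gap: float) -> List[Segment]:
    """Stage 2: threshold the confidence map, then merge flagged windows
    that overlap or are separated by at most `max_gap` seconds."""
    regions: List[Segment] = []
    for (s, e), score in zip(windows, scores):
        if score < threshold:
            continue
        if regions and s - regions[-1][1] <= max_gap:
            regions[-1] = (regions[-1][0], max(regions[-1][1], e))
        else:
            regions.append((s, e))
    return regions


def refine_region(region: Segment, score_fn: ScoreFn, fine_win: float,
                  fine_hop: float, threshold: float) -> Segment:
    """Stage 3 (assumed rule): re-score the region with shorter windows and
    shrink it to the span covered by the flagged fine windows."""
    flagged = [(t, t + fine_win)
               for t in window_starts(region[0], region[1], fine_win, fine_hop)
               if score_fn(t, t + fine_win) >= threshold]
    if not flagged:
        return region  # keep the coarse proposal if refinement finds nothing
    return (flagged[0][0], flagged[-1][1])


def localize(duration: float, score_fn: ScoreFn, win: float = 0.5,
             hop: float = 0.25, fine_win: float = 0.25, fine_hop: float = 0.125,
             threshold: float = 0.5, max_gap: float = 0.25) -> List[Segment]:
    """Stage 1 (coarse scoring) followed by proposal and refinement."""
    windows = [(t, t + win) for t in window_starts(0.0, duration, win, hop)]
    scores = [score_fn(s, e) for s, e in windows]
    regions = propose_regions(windows, scores, threshold, max_gap)
    return [refine_region(r, score_fn, fine_win, fine_hop, threshold)
            for r in regions]
```

In this sketch the number of returned segments is simply the number of merged regions, so no segment count has to be supplied in advance. The gap tolerance keeps a single low-scoring window from splitting one tampered word into two proposals.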
Datasets
MIST (Multi-region Inpainting Speech Tampering), Multilingual LibriSpeech (MLS), LEMAS-Dataset.
Model(s)
Wav2Vec2-AASIST, WavLM-AASIST, Wav2Vec2-Linear, and a pre-trained Wav2Vec 2.0-base deepfake classifier.
Author countries
Vietnam