Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning
Authors: Yuchen Mao, Wen Huang, Yanmin Qian
Published: 2026-01-29 16:17:13+00:00
AI Summary
This paper introduces Segment-Aware Learning (SAL), a novel framework for localizing partial speech deepfakes by focusing on the intrinsic characteristics of entire manipulated segments rather than just transition artifacts. SAL employs Segment Positional Labeling for fine-grained frame supervision and Cross-Segment Mixing for robust data augmentation. The proposed method achieves state-of-the-art performance across various datasets, demonstrating improved generalization and reduced reliance on boundary cues.
Abstract
Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over-rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding the entire segments beyond just detecting transitions. Thus, we propose Segment-Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine-grained frame supervision based on relative position within a segment; and Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.
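The abstract gives only a conceptual description of the two techniques, so the following is an illustrative sketch, not the authors' implementation. It assumes frame-level binary masks (1 = manipulated frame, 0 = genuine) and one plausible reading of each idea: Segment Positional Labeling as labels that encode a frame's relative position within its fake segment (ramping from near 0 at the segment start to 1 at its end), and Cross-Segment Mixing as splicing a fake segment from one utterance into another to create new segment patterns. All function names, the exact label shape, and the splice-based mixing strategy are assumptions.

```python
import numpy as np

def fake_segments(mask):
    """Return (start, end) index pairs for contiguous runs of 1s in a
    frame-level mask (end is exclusive)."""
    segs, start = [], None
    for i, v in enumerate(mask):
        if v == 1 and start is None:
            start = i
        elif v == 0 and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(mask)))
    return segs

def segment_positional_labels(mask):
    """One possible form of Segment Positional Labeling: instead of a
    uniform 'fake' label, each fake frame gets its relative position
    within its segment, so supervision varies across the segment
    interior rather than only at transitions."""
    labels = np.zeros(len(mask), dtype=float)
    for s, e in fake_segments(mask):
        labels[s:e] = np.arange(1, e - s + 1) / (e - s)
    return labels

def cross_segment_mixing(feats_a, mask_a, feats_b, mask_b, rng):
    """One possible form of Cross-Segment Mixing: splice a randomly
    chosen fake segment from utterance B into a random position of
    utterance A, producing a new manipulated pattern plus its mask."""
    segs_b = fake_segments(mask_b)
    if not segs_b:
        return feats_a.copy(), mask_a.copy()
    s, e = segs_b[rng.integers(len(segs_b))]
    pos = rng.integers(len(mask_a) + 1)
    mixed = np.concatenate([feats_a[:pos], feats_b[s:e], feats_a[pos:]])
    mask = np.concatenate([mask_a[:pos], mask_b[s:e], mask_a[pos:]])
    return mixed, mask
```

Under this reading, a model trained on the ramped labels is pushed to discriminate segment interiors, not just the real/fake boundaries, and the mixing step diversifies where and how long fake segments appear.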