Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

Authors: Yuchen Mao, Wen Huang, Yanmin Qian

Published: 2026-01-29 16:17:13+00:00

AI Summary

This paper introduces Segment-Aware Learning (SAL), a novel framework for localizing partial speech deepfakes by focusing on the intrinsic characteristics of entire manipulated segments rather than just transition artifacts. SAL employs Segment Positional Labeling for fine-grained frame supervision and Cross-Segment Mixing for robust data augmentation. The proposed method achieves state-of-the-art performance across various datasets, demonstrating improved generalization and reduced reliance on boundary cues.

Abstract

Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over-rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding the entire segments beyond just detecting transitions. Thus, we propose Segment-Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine-grained frame supervision based on relative position within a segment; and Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.

Key findings

SAL consistently achieves strong performance on deepfake localization across multiple datasets (PS, HAD, LPS) in both in-domain and out-of-domain settings, notably outperforming existing transition-focused methods. It establishes new state-of-the-art results on the HAD dataset and shows superior generalization on the challenging LPS dataset. The approach significantly improves detection accuracy in non-boundary regions and effectively mitigates shortcut learning that over-relies on transition artifacts.
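One way to make the "non-boundary regions" claim concrete is to score frames near a real/fake transition separately from all other frames. The sketch below is illustrative only: the function names, the `margin` parameter, and the use of plain frame accuracy are assumptions, not the paper's evaluation protocol.

```python
def boundary_mask(labels, margin=2):
    """Mark frames lying within `margin` frames of a label transition.

    `margin` is a hypothetical parameter; this summary does not state how
    the paper defines a boundary region.
    """
    n = len(labels)
    mask = [False] * n
    for t in range(1, n):
        if labels[t] != labels[t - 1]:
            # Flag `margin` frames on each side of the transition point.
            for k in range(max(0, t - margin), min(n, t + margin)):
                mask[k] = True
    return mask


def region_accuracy(labels, preds, margin=2):
    """Frame accuracy computed separately for boundary and non-boundary frames."""
    mask = boundary_mask(labels, margin)

    def acc(flag):
        idx = [i for i, m in enumerate(mask) if m == flag]
        if not idx:
            return None  # no frames fall in this region
        return sum(labels[i] == preds[i] for i in idx) / len(idx)

    return {"boundary": acc(True), "non_boundary": acc(False)}


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
    preds = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # errors only inside the fake segment
    print(region_accuracy(labels, preds, margin=1))
```

A model that leans on transition artifacts would typically score well in the `boundary` bucket and worse in the `non_boundary` bucket, which is the pattern the summary says SAL reduces.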

Approach

The authors propose Segment-Aware Learning (SAL), which encourages models to learn from entire manipulated segments rather than only the transitions between real and fake audio. It relies on two core techniques: Segment Positional Labeling (SPL), which provides fine-grained frame supervision based on each frame's relative position within a segment, and Cross-Segment Mixing (CSM), a data augmentation method that generates diverse segment patterns by splicing utterances.
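The two techniques can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify the exact label scheme or splicing rule, so the normalized position in [0, 1], the single random cut per utterance, and the names `segment_positions` and `cross_segment_mix` are all hypothetical. Features are treated as per-frame lists already aligned with the labels.

```python
import random
from itertools import groupby


def segment_positions(frame_labels):
    """SPL-style labels (assumed form): pair each frame's class with its
    relative position inside its contiguous segment (0.0 at the first frame,
    1.0 at the last; a single-frame segment gets 0.0)."""
    out = []
    for label, run in groupby(frame_labels):
        n = len(list(run))
        for i in range(n):
            pos = i / (n - 1) if n > 1 else 0.0
            out.append((label, pos))
    return out


def cross_segment_mix(feats_a, labels_a, feats_b, labels_b, rng=random):
    """CSM-style augmentation (assumed form): splice a prefix of utterance A
    onto a suffix of utterance B, carrying the frame labels along.
    Both utterances must contain at least two frames."""
    cut_a = rng.randint(1, len(labels_a) - 1)  # keep a non-empty prefix of A
    cut_b = rng.randint(1, len(labels_b) - 1)  # keep a non-empty suffix of B
    feats = feats_a[:cut_a] + feats_b[cut_b:]
    labels = labels_a[:cut_a] + labels_b[cut_b:]
    return feats, labels


if __name__ == "__main__":
    print(segment_positions([0, 0, 0, 1, 1, 0]))
    rng = random.Random(0)
    feats, labels = cross_segment_mix(
        ["a0", "a1", "a2", "a3"], [0, 0, 1, 1],
        ["b0", "b1", "b2"], [1, 0, 0],
        rng=rng,
    )
    print(feats, labels)
```

In a real pipeline the splice would more likely be applied to waveforms, with labels re-derived at the model's frame rate. After mixing, the positional labels would be recomputed on the spliced label sequence so that they reflect the new segment layout.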

Datasets

PartialSpoof (PS), Half-truth Audio Detection (HAD), LlamaPartialSpoof (LPS)

Model(s)

Pre-trained self-supervised learning (SSL) models (Wav2Vec2-XLSR, WavLM-Large) serve as front-end feature extractors, optionally followed by a lightweight Conformer module, with MLP layers producing the final predictions.

Author countries

China
