Split and Conquer Partial Deepfake Speech

Authors: Inbal Rimon, Oren Gal, Haim Permuter

Published: 2026-04-03 09:33:01+00:00

AI Summary

This paper introduces a split-and-conquer framework for partial deepfake speech detection, which addresses the challenge of identifying manipulated regions within otherwise bona fide utterances. The approach decomposes the problem into two stages: boundary detection to identify temporal transition points, followed by segment-level classification to determine the authenticity of each resulting segment. This design simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, leading to state-of-the-art performance on the PartialSpoof and Half-Truth benchmarks.

Abstract

Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
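To make the reflection-based multi-length training idea concrete, the sketch below shows one plausible reading of it in NumPy: a variable-duration segment is mapped to several fixed input lengths by mirror-extending short segments and cropping long ones. The function names, the centre-cropping rule for over-long segments, and the target durations are illustrative assumptions; the paper only states that variable-duration segments are converted into several fixed input lengths via reflection.

```python
import numpy as np

def reflect_to_length(segment: np.ndarray, target_len: int) -> np.ndarray:
    """Map a variable-length segment to a fixed number of samples.

    Shorter segments are extended by repeatedly reflecting the waveform
    (mirror padding); longer segments are centre-cropped. The cropping
    rule is an assumption made for this sketch.
    """
    n = len(segment)
    if n >= target_len:
        start = (n - target_len) // 2
        return segment[start:start + target_len]
    # Build [seg, reversed seg, seg, ...] until long enough, then truncate.
    pieces, flip = [], False
    while sum(len(p) for p in pieces) < target_len:
        pieces.append(segment[::-1] if flip else segment)
        flip = not flip
    return np.concatenate(pieces)[:target_len]

def multi_length_views(segment: np.ndarray, sr: int = 16000,
                       lengths_sec=(0.5, 1.0, 2.0)) -> list:
    """Produce several fixed-length views of one segment.

    The specific target durations here are illustrative, not taken
    from the paper.
    """
    return [reflect_to_length(segment, int(sr * d)) for d in lengths_sec]
```

Each fixed-length view can then be passed through the feature extractor, yielding the diverse feature-space representations that the abstract describes.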


Key findings
The split-and-conquer framework achieves state-of-the-art performance on the PartialSpoof and Half-Truth datasets, significantly improving detection and localization of spoofed regions. Explicitly separating boundary detection and segment classification, along with feature-space augmentation via reflection-based multi-length training and fusion, enhances robustness and accuracy. The method consistently outperforms prior approaches across various temporal resolutions and strict overlap criteria, demonstrating superior overall detection quality and stability.
Approach
The method employs a two-stage split-and-conquer framework. First, a dedicated boundary detector identifies temporal transition points, segmenting the audio into acoustically consistent units. Second, each resulting segment is independently classified as bona fide or fake speech. A reflection-based multi-length training strategy and fusion of predictions across different configurations are used to improve robustness.
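The following is a minimal sketch of how the two-stage inference flow could be wired together, assuming a boundary detector that returns transition times in seconds and a list of segment classifiers whose spoof scores are fused by averaging. These interfaces and the averaging rule are assumptions for illustration; the paper does not prescribe exact signatures.

```python
import numpy as np

def detect_and_localize(waveform: np.ndarray,
                        boundary_detector,
                        segment_classifiers,
                        sr: int = 16000):
    """Sketch of split-and-conquer inference: split at detected
    boundaries, classify each segment, fuse complementary scores."""
    # Stage 1: boundary detection -> temporal transition points.
    boundaries = sorted(boundary_detector(waveform))
    edges = [0.0] + list(boundaries) + [len(waveform) / sr]

    # Stage 2: classify each acoustically consistent segment independently.
    results = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = waveform[int(start * sr):int(end * sr)]
        # Fuse predictions from differently configured models (simple mean).
        scores = [clf(segment) for clf in segment_classifiers]
        results.append({"start": start, "end": end,
                        "spoof_score": float(np.mean(scores))})
    return results
```

Per-segment scores produced this way can be mapped back to frame-level labels for localization or pooled into an utterance-level decision.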
Datasets
PartialSpoof, Half-Truth (HAD)
Model(s)
ResNet34, wav2vec 2.0 XLSR (XLSR53, XLSR128), log-magnitude spectrogram front end
Author countries
Israel