Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?

Authors: Xin Wang, Wanying Ge, Junichi Yamagishi

Published: 2026-03-03 12:13:53+00:00

Comment: Submitted to Interspeech 2026; posted on arXiv in accordance with the conference's open-access rule; quote from Interspeech: "Interspeech no longer enforces an anonymity period for submissions. While uploading a version online is permitted, your official submission to Interspeech must not contain any author-identifying information"

AI Summary

This paper investigates the effectiveness of reinforcement learning, specifically Group Relative Policy Optimization (GRPO), for fine-tuning speech deepfake detection models to improve generalization to unseen attacks. Unlike conventional supervised fine-tuning (SFT), GRPO-based fine-tuning enhances performance on out-of-domain test sets while maintaining in-domain performance. Ablation studies suggest that the negative reward in GRPO is a crucial factor for this improvement.

Abstract

Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.


Key findings
Pure GRPO-based fine-tuning significantly improved detection performance on out-of-domain test sets while preserving in-domain performance, outperforming SFT-only and hybrid SFT→GRPO setups. The negative reward component of GRPO was identified as a key contributor to this generalization improvement. GRPO was found to be effective primarily when applied to post-trained models, rather than directly to pre-trained SSL front-ends.
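The role of the negative reward can be illustrated with a minimal sketch. The paper does not give exact reward values, so the +1/-1 scheme and the ablation variant below are assumptions chosen only to show the contrast between keeping and removing the penalty term:

```python
# Hypothetical binary reward for GRPO fine-tuning of a deepfake detector.
# The exact values are assumptions: +1 for a correct label, -1 for an
# incorrect one.
def reward(predicted_label: int, true_label: int) -> float:
    return 1.0 if predicted_label == true_label else -1.0

# Ablation variant: clipping rewards at zero removes the negative term
# that the paper's ablation study identifies as a key factor.
def reward_no_negative(predicted_label: int, true_label: int) -> float:
    return max(0.0, reward(predicted_label, true_label))
```

With the clipped variant, incorrect predictions contribute no gradient signal at all, which is one plausible reading of why removing the negative reward hurts generalization.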
Approach
The authors apply Group Relative Policy Optimization (GRPO) to fine-tune pre-trained and post-trained SSL-based speech deepfake detection models. This approach contrasts with the standard Supervised Fine-Tuning (SFT) paradigm and utilizes a reward function and group-normalized advantage to optimize model parameters, aiming to improve generalization across different domains.
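The group-normalized advantage at the core of GRPO can be sketched as follows; this is the standard GRPO formulation (rewards of a sampled group normalized by the group's mean and standard deviation), not code from the paper, and the epsilon value is an assumption:

```python
import math

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each sampled output's reward, normalized by
    the mean and standard deviation of its group of G samples."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1, -1, 1, -1]` yields advantages of roughly `[1, -1, 1, -1]`: correct samples are pushed up relative to the group, incorrect ones pushed down, with no learned value function required.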
Datasets
Deepfake-Eval-2024 (DFE24) for fine-tuning and in-domain evaluation. Out-of-domain evaluation datasets include the Audio Deepfake Detection Challenge Track 1.2 evaluation set (ADD23), the Fake-or-Real (FoR) test set, the segmented DEEP-VOICE dataset (DV), and the In-the-Wild dataset (ItW). The ASVspoof 2019 development set was used as a reference for drift analysis.
Model(s)
SSL-based front-ends including XLS-R-2B, MMS-1B, and MMS-300M, coupled with a linear layer and softmax for binary classification.
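The detection head described above is a single linear layer followed by a softmax over two classes. A minimal numerical sketch, with the SSL front-end replaced by a random embedding and all dimensions assumed for illustration:

```python
import numpy as np

# Stand-in for pooled SSL features; real front-ends such as XLS-R-2B or
# MMS-1B produce much higher-dimensional embeddings.
rng = np.random.default_rng(0)
emb_dim = 16
embedding = rng.standard_normal(emb_dim)

# Linear layer mapping the embedding to two logits (bona fide vs. spoof).
W = rng.standard_normal((2, emb_dim)) * 0.01
b = np.zeros(2)
logits = W @ embedding + b

# Numerically stable softmax over the two classes.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

The resulting `probs` vector sums to one and gives the model's posterior over the two classes, which is what both the SFT cross-entropy loss and the GRPO reward are computed from.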
Author countries
Japan