Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
Authors: Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou, Tao Wang, Jian Liu, Ruibo Fu, Xiaopeng Wang, Haonan Cheng, Long Ye
Published: 2026-01-06 12:50:02+00:00
AI Summary
This paper introduces Frequency Time-Group Relative Policy Optimization (FT-GRPO), a method for interpretable, all-type audio deepfake detection (ADD) with audio large language models (ALLMs). An automatic annotation pipeline constructs Frequency-Time (FT) structured chain-of-thought (CoT) rationales, which cold-start the ALLM via supervised fine-tuning (SFT); this is followed by Group Relative Policy Optimization (GRPO) constrained by rule-based FT reasoning rewards. The approach achieves state-of-the-art results and provides FT-grounded explanations across diverse audio domains such as speech, music, and environmental sounds.
Abstract
Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on this CoT data, we propose Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.
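To make the "GRPO under rule-based frequency-time constraints" idea concrete, here is a minimal sketch of how such a reward and the group-relative advantage might look. It is not the paper's implementation: the tag format, regexes, and reward weights (`ft_rule_reward`, `FT_TAG`, the 1.0/0.5 values) are illustrative assumptions, and only the within-group reward standardization is standard GRPO.

```python
import re
import statistics

# Hypothetical output format: a <think>...</think> rationale followed by
# an <answer>real|fake</answer> verdict. All patterns and weights below
# are assumptions for illustration, not the paper's actual rules.
FT_TAG = re.compile(r"<think>.*(frequency|spectral).*(time|temporal).*</think>",
                    re.S | re.I)
ANSWER = re.compile(r"<answer>(real|fake)</answer>", re.I)

def ft_rule_reward(completion: str, label: str) -> float:
    """Score one sampled rationale: verdict correctness plus FT-structure checks."""
    reward = 0.0
    m = ANSWER.search(completion)
    if m and m.group(1).lower() == label:
        reward += 1.0  # correct real/fake verdict (assumed weight)
    if FT_TAG.search(completion):
        reward += 0.5  # rationale grounds both frequency and time axes (assumed weight)
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: four rollouts for one audio clip labeled "fake".
completions = [
    "<think>band-limited frequency artifacts; unstable temporal prosody</think><answer>fake</answer>",
    "<think>sounds natural</think><answer>real</answer>",
    "<think>high-frequency roll-off drifting over time</think><answer>fake</answer>",
    "<answer>fake</answer>",
]
rewards = [ft_rule_reward(c, "fake") for c in completions]
print(group_relative_advantages(rewards))
```

Under this reading, a completion that guesses the right label without an FT-grounded rationale earns less reward than one that cites both spectral and temporal evidence, which is how a rule-based FT constraint can discourage the reward hacking and ungrounded rationales the abstract attributes to vanilla RFT.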