Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning

Authors: Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov, Artem Iudin, Oleg Kiriukhin, Mikhail Pautov, Dmitrii Korzh, Oleg Y. Rogov

Published: 2026-03-11 12:59:12+00:00

AI Summary

This paper introduces HIR-SDD, a novel speech deepfake detection (SDD) framework designed to enhance generalization and interpretability. It integrates Large Audio Language Models (LALMs) with chain-of-thought reasoning, derived from a new human-annotated dataset. Experimental evaluations confirm the method's effectiveness in detection and its ability to provide human-perceptible justifications for predictions.

Abstract

Modern generative audio models can be used by an adversary in unlawful ways, most notably to impersonate other people and gain access to private information. To mitigate this threat, speech deepfake detection (SDD) methods have begun to evolve. Unfortunately, current SDD methods generally fail to generalize to new audio domains and generators. Moreover, they lack interpretability, in particular the human-like reasoning that would naturally explain why a given audio is attributed to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for its predictions.


Key findings
SALMONN-7B significantly outperformed Wav2Vec2-AASIST in binary classification metrics. The HIR-SDD framework effectively provides both competitive detection performance and human-interpretable, audio-grounded reasoning traces. While grounding and GRPO improved reasoning quality and diversity, the models still face challenges in generalizing to unseen high-fidelity synthesis systems.
Approach
The HIR-SDD framework combines Large Audio Language Models (LALMs) with chain-of-thought (CoT) reasoning. It applies supervised fine-tuning (SFT) on a newly created human-annotated dataset, followed by audio grounding and Group Relative Policy Optimization (GRPO) to improve reasoning quality and keep explanations aligned with acoustic evidence. This enables the model to perform binary deepfake detection while providing human-interpretable justifications.
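The summary does not include implementation details. As a rough illustration of the GRPO stage only, the core mechanic is to sample a group of reasoning traces per audio clip, score each with a reward, and normalize rewards within the group to obtain advantages; the reward terms below (a format check and label correctness) are hypothetical stand-ins, not the paper's actual reward design:

```python
import statistics

def trace_reward(trace: str, predicted: str, gold: str) -> float:
    """Toy reward: 0.5 for a well-formed trace containing an <answer> tag
    (hypothetical format reward), plus 0.5 for a correct bona fide/spoof label."""
    reward = 0.0
    if "<answer>" in trace and "</answer>" in trace:
        reward += 0.5
    if predicted == gold:
        reward += 0.5
    return reward

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled completion's reward is
    normalized by the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four sampled traces for one clip, two correct and well-formed.
rewards = [
    trace_reward("<answer>spoof</answer>", "spoof", "spoof"),   # 1.0
    trace_reward("it is real", "bona fide", "spoof"),           # 0.0
    trace_reward("<answer>spoof</answer>", "spoof", "spoof"),   # 1.0
    trace_reward("no tags here", "bona fide", "spoof"),         # 0.0
]
advantages = grpo_advantages(rewards)
```

Traces with above-average rewards receive positive advantages and are reinforced; the group-relative normalization removes the need for a learned value baseline, which is the defining feature of GRPO.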
Datasets
ASVspoof 5, PyAra, LibriSecVoc, MLAAD, DFADD, M-AILABS, Golos, SOVA, Russian LibriSpeech (RuLS), SpeechEval, and a newly created human-annotated dataset for CoT training and evaluation.
Model(s)
SALMONN-7B (combining Whisper and BEATs audio encoders with a Q-Former adapter and a Vicuna-7B/13B LLM) and Wav2Vec2-AASIST (as a baseline for comparison). Qwen-32B and Qwen2.5-32B were used for post-processing and reward evaluation, respectively.
Author countries
Russia