Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought

Authors: Runkun Chen, Yixiong Fang, Pengyu Chang, Yuante Li, Massa Baali, Bhiksha Raj

Published: 2026-03-30 04:27:59+00:00

AI Summary

This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model for deepfake detection that integrates explicit acoustic chain-of-thought reasoning. It enhances detection accuracy and interpretability by injecting structured textual representations of low-level acoustic features directly into the model prompt. The model, built on a lightweight LLM, significantly outperforms existing audio language model baselines despite its smaller scale.

Abstract

Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio samples paired with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.


Key findings

CoLMbo-DF significantly outperforms existing audio language model baselines, achieving up to 0.987 deepfake detection accuracy on ASVspoof 2019 with chain-of-thought supervision. Explicit acoustic evidence and CoT supervision are crucial for stable learning and substantial gains, as removing them caused training to collapse. The model also showed improved generalization to modern TTS systems when fine-tuned with a small amount of in-domain data.

Approach

CoLMbo-DF is a Feature-Guided Audio Language Model using a projection-based architecture. It processes audio embeddings from a pretrained encoder and maps them into the input space of an instruction-tuned LLM, concurrently injecting structured textual acoustic features into the LLM prompt. This approach trains the model to generate explicit chain-of-thought reasoning for grounded deepfake detection.
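The "structured textual acoustic features" injected into the prompt can be pictured as a simple serialization step. The sketch below is illustrative only: the feature names, values, and prompt template are assumptions, not the paper's exact format.

```python
# Hedged sketch: serializing low-level acoustic measurements into structured
# text for the LLM prompt. Field names and wording are hypothetical.

def build_prompt(acoustic_features: dict) -> str:
    """Render acoustic measurements as bullet-style evidence lines."""
    evidence = "\n".join(f"- {name}: {value}"
                         for name, value in acoustic_features.items())
    return (
        "You are given an audio clip and its measured acoustic features:\n"
        f"{evidence}\n"
        "Reason step by step about prosodic, spectral, and physiological "
        "cues, then decide whether the speech is bonafide or spoofed."
    )

# Example values are made up for illustration.
prompt = build_prompt({
    "mean_f0_hz": 182.4,        # prosodic cue (fundamental frequency)
    "spectral_flatness": 0.31,  # spectral cue
    "jitter_percent": 1.8,      # physiological / voice-quality cue
})
```

Grounding the prompt in explicit evidence like this is what lets the generated chain-of-thought reference concrete acoustic attributes rather than opaque embeddings.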

Datasets

FAKEREASON (derived from ASVspoof 2019 and VoxCeleb2, augmented with deepfakes generated by Fish-Speech and CosyVoice2 TTS models)

Model(s)

WavLM-base-plus (audio encoder), 6-layer QFormer network (projector), Llama 3.2-1B-Instruct (LLM)
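The projector path above (WavLM frames → 6-layer Q-Former → LLM embedding space) can be sketched in PyTorch. This is a minimal stand-in using a transformer decoder with learnable queries, not the paper's implementation; the hidden sizes (768 for WavLM-base-plus, 2048 for Llama 3.2-1B) and the 32-query count are assumptions.

```python
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    """Illustrative Q-Former-style projector: learnable queries cross-attend
    to frame-level audio embeddings, then project into the LLM input space."""

    def __init__(self, audio_dim=768, llm_dim=2048, n_queries=32, n_layers=6):
        super().__init__()
        # Learnable query tokens that summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(n_queries, audio_dim))
        layer = nn.TransformerDecoderLayer(d_model=audio_dim, nhead=8,
                                           batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim) from the audio encoder.
        b = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused = self.qformer(tgt=q, memory=audio_feats)  # queries attend to audio
        return self.out_proj(fused)  # (batch, n_queries, llm_dim) soft tokens

# Toy forward pass with random features standing in for WavLM output.
proj = QFormerProjector()
soft_tokens = proj(torch.randn(2, 100, 768))
```

The resulting soft tokens would be concatenated with the embedded text prompt (including the serialized acoustic features) before being fed to the instruction-tuned LLM.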

Author countries

USA