Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
Published: 2026-01-02 18:17:22+00:00
AI Summary
This study explores the potential of utilizing Multimodal Large Language Models (MLLMs), specifically Qwen2-Audio and SALMONN, for audio deepfake detection by framing the task as an Audio Question-Answering problem. The methodology involves feeding audio inputs alongside structured text prompts to guide the model's binary decision-making, testing both zero-shot and fine-tuned performance. Results indicate that while MLLMs perform poorly in zero-shot mode, they achieve strong detection performance on in-domain data after minimal task-specific fine-tuning.
Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection by combining audio inputs with a range of text prompts as queries, in order to assess whether MLLMs can learn robust cross-modal representations for this task. To this end, we design text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward: the models perform poorly without task-specific training and struggle to generalise to out-of-domain data, but they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
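To make the question-answering framing concrete, the sketch below builds a chat-style input pairing an audio file with a feature-guided, binary-decision question, in the style accepted by instruction-tuned audio MLLMs. The prompt wording, the cue list, and the message schema are illustrative assumptions, not the paper's actual prompts:

```python
# Sketch: framing audio deepfake detection as a binary
# Audio Question-Answering task, as described in the abstract.
# Prompt text and message structure are illustrative assumptions.

def build_deepfake_qa_prompt(audio_path: str, cues: list[str]) -> list[dict]:
    """Pair an audio input with a feature-guided binary question.

    Returns a chat-style message list of the kind instruction-tuned
    audio MLLMs (e.g. Qwen2-Audio-7B-Instruct) typically consume.
    """
    question = (
        "Listen to the audio and consider the following cues: "
        + "; ".join(cues)
        + ". Is this audio real or fake? "
        "Answer with exactly one word: 'real' or 'fake'."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                {"type": "text", "text": question},
            ],
        },
    ]

# Example: a context-rich prompt guiding the model toward acoustic
# artefacts commonly associated with synthetic speech (hypothetical cues).
messages = build_deepfake_qa_prompt(
    "sample.wav",
    ["unnatural prosody", "robotic timbre", "splicing artifacts"],
)
```

In the paper's setup, such messages would be passed (with the decoded audio) to the MLLM, and the one-word answer parsed into a binary real/fake label; varying the cue text yields the multi-prompt approach evaluated in zero-shot and fine-tuned modes.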