Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
Published: 2026-01-02 18:17:22+00:00
AI Summary
This study explores the potential of utilizing Multimodal Large Language Models (MLLMs), specifically Qwen2-Audio and SALMONN, for audio deepfake detection by framing the task as an Audio Question-Answering problem. The methodology involves feeding audio inputs alongside structured text prompts to guide the model's binary decision-making, testing both zero-shot and fine-tuned performance. Results indicate that while MLLMs perform poorly in zero-shot mode, they achieve strong detection performance on in-domain data after minimal task-specific fine-tuning.
Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection by combining audio inputs with a range of text prompts as queries, in order to assess whether MLLMs can learn robust cross-modal representations for this task. To this end, we design text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward: the models perform poorly without task-specific training and struggle to generalise to out-of-domain data, but they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
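To make the question-answering framing concrete, the sketch below builds a chat-style input pairing an audio file with a feature-guided, binary-decision question, in the style accepted by instruction-tuned audio MLLMs. The prompt wording, the cue list, and the message schema are illustrative assumptions, not the paper's actual prompts:

```python
# Sketch: framing audio deepfake detection as a binary
# Audio Question-Answering task, as described in the abstract.
# Prompt text and message structure are illustrative assumptions.

def build_deepfake_qa_prompt(audio_path: str, cues: list[str]) -> list[dict]:
    """Pair an audio input with a feature-guided binary question.

    Returns a chat-style message list of the kind instruction-tuned
    audio MLLMs (e.g. Qwen2-Audio-7B-Instruct) typically consume.
    """
    question = (
        "Listen to the audio and consider the following cues: "
        + "; ".join(cues)
        + ". Is this audio real or fake? "
        "Answer with exactly one word: 'real' or 'fake'."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                {"type": "text", "text": question},
            ],
        },
    ]

# Example: a context-rich prompt guiding the model toward acoustic
# artefacts commonly associated with synthetic speech (hypothetical cues).
messages = build_deepfake_qa_prompt(
    "sample.wav",
    ["unnatural prosody", "robotic timbre", "splicing artifacts"],
)
```

In the paper's setup, such messages would be passed (with the decoded audio) to the MLLM, and the one-word answer parsed into a binary real/fake label; varying the cue text yields the multi-prompt approach evaluated in zero-shot and fine-tuned modes.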