Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Authors: Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

Published: 2026-02-25 04:39:08+00:00

Comment: 5 pages, ICASSP 2026

AI Summary

This paper introduces AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model for audio-visual deepfake detection (AVD). It reformulates AVD as a prompted yes/no classification task and is built upon Qwen 2.5 Omni. The model employs a two-stage training strategy involving LoRA alignment and full audio-visual encoder fine-tuning, achieving competitive or state-of-the-art results on the FakeAVCeleb and MAVOS-DD datasets.

Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification ("Is this video real or fake?"). Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and MAVOS-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the MAVOS-DD dataset.
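The LoRA alignment stage mentioned in the abstract rests on low-rank weight adapters. As a minimal sketch (not the paper's implementation), the core idea is that a frozen weight matrix W is augmented by a scaled low-rank product B·A, where the rank r is much smaller than the matrix dimensions; all shapes and the scaling convention below are illustrative:

```python
# Minimal pure-Python sketch of a LoRA-style low-rank weight update:
# the frozen weight W (d_out x d_in) gains a trainable delta B @ A,
# where A is (r x d_in), B is (d_out x r), and r is small.

def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[0.5, 0.5]]               # r x d_in
B = [[2.0], [0.0]]             # d_out x r
print(lora_update(W, A, B, alpha=1.0, r=1))  # [[2.0, 1.0], [0.0, 1.0]]
```

Because only A and B are trained, the adapter adds far fewer parameters than full fine-tuning, which is why the paper can use it as a lightweight first alignment stage before unfreezing the encoders.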


Key findings
AV-LMMDetect achieved comparable performance to the current state-of-the-art on FakeAVCeleb and set a new state of the art on the MAVOS-DD dataset, particularly excelling in challenging open-set generalization scenarios. The two-stage training strategy (LoRA alignment then full encoder fine-tuning) was crucial for optimal performance, demonstrating superior generalization and robust cross-modal detection capabilities compared to other methods.
Approach
The approach, AV-LMMDetect, casts audio-visual deepfake detection as a prompted binary question-answering task using a supervised fine-tuned large multimodal model. Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams. Training follows a two-stage regimen: initial lightweight LoRA alignment, then full fine-tuning of the audio-visual encoders.
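Casting detection as prompted question answering means the model emits free-form text that must be mapped back to a binary label. A hypothetical sketch of that wrapper follows; the question wording comes from the abstract, but the yes/no-to-label convention and the `parse_answer` helper are illustrative assumptions, not the paper's actual prompt or code:

```python
# Sketch of wrapping deepfake detection as a prompted yes/no task.
# The multimodal model call itself is omitted; shown here is only the
# prompt construction and the parsing of the generated answer.

PROMPT = "Is this video real or fake? Answer 'yes' if fake, 'no' if real."

def parse_answer(generated_text: str) -> int:
    """Map a free-form model answer to a binary label (1 = fake, 0 = real)."""
    text = generated_text.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    raise ValueError(f"Unparseable answer: {generated_text!r}")

print(parse_answer("Yes, the lip motion does not match the audio."))  # 1
print(parse_answer("No."))  # 0
```

Framing the task this way lets a general-purpose multimodal model be supervised with ordinary text targets ("yes"/"no") instead of a task-specific classification head.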
Datasets
FakeAVCeleb, MAVOS-DD
Model(s)
AV-LMMDetect (based on Qwen 2.5 Omni)
Author countries
China