Leveraging large multimodal models for audio-video deepfake detection: a pilot study
Authors: Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma
Published: 2026-02-25 04:39:08+00:00
Comment: 5 pages, ICASSP 2026
AI Summary
This paper introduces AV-LMMDetect, a large multimodal model adapted via supervised fine-tuning (SFT) for audio-visual deepfake detection (AVD). Built on Qwen 2.5 Omni, it reformulates AVD as a prompted yes/no classification task and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. The model achieves competitive or state-of-the-art results on the FakeAVCeleb and Mavos-DD datasets.
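To make the two-stage recipe concrete, below is a minimal training sketch using Hugging Face peft; the checkpoint name, LoRA targets, and encoder submodule names (audio_tower, visual) are illustrative assumptions, not the paper's released configuration.

    # Hypothetical two-stage SFT sketch for AV-LMMDetect on Qwen 2.5 Omni.
    from transformers import Qwen2_5OmniForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto"
    )

    # Stage 1: lightweight LoRA alignment -- only low-rank adapters on the
    # attention projections are trained; the backbone stays frozen.
    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    # ... SFT on (video, audio, prompt, yes/no answer) examples ...

    # Stage 2: fold the adapters back in, then unfreeze the audio and visual
    # encoders for full fine-tuning (submodule names are assumptions).
    model = model.merge_and_unload()
    for name, p in model.named_parameters():
        p.requires_grad = ("audio_tower" in name) or ("visual" in name)
    # ... continue SFT with the encoder parameters trainable ...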
Abstract
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?" Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
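To illustrate the prompted formulation, here is a minimal inference sketch following the public Qwen2.5-Omni usage in transformers and qwen-omni-utils; the prompt wording comes from the abstract, while the checkpoint, file name, and decoding settings are illustrative assumptions rather than the authors' code.

    # Hypothetical inference sketch: AVD cast as a prompted yes/no question.
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
    from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
    )
    processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

    conversation = [{
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},
            {"type": "text", "text": "Is this video real or fake?"},
        ],
    }]

    # use_audio_in_video=True feeds the clip's soundtrack alongside its frames,
    # so the audio and visual streams are analyzed jointly.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True,
                                         tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
    inputs = processor(text=text, audio=audios, images=images, videos=videos,
                       return_tensors="pt", use_audio_in_video=True).to(model.device)

    # return_audio=False keeps the output text-only (no speech synthesis).
    text_ids = model.generate(**inputs, use_audio_in_video=True,
                              return_audio=False, max_new_tokens=8)
    answer = processor.batch_decode(text_ids[:, inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)[0]
    print(answer)  # expected along the lines of "Fake."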