AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Authors: Shuhan Xia, Peipei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo, Zekun Li

Published: 2025-11-26 10:33:12+00:00

AI Summary

AVFakeBench is introduced as the first comprehensive audio-video forgery detection benchmark, featuring 12K questions covering seven multi-modal forgery types and four annotation levels across human and general subjects. The benchmark utilizes a hybrid forgery framework to ensure high quality and is used to evaluate 11 AV-LMMs and 2 expert detectors. Results demonstrate AV-LMMs' potential as unified detectors but expose significant limitations in fine-grained perception and explanatory reasoning.

Abstract

The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, and thus fail to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotation. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.


Key findings
AV-LMMs exhibit strong potential as unified forgery detectors, surpassing expert detection models on binary authenticity judgment across subjects. However, they suffer a significant performance drop (up to 70%) on more complex tasks such as forgery type classification, and they particularly struggle with subtle editing manipulations. This points to a weakness in fine-grained perception, compounded by poor performance on forgery detail selection and explanatory reasoning.
Approach
The authors introduce AVFakeBench, constructed using a multi-stage hybrid forgery framework that leverages proprietary models for forgery task planning and expert generative models (such as KLING, QingYing, and LipVoicer) for precise manipulation. The evaluation framework comprises four tasks: binary judgment, forgery type classification, forgery detail selection, and open-ended explanatory reasoning, designed to test the capabilities of Audio-Video Large Language Models (AV-LMMs).
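For the three multiple-choice tasks, evaluation of this kind typically reduces to per-task exact-match accuracy over model answers. A minimal sketch, assuming a hypothetical record format of `(task, prediction, ground_truth)` tuples; the task names mirror the four AVFakeBench tasks, but the scoring below is generic and not the paper's official protocol:

```python
from collections import defaultdict

# Task names follow the four AVFakeBench evaluation tasks.
TASKS = (
    "binary_judgment",
    "forgery_type_classification",
    "forgery_detail_selection",
    "explanatory_reasoning",
)

def per_task_accuracy(records):
    """Return {task: accuracy} using case-insensitive exact-match scoring."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        correct[task] += int(pred.strip().lower() == gold.strip().lower())
    return {t: correct[t] / total[t] for t in TASKS if total[t]}

# Toy usage with made-up answers:
records = [
    ("binary_judgment", "fake", "fake"),
    ("binary_judgment", "real", "fake"),
    ("forgery_type_classification", "audio edit", "video edit"),
]
print(per_task_accuracy(records))
# {'binary_judgment': 0.5, 'forgery_type_classification': 0.0}
```

The open-ended explanatory-reasoning task would need a different scorer (e.g. a judge model or reference-based metric) rather than exact match.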
Datasets
AVFakeBench (constructed from sources including DDL, DigiFakeAV, AVDeepFake1M, LAVDF, and VGGSound)
Model(s)
GPT-4o, Gemini series (Gemini-2.0/2.5 flash/lite/pro), PandaGPT, OneLLM, VideoLLaMA2, video-SALMONN, AVicuna, LipFD, AVH-Align
Author countries
China, USA