AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

Authors: Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang

Published: 2026-04-15 07:57:04+00:00

AI Summary

This paper introduces AVID, the first large-scale benchmark designed for audio-visual inconsistency understanding in long-form videos. AVID is built with a scalable construction pipeline that injects diverse audio-visual conflicts into 11.2K long videos and 78.7K segment clips, enabling comprehensive evaluation of detection, temporal grounding, classification, and reasoning. Evaluations reveal that state-of-the-art omni-modal models struggle significantly with these tasks, while the authors' fine-tuned baseline, AVID-Qwen, achieves substantial improvements.

Abstract

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8× higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.


Key findings
State-of-the-art omni-models exhibit significant limitations in understanding audio-visual inconsistencies, particularly in temporal grounding and reasoning tasks. The fine-tuned AVID-Qwen baseline substantially improves performance over existing models, achieving 2.8× higher BLEU-4 in segment reasoning than its base model, 36.1% mIoU in temporal grounding (vs 26.2% for the best compared model), and 7.47 SODA-m in holistic understanding (vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI.
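To make the temporal grounding metric concrete, below is a minimal sketch of mean temporal IoU (mIoU) over (start, end) intervals in seconds. The exact protocol AVID uses to pair predicted intervals with ground-truth events is not specified in this summary, so the one-to-one pairing in `mean_iou` is an assumption.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU over paired predicted/ground-truth intervals.
    Assumes a one-to-one pairing; AVID's matching protocol may differ."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# Example: a prediction overlapping a ground-truth inconsistency event.
print(temporal_iou((12.0, 30.0), (15.0, 32.0)))  # -> 0.75
```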
Approach
The AVID benchmark is constructed via a scalable three-stage pipeline: (1) temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; (2) an agent-driven strategy planner that selects semantically appropriate inconsistency categories for each segment; and (3) five specialized injectors that realize diverse audio-visual conflicts.
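The construction code is not included in this summary, so the following is a minimal structural sketch of the three stages under stated assumptions: every function body is a toy stand-in, and the inconsistency category names (e.g. narration_scene_conflict) are hypothetical, since the paper's 8 fine-grained categories are not enumerated here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float   # seconds
    end: float
    category: str  # "active_speaker" | "voiceover" | "scenic"

def segment_video(video_path: str) -> List[Segment]:
    """Stage 1: temporal segmentation into the three content categories.
    Toy stand-in; the paper's actual segmenter is not described here."""
    return [Segment(0.0, 12.4, "active_speaker"),
            Segment(12.4, 47.0, "voiceover"),
            Segment(47.0, 80.2, "scenic")]

def plan_strategy(seg: Segment) -> str:
    """Stage 2: agent-driven planner choosing a semantically appropriate
    inconsistency category for this segment (toy rule, not the paper's agent)."""
    return {"active_speaker": "speech_content_conflict",
            "voiceover": "narration_scene_conflict",
            "scenic": "ambient_sound_conflict"}[seg.category]

def inject(video_path: str, seg: Segment, strategy: str) -> dict:
    """Stage 3: dispatch to one of the five specialized injectors (stubbed)."""
    return {"clip": f"{video_path}[{seg.start:.1f}-{seg.end:.1f}]",
            "inconsistency": strategy}

def build_samples(video_path: str):
    """Run the three-stage pipeline over one source video."""
    for seg in segment_video(video_path):
        yield inject(video_path, seg, plan_strategy(seg))
```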
Datasets
AVID (11.2K long videos, 39.4K annotated inconsistency events, 78.7K segment clips). The videos were sourced from the LongVALE collection.
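For orientation, one annotation record for an inconsistency event might look like the sketch below. This schema is inferred from the four evaluation tasks (detection, temporal grounding, classification, reasoning) and is illustrative only; it is not AVID's released format, and the field and category names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InconsistencyEvent:
    """Hypothetical annotation schema inferred from the paper's four tasks;
    field names are illustrative, not AVID's released format."""
    video_id: str
    start: float   # grounding: event onset (seconds)
    end: float     # grounding: event offset (seconds)
    category: str  # classification: one of the 8 fine-grained categories
    reasoning: str # free-text explanation of the conflict

event = InconsistencyEvent(
    video_id="avid_000123",
    start=41.2, end=55.8,
    category="narration_scene_conflict",  # hypothetical category name
    reasoning="Narration describes a beach while the visuals show a snowy street.",
)
```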
Model(s)
Qwen3-Omni (as backbone for the fine-tuned AVID-Qwen baseline), Gemini 2.5 Pro, Gemini 3.1 Pro, MiMo-V2-Omni, OLA, Video-SALMONN 2 (7B, 72B), ARC-Hunyuan-Video.
Author countries
China