MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Authors: Mengxue Hu, Yunfeng Diao, Changtao Miao, Jianshu Li, Zhe Li, Joey Tianyi Zhou

Published: 2025-11-29 05:59:38+00:00

AI Summary

The paper introduces the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset designed for detecting general AI-generated multimodal video-audio content (AIGC). It addresses the critical gap left by existing datasets, which focus predominantly on the visual modality or on narrow facial deepfakes. MVAD offers high perceptual quality and diversity, simulating three realistic forgery patterns across diverse content categories and visual styles.

Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.


Key findings

MVAD comprises 215,758 multimodal samples with a 1:1 ratio of forged to authentic content across four modality combinations (F-F, F-R, R-F, R-R, where F denotes a fake and R a real modality). Comparative analysis with the VBench framework shows that MVAD consistently achieves superior sample quality on metrics such as aesthetic quality, motion smoothness, and temporal flickering, surpassing existing leading AI-generated video datasets. The dataset also exhibits high diversity, spanning realistic and anime visual domains and four content categories (humans, animals, objects, and scenes).
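As a concrete illustration of how the four combinations relate to the forged/authentic split, the minimal sketch below labels a sample by whether its video and audio tracks are generated. The `Sample` fields and helper names are hypothetical, not part of the MVAD release.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Hypothetical per-sample flags: whether each modality is AI-generated."""
    video_is_fake: bool
    audio_is_fake: bool

def modality_combination(s: Sample) -> str:
    """Map a sample to one of the four combinations: F-F, F-R, R-F, R-R."""
    video = "F" if s.video_is_fake else "R"
    audio = "F" if s.audio_is_fake else "R"
    return f"{video}-{audio}"

def is_forged(s: Sample) -> bool:
    """Only R-R samples are authentic; the other three combinations are forgeries."""
    return s.video_is_fake or s.audio_is_fake

# One sample per combination; MVAD keeps forged:authentic at 1:1 overall.
samples = [Sample(True, True), Sample(True, False), Sample(False, True), Sample(False, False)]
print([modality_combination(s) for s in samples])  # ['F-F', 'F-R', 'R-F', 'R-R']
print(sum(is_forged(s) for s in samples), "forged out of", len(samples))
```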
Approach

The MVAD dataset is constructed via a pipeline of data collection, multi-stage data generation, and three-tiered quality evaluation. Forgery samples simulate three realistic patterns (Fake-Fake, Fake-Real, Real-Fake) using more than twenty state-of-the-art generative models such as Sora 2 and Kling. Quality is ensured through automated VBench metrics, Large Multimodal Model (LMM) assessment, and rigorous human expert verification.
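The following minimal Python sketch illustrates the shape of such a three-tiered filter (automated metrics, then LMM scoring, then human review). All function names, metric keys, and threshold values are assumptions for illustration; the paper's actual evaluation criteria are not reproduced here.

```python
from typing import Dict

# Tier-1 thresholds are placeholders, not values from the paper.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "aesthetic_quality": 0.55,
    "motion_smoothness": 0.95,
    "temporal_flickering": 0.95,
}

def passes_vbench(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Tier 1: keep the sample only if every automated metric clears its threshold."""
    return all(metrics.get(name, 0.0) >= cutoff for name, cutoff in thresholds.items())

def passes_lmm(lmm_score: float, min_score: float = 0.5) -> bool:
    """Tier 2: a large multimodal model scores audio-visual consistency/plausibility."""
    return lmm_score >= min_score

def keep_sample(metrics: Dict[str, float], lmm_score: float, human_approved: bool) -> bool:
    """Tier 3 is human expert verification; a sample is retained only if all tiers pass."""
    return (passes_vbench(metrics, DEFAULT_THRESHOLDS)
            and passes_lmm(lmm_score)
            and human_approved)

# Example: a candidate clip that clears every tier.
candidate = {"aesthetic_quality": 0.62, "motion_smoothness": 0.97, "temporal_flickering": 0.96}
print(keep_sample(candidate, lmm_score=0.8, human_approved=True))  # True
```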
Datasets

MVAD (newly introduced); UGC-VideoCaptioner, HarmonySet, TalkVid, MSVD, OpenVid-1M, InternVid-10M, and MSR-VTT (for sourcing real data).
Model(s)

Sora 2 (OpenAI), Veo 3/3.1 (Google), Kling (1.6/2.1/2.5 Turbo), MoonValley, Pika, Gen-3, HuMo, Wan 2.1, MMAudio, AudioX, FoleyCrafter. Architectures mentioned include the Diffusion Transformer (DiT) and LDM, with LMMs (DeepSeek, GPT-4o) used for filtering.
Author countries

China, Singapore