How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

Authors: Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi

Published: 2024-06-27 23:03:58+00:00

Comment: ICCV 2025

AI Summary

This paper introduces Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing the 3D visual qualities and consistencies of AI-generated videos. L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. It quantifies simulation gaps and pinpoints unrealistic regions, demonstrating strong alignment with 3D reconstruction quality and human judgments.

Abstract

Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Kling, Sora, and MiniMax) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies. Project page: https://justin-crchang.github.io/l3de-project-page/


Key findings
L3DE reliably quantifies the 3D visual coherence of generative videos, showing strong correlation with both 3D reconstruction quality and human perception. Evaluations of leading generative models reveal persistent simulation gaps and subtle inconsistencies, particularly in motion and geometry, compared to real videos. Additionally, L3DE proves effective as a video deepfake detector and can guide video synthesis improvements by localizing artifact regions.
Approach
L3DE utilizes a 3D convolutional network, trained with a contrastive learning objective, to learn intrinsic differences between real and synthetic videos. It extracts monocular 3D cues—appearance (DINOv2 features), motion (RAFT optical flow), and geometry (UniDepth metric depth)—from foundation models as input. The model provides confidence scores for 3D visual coherence and gradient-based visualizations (Grad-CAM) to highlight inconsistent regions.
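The scoring pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' architecture: the channel counts, the single conv layer, and the `realness_score` helper are all assumptions, standing in for the paper's 3D CNN over stacked appearance (DINOv2), motion (RAFT), and geometry (UniDepth) cues.

```python
import numpy as np

def conv3d_valid(x, w):
    """Minimal valid-mode 3D convolution (cross-correlation).
    x: (C_in, T, H, W), w: (C_out, C_in, kt, kh, kw)."""
    c_out, c_in, kt, kh, kw = w.shape
    _, T, H, W = x.shape
    out = np.zeros((c_out, T - kt + 1, H - kh + 1, W - kw + 1))
    for o in range(c_out):
        for t in range(out.shape[1]):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    patch = x[:, t:t + kt, i:i + kh, j:j + kw]
                    out[o, t, i, j] = np.sum(patch * w[o])
    return out

def realness_score(appearance, flow, depth, weights, bias):
    """Stack the per-frame monocular cues along the channel axis, apply one
    3D conv layer with ReLU, then global average pooling and a sigmoid to
    get a confidence in (0, 1) that the clip is real (hypothetical head)."""
    cues = np.concatenate([appearance, flow, depth], axis=0)  # (C, T, H, W)
    feat = np.maximum(conv3d_valid(cues, weights), 0.0)       # ReLU
    logit = feat.mean() + bias                                # pool + bias
    return 1.0 / (1.0 + np.exp(-logit))                       # sigmoid

# Toy inputs standing in for extracted cues (shapes are illustrative).
rng = np.random.default_rng(0)
T, H, W = 6, 8, 8
appearance = rng.standard_normal((4, T, H, W))  # e.g. reduced DINOv2 features
flow = rng.standard_normal((2, T, H, W))        # e.g. RAFT (u, v) optical flow
depth = rng.standard_normal((1, T, H, W))       # e.g. UniDepth metric depth
weights = 0.01 * rng.standard_normal((8, 7, 3, 3, 3))
score = realness_score(appearance, flow, depth, weights, bias=0.0)
print(round(score, 3))
```

In the paper the network is trained contrastively on real versus synthetic clips, so low confidence flags a simulation gap; the Grad-CAM visualization would then backpropagate this score to localize the inconsistent spatio-temporal regions.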
Datasets
Real videos from Pexels; synthetic videos generated by Stable Video Diffusion (SVD) and Kling 1.5; videos from the static-scene datasets Mip-NeRF360 and Tanks-and-Temples; videos from the dynamic-scene datasets Hyper-NeRF and the Neural 3D Video Synthesis Dataset; and synthetic videos from Runway Gen-3, MiniMax, Vidu, Luma Dream Machine 1.6, CogVideoX-5B, Sora, and Kling 2.1. The EvalCrafter Text-to-Video (ECTV) dataset was also used for comparisons.
Model(s)
L3DE: a 3D convolutional network trained with a contrastive objective, taking monocular 3D cues extracted by foundation models: DINOv2 (appearance), RAFT (optical flow), and UniDepth (metric depth).
Author countries
Hong Kong, China