Beyond Deepfake Images: Detecting AI-Generated Videos

Authors: Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, Matthew C. Stamm

Published: 2024-04-24 16:19:31+00:00

Comment: To be published in CVPRW24

AI Summary

This paper demonstrates that existing AI-generated image detectors are ineffective against synthetic videos due to distinct forensic traces introduced by video generators. However, it shows that these unique video traces can be effectively learned using existing CNN architectures for robust synthetic video detection and generator attribution, even after H.264 re-compression. Furthermore, the approach enables accurate detection of new generators through few-shot learning.

Abstract

Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.


Key findings
Existing synthetic image detectors perform poorly on AI-generated videos because video generators leave forensic traces that differ substantially from those left by image generators. These video traces can nevertheless be learned by existing CNN architectures for accurate detection and source attribution, with strong performance (e.g., MISLnet achieved 0.983 AUC) that holds up under H.264 re-compression. Zero-shot detection of videos from new generators is poor, but few-shot learning restores high accuracy (AUC > 0.98) using only a small amount of data from the new generator.
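For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how AUC can be computed from detector scores, assuming label 1 means synthetic and 0 means real. The function name `roc_auc` and the toy scores are illustrative assumptions, not values or code from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy detector outputs (made up for illustration): 3 synthetic, 2 real videos.
scores = [0.9, 0.8, 0.4, 0.6, 0.3]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)  # 5 of the 6 synthetic/real pairs are ranked correctly
```

An AUC of 1.0 means every synthetic video scores above every real one, while 0.5 corresponds to chance-level ranking.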
Approach
The authors train existing Convolutional Neural Networks (CNNs) on a new dataset of real and synthetic videos to learn the forensic traces left by video generators. They use robust training to withstand H.264 re-compression and perform video-level detection by aggregating patch-level embeddings. Few-shot learning is used to adapt the detector to novel, previously unseen video generators.
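This summary does not say how the patch-level embeddings are combined. As a minimal sketch, assuming mean pooling followed by a linear classifier with a sigmoid, video-level aggregation might look like the following. The function names, weights, and pooling choice are illustrative assumptions, not the authors' implementation.

```python
import math


def mean_pool(patch_embeddings):
    """Average a list of equal-length patch embeddings into one video embedding."""
    if not patch_embeddings:
        raise ValueError("need at least one patch embedding")
    dim = len(patch_embeddings[0])
    count = len(patch_embeddings)
    return [sum(e[i] for e in patch_embeddings) / count for i in range(dim)]


def video_score(patch_embeddings, weights, bias):
    """Hypothetical linear head: sigmoid(w . pooled + b), read as P(synthetic)."""
    pooled = mean_pool(patch_embeddings)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy values (made up for illustration): three patches with 2-D embeddings.
patches = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
pooled = mean_pool(patches)
score = video_score(patches, weights=[1.0, -1.0], bias=-1.0)
```

Pooling before classification yields a single decision per video regardless of how many patches are sampled. Other aggregation schemes, such as averaging per-patch scores, would fit the same description.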
Datasets
Real videos: Moments in Time (MiT), Video-ACID. Synthetic videos generated by: Luma, VideoCrafter-v1, CogVideo, Stable Video Diffusion, Sora, Pika, VideoCrafter-v2.
Model(s)
ResNet-50, DIF, Swin-Transformer, ResNet-34, VGG-16, Xception, DenseNet, MISLnet.
Author countries
USA