Beyond Real versus Fake: Towards Intent-Aware Video Analysis

Authors: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva

Published: 2025-11-27 13:44:06+00:00

AI Summary

The paper addresses the limitation of traditional deepfake detection by introducing the novel task of intent recognition in videos, moving beyond binary real/fake classification towards contextual understanding. The authors introduce IntentHQ, a new benchmark dataset comprising 5168 human-centric videos meticulously annotated with 23 fine-grained intent categories. They propose a multi-modal approach integrating video, audio, and text features, using a three-way contrastive self-supervised pre-training framework to align modalities for enhanced intent classification.

Abstract

The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including Financial fraud, Indirect marketing, Political propaganda, as well as Fear mongering. We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.


Key findings

The proposed self-supervised learning approach achieved state-of-the-art performance on IntentHQ, reaching 52.5% accuracy for 23-class intent recognition and significantly outperforming supervised baselines (at most 44.3%). Ablation studies showed that the video modality was the most predictive feature source, and the model reached high accuracy (75.56%) on the binary malicious-versus-benign intent task. Performance was notably higher on classes with structured cues (e.g., Persuasion) and lower on subtle, socially embedded intents (e.g., Social Engineering).
Approach

The core approach is a three-way contrastive self-supervised learning pipeline that aligns video, audio, and text features in a shared latent space using the InfoNCE loss. The pre-trained model, which uses modality-specific encoders (CLIP ViT-L/14 for video, WavLM for audio, the CLIP text encoder for text), is then fine-tuned with a lightweight MLP classifier to predict one of the 23 fine-grained intent categories.
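The paper's code is not reproduced here, so the following is a minimal sketch of what a three-way InfoNCE objective over video, audio, and text embeddings could look like; the embedding dimension, temperature, and random stand-in features are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; all other
    entries in the same row/column act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def three_way_contrastive_loss(z_video, z_audio, z_text):
    """Sum of pairwise InfoNCE terms over the three modalities, pulling
    matching (video, audio, text) triplets together in a shared latent space."""
    return (info_nce(z_video, z_audio) +
            info_nce(z_video, z_text) +
            info_nce(z_audio, z_text))

# Illustrative usage with random features standing in for pooled encoder
# outputs (e.g., CLIP ViT-L/14, WavLM, and CLIP text embeddings projected
# to a common dimension).
B, D = 8, 512
loss = three_way_contrastive_loss(torch.randn(B, D),
                                  torch.randn(B, D),
                                  torch.randn(B, D))
```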
Datasets

IntentHQ (5168 videos)
Model(s)

CLIP ViT-L/14, WavLM, CLIP Text Encoder, MLP classifier, Transformer Decoder (for the cross-attention baseline)
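As a sketch of the fine-tuning stage, the listed encoders' pooled embeddings could feed a lightweight MLP head for 23-class intent prediction; the concatenation-based fusion and the hidden dimension below are assumptions for illustration, since the summary only specifies a lightweight MLP classifier over the aligned features.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Lightweight MLP head over concatenated video/audio/text embeddings.

    Dimensions and the concatenation-based fusion are illustrative
    assumptions, not the paper's exact configuration.
    """
    def __init__(self, embed_dim=512, hidden_dim=1024, num_classes=23):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, z_video, z_audio, z_text):
        fused = torch.cat([z_video, z_audio, z_text], dim=-1)
        return self.mlp(fused)  # (B, 23) intent logits

# Example: logits for a batch of 4 videos from stand-in encoder features
head = IntentClassifier()
logits = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```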
Author countries

India, France, Germany