Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

Authors: Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

Published: 2026-05-17 18:01:32+00:00

Comment: 13 pages, 6 figures

AI Summary

The paper addresses the degradation of frame-level deepfake detectors against high-quality synthetic facial videos by proposing a 3D Convolutional Neural Network (R3D-18) that exploits temporal inconsistencies. Trained with a composite loss that includes a temporal-consistency regularizer, the model reaches higher intra-dataset accuracy than spatial-only baselines and transfers better across datasets, supporting the claim that temporal artifacts are a robust detection signal.

Abstract

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.


Key findings
The proposed 3D CNN achieved 94.2% accuracy on DeepfakeTIMIT and maintained 92.8% on high-quality (128x128) deepfakes, outperforming spatial-only methods such as XceptionNet (89.7%). Without fine-tuning, it reached 76.4% accuracy on FaceForensics++, supporting the claim that temporal artifacts generalize more broadly than spatial ones. Ablation studies attributed 7.2 percentage points to transfer learning and 3.5 points to face tracking, and identified eye blinking and micro-expression transitions as key discriminative temporal artifacts.
Approach
The authors use a 3D Convolutional Neural Network (R3D-18) to analyze 16-frame video clips, focusing on temporal inconsistencies that span multiple frames. The model is initialized with Kinetics-400 pre-trained weights and fine-tuned with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer, which penalizes frame-wise feature variations.
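The composite loss described above can be illustrated with a minimal sketch. The summary does not give the exact form of the regularizer or its weight, so everything specific here is an assumption made for illustration: the mean squared difference between consecutive frame features, the `reg_weight` parameter and its default, and all function names. It is written in plain Python to stay self-contained and is not the authors' implementation.

```python
import math


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for one clip-level logit.

    label is 1.0 for fake and 0.0 for real (a labeling assumed here).
    """
    # Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def temporal_consistency_penalty(frame_features):
    """Mean squared difference between features of consecutive frames.

    frame_features is a list of per-frame feature vectors (lists of floats).
    This particular form of the penalty is an assumption, not the paper's.
    """
    if len(frame_features) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frame_features, frame_features[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count


def composite_loss(logit, label, frame_features, reg_weight=0.1):
    """Binary cross-entropy plus a weighted temporal-consistency penalty.

    reg_weight is a hypothetical hyperparameter; the source gives no value.
    """
    penalty = temporal_consistency_penalty(frame_features)
    return bce_with_logit(logit, label) + reg_weight * penalty


# Usage: a 16-frame clip whose features never change incurs no penalty,
# so the loss reduces to the cross-entropy term alone.
steady_clip = [[1.0, 2.0] for _ in range(16)]
loss = composite_loss(0.0, 1.0, steady_clip)
```

In this sketch, a clip whose per-frame features fluctuate adds to the loss in proportion to `reg_weight`, which is one simple way to express a penalty on frame-wise feature variation.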
Datasets
DeepfakeTIMIT, FaceForensics++, Kinetics-400 (for pre-training)
Model(s)
R3D-18 (proposed 3D CNN), XceptionNet (spatial-only baseline)
Author countries
Germany