ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection

Authors: Mohammad Romani

Published: 2025-11-18 14:56:34+00:00

AI Summary

ForensicFlow is a tri-modal adaptive network designed for robust video Deepfake detection, integrating evidence from three complementary domains: RGB, texture, and frequency. The architecture utilizes state-of-the-art backbones and attention-based mechanisms for temporal pooling and dynamic feature fusion. It achieved high performance on the Celeb-DF (v2) dataset, demonstrating superior resilience against subtle forgeries compared to single-stream baselines.

Abstract

Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions. Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
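The abstract notes training with Focal Loss, which down-weights easy examples so optimization concentrates on hard, ambiguous frames. As a minimal sketch of that loss in PyTorch: the alpha and gamma values below are the common defaults from Lin et al., not values reported by this paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch. alpha/gamma are assumed defaults;
    the paper does not list its hyperparameters here."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()     # hard examples dominate
```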


Key findings
ForensicFlow achieved strong results on Celeb-DF (v2) with an AUC of 0.9752, an F1-Score of 0.9408, and an accuracy of 0.9208, outperforming the single-stream baselines tested. Ablation studies confirmed that the synergy between the three specialized forensic branches is crucial for maximizing detection performance and generalization.
Approach
The method uses three parallel branches, an RGB branch (ConvNeXt-tiny), a texture branch (Swin Transformer-tiny), and a frequency branch (CNN + SE), to extract complementary forgery artifacts. Per-frame features are aggregated with attention-based temporal pooling to prioritize high-evidence frames, and the resulting branch embeddings are combined via adaptive attention fusion for classification.
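A minimal PyTorch sketch of this pipeline is below. The branch encoders are stand-in CNNs (the paper uses ConvNeXt-tiny, Swin-tiny, and a CNN+SE stack), and the module names, dimensions, and gating design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Attention-based temporal pooling: score each frame embedding and
    return the attention-weighted average, so high-evidence frames dominate."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, x):                        # x: (B, T, D)
        w = torch.softmax(self.score(x), dim=1)  # frame weights: (B, T, 1)
        return (w * x).sum(dim=1)                # pooled clip feature: (B, D)

class AdaptiveFusion(nn.Module):
    """Adaptive attention fusion (assumed form): learn per-sample weights
    over the three branch embeddings instead of naive concatenation."""
    def __init__(self, dim, n_branches=3):
        super().__init__()
        self.gate = nn.Linear(dim * n_branches, n_branches)

    def forward(self, feats):                    # feats: list of (B, D)
        stacked = torch.stack(feats, dim=1)      # (B, 3, D)
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)   # fused: (B, D)

class ForensicFlowSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder per-frame encoders; the paper's branches are
        # ConvNeXt-tiny (RGB), Swin-tiny (texture), and CNN+SE (frequency).
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(32, dim)) for _ in range(3)])
        self.pools = nn.ModuleList([TemporalAttentionPool(dim) for _ in range(3)])
        self.fusion = AdaptiveFusion(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, rgb, texture, freq):       # each: (B, T, 3, H, W)
        feats = []
        for branch, pool, clip in zip(self.branches, self.pools,
                                      (rgb, texture, freq)):
            b, t = clip.shape[:2]
            f = branch(clip.flatten(0, 1)).view(b, t, -1)  # per-frame features
            feats.append(pool(f))                          # one (B, D) per branch
        return self.head(self.fusion(feats)).squeeze(-1)   # real/fake logit
```

The key design point the paper emphasizes is that fusion is adaptive: the gate lets the network lean on whichever branch carries the strongest forensic evidence for a given clip, rather than weighting the three streams equally.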
Datasets
Celeb-DF (v2)
Model(s)
ConvNeXt-tiny, Swin Transformer-tiny, CNN with Squeeze-and-Excitation (SE) Block
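The frequency branch pairs a plain CNN with Squeeze-and-Excitation (SE) blocks. For reference, a minimal SE block looks like the following; the reduction ratio of 16 is the common default from Hu et al., not a value reported here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool channel statistics globally ('squeeze'),
    then gate each channel with a small MLP ('excitation'), letting the
    frequency branch emphasize bands carrying periodic spectral noise."""
    def __init__(self, channels, reduction=16):   # reduction=16 is an assumed default
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                     # squeeze: (B, C)
        g = self.fc(s).unsqueeze(-1).unsqueeze(-1) # per-channel gates: (B, C, 1, 1)
        return x * g                               # reweight channels
```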
Author countries
Iran