DeepAgent: A Dual-Stream Multi-Agent Fusion for Robust Multimodal Deepfake Detection

Authors: Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Sami Azam

Published: 2025-12-08 09:43:30+00:00

AI Summary

DeepAgent is a novel multi-agent collaboration framework designed for robust multimodal deepfake detection. It utilizes two specialized agents: Agent-1, a lightweight AlexNet-based CNN for visual artifact detection, and Agent-2, which focuses on audio-visual semantic inconsistency using acoustic features, Whisper transcripts, and OCR. The final decision is reached by fusing the agents' outputs through a Random Forest meta-classifier.
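
To make the Agent-1 description concrete, below is a minimal sketch of a streamlined AlexNet-style CNN in PyTorch. The layer widths, input resolution, and classifier head size are illustrative assumptions, not the paper's reported architecture.

```python
# Illustrative lightweight AlexNet-style CNN for real/fake classification.
# Layer sizes are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class LightAlexNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Trimmed-down AlexNet-style convolutional feature extractor.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Compact classifier head producing real/fake logits.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 512),  # 6x6 spatial map for 224x224 input
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Per-frame probabilities like these would feed the meta-classifier downstream.
frames = torch.randn(4, 3, 224, 224)          # dummy batch of video frames
probs = LightAlexNet()(frames).softmax(dim=1)
print(probs.shape)                            # torch.Size([4, 2])
```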

Abstract

The increasing use of synthetic media, particularly deepfakes, poses an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, a multi-agent collaboration framework that jointly incorporates visual and audio modalities for effective deepfake detection. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify visual artifacts of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and on-screen text read from frame sequences with EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by exploiting the different decision boundaries learned by each agent. This study evaluates the proposed framework on three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, indicating strong generalization across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities, and they demonstrate the effectiveness of a multi-agent approach in addressing diverse types of deepfake manipulation.
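
As a rough sketch of the decision-level fusion described above, the snippet below concatenates the two agents' class-probability outputs into a meta-feature vector and trains a Random Forest on it with scikit-learn; the array shapes, variable names, and random placeholder data are illustrative assumptions, not the paper's released code.

```python
# Sketch of Random Forest meta-classifier fusion (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_videos = 200

# Placeholder per-video probability outputs; in the real pipeline these
# would come from Agent-1 (visual CNN) and Agent-2 (audio-visual agent).
agent1_probs = rng.random((n_videos, 2))      # [P(real), P(fake)] per video
agent2_probs = rng.random((n_videos, 2))
labels = rng.integers(0, 2, size=n_videos)    # 0 = real, 1 = fake

# Concatenate the agents' outputs into a meta-feature vector ...
meta_features = np.concatenate([agent1_probs, agent2_probs], axis=1)

# ... and fit the Random Forest meta-classifier for decision-level fusion.
meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
meta_clf.fit(meta_features, labels)
print(meta_clf.predict_proba(meta_features[:3]))
```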


Key findings
Agent-1 achieved a test accuracy of 94.35% on combined Celeb-DF and FakeAVCeleb datasets, while Agent-2 achieved 93.69% accuracy on FakeAVCeleb. The final Random Forest meta-classifier demonstrated high generalization capability, achieving 97.49% accuracy during cross-dataset validation on DeepFakeTIMIT.
Approach
The system employs two specialized agents: Agent-1, a streamlined AlexNet-based CNN that detects visual artifacts, and Agent-2, which detects audio-visual semantic inconsistencies using MFCC acoustic features, Whisper ASR transcripts, and EasyOCR-extracted frame text. The probability outputs of the two agents are concatenated into a meta-feature vector, which is fed to a Random Forest ensemble classifier for decision-level fusion, as sketched below.
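
A minimal sketch of Agent-2's three input streams, assuming librosa for MFCCs, the openai-whisper package for transcription, and easyocr for on-screen text. The file paths, sampling rate, frame stride, and mean-pooling step are illustrative choices, and the downstream consistency classifier (a DNN per the model list below) is only indicated in a comment.

```python
# Illustrative sketch of Agent-2's feature extraction (not the paper's code).
import cv2
import librosa
import whisper
import easyocr

VIDEO_PATH = "clip.mp4"   # hypothetical input video
AUDIO_PATH = "clip.wav"   # its audio track, extracted beforehand (e.g. via ffmpeg)

# 1) Acoustic stream: mean-pooled MFCCs over the audio track.
waveform, sr = librosa.load(AUDIO_PATH, sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
acoustic_vec = mfcc.mean(axis=1)                      # shape (13,)

# 2) Speech stream: transcript from Whisper ASR.
transcript = whisper.load_model("base").transcribe(AUDIO_PATH)["text"]

# 3) Visual-text stream: EasyOCR over sampled video frames.
reader = easyocr.Reader(["en"])
cap = cv2.VideoCapture(VIDEO_PATH)
frame_texts, frame_idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:   # sample roughly one frame per second at 30 fps
        frame_texts.extend(reader.readtext(frame, detail=0))
    frame_idx += 1
cap.release()

# A downstream classifier (a DNN per the model list) would compare the
# transcript with the on-screen text and combine both with acoustic_vec
# to score audio-visual semantic consistency.
print(acoustic_vec.shape, transcript[:60], frame_texts[:3])
```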
Datasets
Celeb-DF, FakeAVCeleb, DeepFakeTIMIT
Model(s)
AlexNet-based CNN, MFCC features, Whisper, EasyOCR, Deep Neural Network (DNN), Random Forest
Author countries
Bangladesh, Australia