From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users

Authors: Shahroz Tariq, Simon S. Woo, Priyanka Singh, Irena Irmalasari, Saakshi Gupta, Dev Gupta

Published: 2025-08-11 03:55:47+00:00

Comment: 11 pages, 3 tables, 5 figures, accepted for publication in the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland

AI Summary

This paper introduces DF-P2E, a novel multimodal framework for explainable deepfake detection, specifically designed for non-expert users. The framework integrates visual, semantic, and narrative layers of explanation by combining a deepfake classifier with Grad-CAM, a visual captioning module, and a fine-tuned Large Language Model (LLM). This approach aims to provide interpretable and accessible deepfake detection, offering competitive performance while delivering high-quality, human-aligned explanations.

Abstract

The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.


Key findings
The DF-P2E framework achieved competitive deepfake detection performance, with CLIP-large reaching an average AUC of 0.913 on the diverse DF40 dataset. A human evaluation with non-expert participants yielded high ratings for the system's explanations, averaging 4.5 for usefulness, 4.0 for understandability, and 4.0 for explainability on a 5-point Likert scale, indicating strong user trust and comprehension.
Approach
DF-P2E employs a three-stage modular pipeline: first, a deepfake classifier predicts the manipulation probability and generates Grad-CAM saliency maps to localise suspicious regions. Second, a visual captioning module translates these saliency maps, together with the input image, into natural language descriptions of forensic artifacts. Finally, a fine-tuned Large Language Model refines the initial captions into context-aware, user-sensitive narrative explanations, making the detection process transparent and understandable to non-expert users.
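The three stages described above can be sketched as a simple composition of components. The sketch below is illustrative only: the function names and stub outputs are hypothetical stand-ins for the paper's actual models (a CLIP-based classifier with Grad-CAM, a BLIP2-style captioner, and a fine-tuned LLaMA narrator), showing only how the stages hand off to one another.

```python
from dataclasses import dataclass, field

@dataclass
class DetectionResult:
    fake_probability: float          # Stage 1: classifier output
    saliency_regions: list = field(default_factory=list)  # Grad-CAM regions
    caption: str = ""                # Stage 2: visual-captioning output
    narrative: str = ""              # Stage 3: LLM-refined explanation

def classify_with_saliency(image) -> DetectionResult:
    # Stage 1 (stub): a deepfake classifier predicts manipulation
    # probability; Grad-CAM highlights the regions driving the prediction.
    return DetectionResult(fake_probability=0.94,
                           saliency_regions=["mouth", "left eye"])

def caption_saliency(result: DetectionResult) -> DetectionResult:
    # Stage 2 (stub): a captioning model describes the forensic
    # artifacts found in the salient regions.
    regions = " and ".join(result.saliency_regions)
    result.caption = f"Blending artifacts are visible around the {regions}."
    return result

def refine_narrative(result: DetectionResult) -> DetectionResult:
    # Stage 3 (stub): a fine-tuned LLM rewrites the caption as a
    # context-aware narrative aimed at non-expert readers.
    result.narrative = (
        f"This image is likely manipulated "
        f"({result.fake_probability:.0%} confidence). {result.caption}"
    )
    return result

def explain_deepfake(image) -> DetectionResult:
    # The full prediction-to-explanation pipeline: classify, localise,
    # describe, then narrate.
    return refine_narrative(caption_saliency(classify_with_saliency(image)))

print(explain_deepfake(image=None).narrative)
```

The modular hand-off is the key design point: each stage consumes the previous stage's output, so any component (classifier, captioner, or LLM) can be swapped without changing the others.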
Datasets
DF40 (main benchmark), FaceForensics++, CelebDF, UADFV, VFHQ, FFHQ, CelebA, MSCOCO (for captioning model fine-tuning).
Model(s)
Deepfake Detection: XceptionNet, CLIP-base, CLIP-large. Visual Captioning: BLIP, BLIP2 (Flan-T5 variants), GIT, OFA, ViT-GPT2, PaliGemma. Narrative Refinement: LLaMA-3.2-11B-Vision.
Author countries
Australia, South Korea