A Multimodal Framework for Deepfake Detection

Authors: Kashish Gandhi, Prutha Kulkarni, Taran Shah, Piyush Chaudhari, Meera Narvekar, Kranti Ghag

Published: 2024-10-04 14:59:10+00:00

AI Summary

This research proposes a multimodal deepfake detection framework that integrates visual and auditory analyses. The framework analyzes nine facial characteristics from videos and mel-spectrograms from audio, achieving an accuracy of 94% by classifying a sample as deepfake if either component is identified as such.

Abstract

The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation, fraud, and severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through an innovative multimodal approach, targeting both visual and auditory elements. This comprehensive strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, a model that employs advanced feature extraction techniques was developed, extracting nine distinct facial characteristics and then applying various machine learning and deep learning models. For auditory analysis, our model leverages mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To achieve a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes to ensure balanced samples. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
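
As a concrete illustration of the audio branch described in the abstract, the sketch below shows how a mel-spectrogram might be extracted from an audio clip before being passed to a CNN classifier such as VGG19. The use of librosa and the specific parameter values (sampling rate, number of mel bands) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of mel-spectrogram feature extraction for the audio branch.
# Assumptions: librosa for audio processing; sr=16000 and n_mels=128 are
# illustrative parameter choices, not values reported by the authors.
import librosa
import numpy as np

def extract_mel_spectrogram(audio_path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Load an audio clip and return a log-scaled mel-spectrogram (n_mels x frames)."""
    waveform, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    # Convert the power spectrogram to the decibel scale, a typical input format for CNNs.
    return librosa.power_to_db(mel, ref=np.max)
```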


Key findings
The multimodal framework achieved an overall accuracy of 94% in deepfake detection. The ANN model for video achieved 93% accuracy, while the VGG19 model for audio reached 98% accuracy. The proposed multimodal method outperforms the unimodal approaches and the other traditional machine learning methods tested.
Approach
The approach extracts nine facial features from video frames (using Haar Cascade and FaceMesh) and mel-spectrograms from the audio track. These features are fed into separate machine learning and deep learning models (ANN, VGG19, Random Forest, XGBoost) for video and audio classification, respectively. The final label is determined by a logical OR of the two modality-level classifications, as sketched below.
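
A minimal sketch of the decision-level fusion described above: each modality produces a binary prediction, and the sample is flagged as deepfake if either branch says so. The function name and the 0.5 probability threshold are illustrative assumptions; the paper specifies only the logical OR rule applied to the ANN (video) and VGG19 (audio) outputs.

```python
# Minimal sketch of the logical-OR decision fusion across modalities.
# Assumptions: video_prob and audio_prob come from hypothetical wrappers around
# the trained ANN (video) and VGG19 (audio) classifiers, each returning a
# deepfake probability; the 0.5 threshold is an illustrative choice.

def fuse_predictions(video_prob: float, audio_prob: float, threshold: float = 0.5) -> bool:
    """Return True (deepfake) if either modality classifies the sample as fake."""
    video_is_fake = video_prob >= threshold
    audio_is_fake = audio_prob >= threshold
    return video_is_fake or audio_is_fake

# Example: a sample whose audio branch looks fake is flagged even if the video looks real.
label = fuse_predictions(video_prob=0.12, audio_prob=0.91)  # -> True (deepfake)
```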
Datasets
DFDC Dataset (video), Fake-or-Real (FoR) dataset (audio)
Model(s)
Artificial Neural Network (ANN), VGG19, Random Forest, XGBoost, Decision Trees, Bagging
Author countries
India