Investigating the Impact of Pre-processing and Prediction Aggregation on the DeepFake Detection Task

Authors: Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Kompatsiaris

Published: 2020-06-12 11:16:02+00:00

AI Summary

This paper investigates the impact of dataset pre-processing and prediction aggregation on DeepFake detection performance. It proposes a pre-processing step that uses face recognition embeddings to improve training data quality by identifying and removing false face detections. The study also evaluates several video-level prediction aggregation schemes. It shows that the proposed pre-processing considerably improves detection models, and that the proposed 'Face' aggregation method further improves detection performance, especially in videos with multiple faces.

Abstract

Recent advances in content generation technologies (widely known as DeepFakes) along with the online proliferation of manipulated media content render the detection of such manipulations a task of increasing importance. Even though there are many DeepFake detection methods, only a few focus on the impact of dataset preprocessing and the aggregation of frame-level to video-level prediction on model performance. In this paper, we propose a pre-processing step to improve the training data quality and examine its effect on the performance of DeepFake detection. We also propose and evaluate the effect of video-level prediction aggregation approaches. Experimental results show that the proposed pre-processing approach leads to considerable improvements in the performance of detection models, and the proposed prediction aggregation scheme further boosts the detection efficiency in cases where there are multiple faces in a video.


Key findings
The proposed pre-processing significantly improved DeepFake detection performance for all evaluated architectures and datasets, reducing log loss by 5-13%. The 'Face' prediction aggregation method consistently outperformed the other aggregation baselines, especially in videos containing multiple faces. The study also found that current detection models generalize poorly to unseen manipulations: they perform badly on datasets whose manipulation types differ from those in their training data.
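Log loss is the metric behind the reported gains. It is the mean binary cross-entropy between ground-truth labels (1 = fake, 0 = real) and predicted fake probabilities, and lower is better. The following is a minimal sketch. The clipping constant `eps` is an assumption, added so that a prediction of exactly 0 or 1 does not produce an infinite loss.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; labels are 1 (fake) or 0 (real)."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Assumed clipping so that log() is never evaluated at 0.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


print(round(log_loss([1, 0, 1], [0.9, 0.2, 0.6]), 4))  # 0.2798
```

The metric punishes confident mistakes heavily. Predicting 0.01 for a fake video costs -ln(0.01) ≈ 4.61, whereas an uninformative 0.5 costs ln 2 ≈ 0.69.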
Approach
The authors propose a pre-processing step that uses a face recognition model to compute an embedding for every detected face. Pairwise similarities between these embeddings are used to form connected components, each grouping detections of the same face. Small components, which are likely false detections, are removed to clean the dataset. For video-level prediction, they evaluate several aggregation methods: the mean, median, and maximum of the frame-level predictions, and a 'Face' aggregation that averages the predictions within each face cluster and then takes the maximum over clusters.
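Both steps can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' implementation. The embeddings would come from some face recognition model, but here they are tiny hand-made 2-D vectors. Cosine similarity, the threshold `sim_threshold`, the minimum component size `min_size`, and the helper names `face_clusters` and `aggregate` are all hypothetical choices. Connected components are found with a small union-find.

```python
import math
import statistics
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two face embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def face_clusters(embeddings, sim_threshold=0.8, min_size=2):
    """Connected components of the thresholded similarity graph.

    sim_threshold and min_size are hypothetical defaults, not values from
    the paper. Components with fewer than min_size detections are dropped
    as likely false detections. Returns sorted lists of detection indices.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of detections whose embeddings are similar enough.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= sim_threshold:
                parent[find(i)] = find(j)

    components = defaultdict(list)
    for i in range(len(embeddings)):
        components[find(i)].append(i)
    return sorted(c for c in components.values() if len(c) >= min_size)


def aggregate(preds, method="mean", clusters=None):
    """Collapse frame-level fake probabilities into one video-level score."""
    if method == "mean":
        return statistics.mean(preds)
    if method == "median":
        return statistics.median(preds)
    if method == "max":
        return max(preds)
    if method == "face":
        # Average within each face cluster, then keep the most suspicious face.
        return max(statistics.mean(preds[i] for i in c) for c in clusters)
    raise ValueError(f"unknown aggregation method: {method}")


# Toy video: two identities (indices 0-2 and 3-5) plus one stray detection (6).
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.05],
              [0.0, 1.0], [0.1, 0.99], [0.05, 0.98],
              [0.7, -0.7]]
preds = [0.1, 0.2, 0.1, 0.9, 0.8, 0.85, 0.95]

clusters = face_clusters(embeddings)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
for method in ("mean", "median", "max", "face"):
    print(method, round(aggregate(preds, method, clusters), 3))
```

In this toy input the stray seventh detection forms a singleton component and is filtered out. It therefore cannot affect the 'face' score (about 0.85, the mean of the second identity), whereas the plain maximum over all frames picks up its 0.95. Union-find is used only because it finds connected components in a few lines. A graph library's connected-components routine would serve equally well.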
Datasets
DFDC (DeepFake Detection Challenge), Celeb-DF, FaceForensics++
Model(s)
MesoInception-4, XceptionNet, EfficientNet-B4
Author countries
Greece