Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples

Authors: Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris

Published: 2025-02-06 10:47:34+00:00

Comment: Accepted for publication, AI4MFDD Workshop @ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, Feb. 2025. This is the authors' "accepted version"

AI Summary

This paper introduces a perturbation approach for explaining deepfake detectors that leverages adversarially-generated samples of the input images the detector classified as deepfakes. These samples are created with Natural Evolution Strategies so that the detector classifies them as 'real', and they are then used to form the perturbation masks. The approach is applied to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluated on FaceForensics++, where it mostly improves their performance in identifying manipulated regions.

Abstract

In this paper, we introduce the idea of using adversarially-generated samples of the input images that were classified as deepfakes by a detector, to form perturbation masks for inferring the importance of different input features and produce visual explanations. We generate these samples based on Natural Evolution Strategies, aiming to flip the original deepfake detector's decision and classify these samples as real. We apply this idea to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluate the performance of the resulting modified methods using a SOTA deepfake detection model, a benchmarking dataset (FaceForensics++) and a corresponding explanation evaluation framework. Our quantitative assessments document the mostly positive contribution of the proposed perturbation approach in the performance of explanation methods. Our qualitative analysis shows the capacity of the modified explanation methods to demarcate the manipulated image regions more accurately, and thus to provide more useful explanations.


Key findings

Quantitatively, the proposed adversarial perturbation approach made a mostly positive contribution: under the explanation evaluation framework used in the paper, the modified explanation methods produced a larger drop in detector accuracy and higher sufficiency scores, with the modified LIME (LIMEadv) benefiting most. Qualitatively, the modified methods demarcated the manipulated image regions more accurately, giving more meaningful and useful visual explanations of the deepfake detector's decisions.

Approach

The authors use Natural Evolution Strategies (NES) to generate adversarial samples of images that the detector classified as deepfakes. Each sample is designed to flip the detector's decision from 'deepfake' to 'real' with minimal visual distortion. The samples are then used to form the perturbation masks within existing explanation methods (LIME, SHAP, SOBOL, RISE), which infer the importance of different input features and produce more accurate visual explanations of the detector's decisions.
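
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy linear "detector" on an 8x8 grayscale image (standing in for EfficientNetV2), an antithetic NES gradient estimate with hypothetical hyperparameters (`eps`, `sigma`, `n_pairs`, `lr`), and a RISE-style aggregation in which masked-out regions are filled with the adversarial sample instead of a black or blurred patch. That last point is one possible reading of "used as perturbation masks". All function names are illustrative.

```python
import numpy as np

PATCH = (slice(2, 6), slice(2, 6))  # hypothetical "manipulated" region

def detector(img):
    """Toy black-box detector: returns P(fake). Only the patch matters."""
    weights = np.zeros((8, 8))
    weights[PATCH] = 2.0
    logit = float((weights * img).sum()) - 14.0
    return 1.0 / (1.0 + np.exp(-logit))

def nes_attack(query, x, rng, eps=0.1, sigma=0.05, n_pairs=50,
               lr=0.02, max_iters=200):
    """Lower P(fake) below 0.5 using only score queries (NES estimate)."""
    x_adv = x.copy()
    for _ in range(max_iters):
        if query(x_adv) < 0.5:  # decision flipped to 'real'
            break
        grad = np.zeros_like(x)
        for _pair in range(n_pairs):
            u = rng.standard_normal(x.shape)
            # Antithetic sampling: probe in both directions along u.
            diff = query(x_adv + sigma * u) - query(x_adv - sigma * u)
            grad += diff * u
        grad /= 2.0 * sigma * n_pairs
        step = lr * grad / (np.abs(grad).max() + 1e-12)
        # Descend on P(fake), staying within eps of x and inside [0, 1].
        x_adv = np.clip(x_adv - step, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

def explain_with_adversarial_fill(query, x, x_adv, rng, n_masks=1000, cell=2):
    """RISE-style saliency; masked-out cells take the adversarial pixels."""
    h, w = x.shape
    score_sum = np.zeros_like(x)
    keep_count = np.zeros_like(x)
    for _ in range(n_masks):
        coarse = rng.integers(0, 2, size=(h // cell, w // cell))
        mask = np.kron(coarse, np.ones((cell, cell)))  # 1 = keep original
        blended = mask * x + (1.0 - mask) * x_adv
        score_sum += query(blended) * mask
        keep_count += mask
    # Average P(fake) over the masks in which each pixel was kept.
    return score_sum / np.maximum(keep_count, 1.0)

rng = np.random.default_rng(0)
x0 = np.full((8, 8), 0.5)            # toy image classified as 'fake'
x_adv = nes_attack(detector, x0, rng)
saliency = explain_with_adversarial_fill(detector, x0, x_adv, rng)

inside = np.zeros((8, 8), dtype=bool)
inside[PATCH] = True
print("P(fake) original:   ", round(detector(x0), 3))
print("P(fake) adversarial:", round(detector(x_adv), 3))
print("mean saliency inside vs outside patch:",
      saliency[inside].mean(), saliency[~inside].mean())
```

In this toy setting, keeping the original pixels of the patch keeps the detector's 'fake' score high, while swapping them for their adversarial counterparts lowers it, so the patch should receive the highest saliency. LIME, SHAP and SOBOL aggregate perturbation outcomes differently, but the adversarial sample could in principle replace their default perturbation in a similar way.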

Datasets

FaceForensics++

Model(s)

EfficientNetV2 (for deepfake detection)

Author countries

Greece