Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Authors: Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proença, Tiago Roxo

Published: 2026-04-30 15:40:56+00:00

Comment: Submitted to IJCB 2026

AI Summary

This paper introduces a novel five-class audio-visual DeepFake detection formulation by adding a 'Real Audio-Real Video with Semantic Mismatch (RARV-SMM)' class to address the semantic inconsistency challenge. It demonstrates that state-of-the-art models often rely on data source integrity and fail to detect semantic mismatches between authentic modalities. The authors propose a semantic reinforcement strategy using ImageBind embeddings to improve detection robustness in this more realistic DeepFake setting.

Abstract

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, and this variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at https://github.com/.


Key findings
Existing state-of-the-art DeepFake detection models largely fail to detect semantic inconsistencies (RARV-SMM), revealing their reliance on signal-level artifacts or data source integrity. Explicit training with the new RARV-SMM class improves performance for cross-modal and distance-based architectures such as FGMDF and FGI, but not for lip-sync-grounded models such as AVDF. The proposed semantic reinforcement strategy improves detection robustness across different architectural types, especially on the more challenging semantic mismatch variants and in state-of-the-art settings.
Approach
The authors extend the four-class DeepFake detection problem with a fifth class, RARV-SMM, which consists of authentic audio and video clips that are semantically inconsistent with each other. These samples are generated from VoxCeleb2 in three variants of increasing audio-visual divergence. To detect such mismatches, they propose a model-agnostic semantic reinforcement strategy: a cosine similarity score computed from frozen ImageBind embeddings, which measures semantic coherence between the audio and video streams, is concatenated to the fusion output of an existing DeepFake classifier.
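The two ideas in this approach can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the audio and video embeddings have already been produced by a frozen encoder (ImageBind in the paper) and are passed in as plain vectors. The function names `cosine_similarity`, `reinforce_fusion`, and `make_mismatched_pairs` are hypothetical. The pairing recipe (taking the audio of one real clip and the video of another) is only one plausible way to obtain a semantic mismatch, since this summary does not specify the three variants.

```python
import math
import random
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector counts as "no agreement"
    return dot / (norm_a * norm_b)


def reinforce_fusion(
    fusion: Sequence[float],
    audio_emb: Sequence[float],
    video_emb: Sequence[float],
) -> List[float]:
    """Append the audio-video similarity score to a detector's fusion output.

    The embeddings are assumed to come from a frozen encoder, so only the
    classifier that consumes the extended vector would be trained.
    """
    return list(fusion) + [cosine_similarity(audio_emb, video_emb)]


def make_mismatched_pairs(
    clip_ids: Sequence[str], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pair each clip's video with the audio of a different clip.

    Hypothetical recipe: a random non-zero cyclic shift, which guarantees
    that no clip is paired with its own audio when the ids are unique.
    """
    n = len(clip_ids)
    if n < 2:
        raise ValueError("need at least two clips to build a mismatch")
    shift = random.Random(seed).randrange(1, n)
    return [(clip_ids[i], clip_ids[(i + shift) % n]) for i in range(n)]


if __name__ == "__main__":
    fusion = [0.2, -1.0, 0.5]
    # Identical embeddings give a similarity of 1.0 in the extra slot.
    print(reinforce_fusion(fusion, [1.0, 0.0], [1.0, 0.0]))
    # Orthogonal embeddings give a similarity of 0.0 in the extra slot.
    print(reinforce_fusion(fusion, [1.0, 0.0], [0.0, 1.0]))
    print(make_mismatched_pairs(["clip_a", "clip_b", "clip_c"]))
```

Because the similarity enters as a single extra feature, the same wrapper could in principle sit in front of the classification head of any detector that exposes a fusion vector, which is one reading of what "model-agnostic" means here.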
Datasets
FakeAVCeleb, VoxCeleb2 (for RARV-SMM creation), LAV-DF
Model(s)
FGMDF, FGI, AVDF (fine-tunes AV-HuBERT), ImageBind (for semantic reinforcement)
Author countries
Portugal