Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes

Authors: Gautam Siddharth Kashyap, Harsh Joshi, Niharika Jain, Ebad Shabbir, Jiechao Gao, Nipun Joshi, Usman Naseem

Published: 2026-01-24 17:07:51+00:00

Comment: Accepted at EACL Findings 2026

AI Summary

This paper introduces ConLLM (Contrastive Learning with Large Language Models), a hybrid framework designed for robust multimodal deepfake detection. ConLLM addresses modality fragmentation and shallow inter-modal reasoning through a two-stage architecture. It first extracts modality-specific embeddings using Pre-Trained Models (PTMs), then aligns these embeddings via contrastive learning and refines them with LLM-based reasoning to capture subtle semantic inconsistencies.

Abstract

The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.
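The contrastive alignment the abstract describes can be illustrated with an InfoNCE-style loss over paired modality embeddings, where matched audio/video pairs are pulled together and mismatched pairs pushed apart. The sketch below is a minimal NumPy illustration, not the paper's implementation; the function name, temperature value, and batch layout are assumptions.

```python
import numpy as np

def info_nce(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: row i of each matrix is a matched pair."""
    # L2-normalize so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature          # (n, n) similarity matrix
    n = logits.shape[0]
    # cross-entropy with the correct pair on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), np.arange(n)].mean()
```

Minimizing this loss drives embeddings of genuinely co-occurring audio and video toward each other in the shared space, which is what lets mismatches (one hallmark of multimodal deepfakes) show up as low cross-modal similarity.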


Key findings
ConLLM significantly outperforms state-of-the-art methods across all modalities, achieving up to 50% EER reduction in audio, 8% accuracy improvement in video, and 9% accuracy gains in audio-visual deepfake detection. Ablation studies confirmed the critical contributions of PTMs, contrastive learning, and LLM-based embedding refinement. Additionally, ConLLM demonstrated superior computational efficiency and strong cross-lingual generalization compared to other multimodal models.
Approach
ConLLM employs a two-stage architecture. Stage 1 uses Pre-Trained Models (PTMs) (XLS-R for audio, VideoMAE for video, and VATLM for audio-visual data) to extract modality-specific embeddings and project them into a shared latent space. Stage 2 applies contrastive learning to align these embeddings, then uses a GPT-style transformer for LLM-based semantic reasoning to capture fine-grained inter-modal inconsistencies before a final classification head produces the real/fake prediction.
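The two-stage flow can be sketched as follows. This is a toy NumPy skeleton under stated assumptions: random projections stand in for the learned Stage-1 heads on top of PTM features, and a single self-attention pass stands in for the GPT-style transformer refinement; the class name, dimensions, and stub weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class ConLLMSketch:
    """Toy two-stage pipeline: project -> align -> refine -> classify."""
    def __init__(self, d_audio, d_video, d_shared, rng):
        # random projection heads standing in for trained Stage-1 projections
        self.W_a = rng.standard_normal((d_audio, d_shared)) / np.sqrt(d_audio)
        self.W_v = rng.standard_normal((d_video, d_shared)) / np.sqrt(d_video)
        self.W_cls = rng.standard_normal((2 * d_shared, 2)) / np.sqrt(2 * d_shared)
        self.d_shared = d_shared

    def stage1(self, a_feat, v_feat):
        # PTM embeddings (XLS-R / VideoMAE stand-ins) mapped to a shared space
        return l2norm(a_feat @ self.W_a), l2norm(v_feat @ self.W_v)

    def stage2(self, z_a, z_v):
        # stand-in for LLM refinement: one self-attention pass over two
        # modality tokens, letting each attend to the other
        tokens = np.stack([z_a, z_v], axis=1)                 # (batch, 2, d)
        scores = tokens @ tokens.transpose(0, 2, 1)           # (batch, 2, 2)
        attn = softmax(scores / np.sqrt(self.d_shared))
        refined = attn @ tokens                               # (batch, 2, d)
        return refined.reshape(len(z_a), -1)                  # fuse tokens

    def forward(self, a_feat, v_feat):
        z_a, z_v = self.stage1(a_feat, v_feat)
        fused = self.stage2(z_a, z_v)
        return softmax(fused @ self.W_cls)                    # real/fake probs
```

In the actual system the Stage-1 projections are trained jointly with the contrastive objective, and Stage 2 is a full GPT-style transformer rather than a single attention pass; the skeleton only shows how the pieces connect.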
Datasets
ASVSpoof 2019 (LA), DECRO (D-E, D-C), Celeb-DF (CDF), WildDeepfake (WD), FakeAVCeleb (FAFC), DeepFake Detection Challenge (DFDC)
Model(s)
ConLLM, which integrates XLS-R (for audio), VideoMAE (for video), VATLM (for audio-visual), contrastive learning, and a GPT-style transformer architecture.
Author countries
Australia, India, USA