Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

Authors: Wasim Ahmad, Wei Zhang, Xuerui Mao

Published: 2026-04-29 09:11:13+00:00

AI Summary

This paper introduces the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation to specific generative models. AMDD treats generator attribution as a structured regularization, enforcing a stronger geometric constraint on the shared embedding space to capture forensically meaningful features. A novel Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss is proposed to align generator-induced artifacts across visual and audio streams.

Abstract

Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches treat the problem as binary classification and often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding with a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.


Key findings
On FakeAVCeleb, AMDD achieved 99.7% balanced accuracy, 99.8% AUC, and 95.9% attribution accuracy, showing that it can both detect deepfakes and identify the generator that produced them. In cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF, real video detection generalized robustly, but fake detection on unseen generators remained an open challenge. This suggests the model learns generator-specific forensic fingerprints rather than general forgery principles.
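
Balanced accuracy is the mean of per-class recall, so it can expose a gap between real video detection and fake video detection that plain accuracy hides. The following is a minimal illustrative sketch, not the authors' evaluation code; the label convention (0 = real, 1 = fake) and the toy predictions are assumptions made for this example.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall. Assumed labels: 0 = real, 1 = fake."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of samples whose ground-truth label is `cls`
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 8 real clips and 2 fake clips
y_true = [0] * 8 + [1] * 2
# Every real clip is classified correctly; one of the two fakes is missed
y_pred = [0] * 8 + [0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, bal_acc)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because real recall is 1.0 and fake recall is only 0.5.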
Approach
The AMDD framework pairs a ResNet50 with temporal attention for visual encoding with a pretrained ResNet18 for mel spectrograms, which balances encoder capacity across the two modalities. It jointly optimizes binary deepfake detection and generator attribution using a combination of five losses: focal, cross-entropy, cross-modal contrastive, cross-modal forensic fingerprint consistency (CMFFC), and centroid regularization. Cross-modal attention aligns features between modalities.
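
To make the joint objective concrete, the sketch below combines loss terms as a weighted sum. It is an illustration under stated assumptions, not the authors' implementation: the focal loss follows the standard binary form, the defaults `gamma=2.0` and `alpha=0.25` are common choices rather than values from the paper, all weights are hypothetical, and the contrastive, CMFFC, and centroid terms are left as placeholders because their exact forms are not given here.

```python
import math


def focal_loss(p_fake, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one sample.

    p_fake: predicted probability of the 'fake' class.
    label:  1 = fake, 0 = real (assumed convention).
    """
    p_t = p_fake if label == 1 else 1.0 - p_fake
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(probs, target):
    """Cross-entropy for one sample given class probabilities."""
    return -math.log(max(probs[target], 1e-12))


def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical per-sample values, for illustration only
terms = {
    "focal": focal_loss(p_fake=0.9, label=1),                     # detection head
    "attribution_ce": cross_entropy([0.1, 0.7, 0.2], target=1),   # attribution head
    "contrastive": 0.0,  # placeholder: form not specified in this summary
    "cmffc": 0.0,        # placeholder: form not specified in this summary
    "centroid": 0.0,     # placeholder: form not specified in this summary
}
weights = {name: 1.0 for name in terms}  # hypothetical equal weighting
loss = total_loss(terms, weights)
print(loss)
```

The focal term down-weights samples the detector already classifies confidently, so a confident correct prediction contributes far less than a confident wrong one.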
Datasets
FakeAVCeleb, DeepfakeTIMIT, DFDM, LAV-DF
Model(s)
ResNet50 (for visual encoding), ResNet18 (for audio encoding), Multi-Head Attention (for temporal and cross-modal attention), 2-layer MLPs (for detection and attribution heads)
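
As a rough picture of what temporal attention does with per-frame features, the sketch below pools a sequence of frame vectors into one clip vector using scaled dot-product attention. It is a single-head simplification with a hypothetical query vector, written in plain Python for readability; the model listed above uses multi-head attention, and none of the names here come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Pool T frame vectors (each of length d) into one vector.

    Scores are scaled dot products between the query and each frame;
    the output is the attention-weighted average of the frames.
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[j] for w, frame in zip(weights, frames))
        for j in range(d)
    ]
    return pooled, weights


# Hypothetical 2-dimensional features for three frames
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query gives uniform weights, so pooling reduces to the mean
mean_pooled, uniform_weights = attention_pool(frames, [0.0, 0.0])

# A query along the first axis favors frames with a large first component
focused_pooled, focused_weights = attention_pool(frames, [2.0, 0.0])
print(uniform_weights, focused_weights)
```

With the zero query every frame receives the same weight; with the second query the first and third frames receive equal weight and the second frame receives less.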
Author countries
China