ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

Authors: Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li

Published: 2025-08-24 10:03:46+00:00

AI Summary

ERF-BA-TFD+ is a multimodal deepfake detection model that uses enhanced receptive fields and audio-visual fusion to improve detection accuracy and robustness. It models long-range dependencies in audio-visual input to better capture subtle discrepancies between real and fake content, achieving state-of-the-art results on the DDL-AV dataset.

Abstract

Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the Workshop on Deepfake Detection, Localization, and Interpretability, Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.


Key findings
ERF-BA-TFD+ achieved state-of-the-art results on the DDL-AV dataset, outperforming existing methods in accuracy and speed. The model's performance improved significantly after integrating the UMMA framework and the ERF module, particularly for detecting audio discrepancies and long-duration manipulations.
Approach
The model extracts video and audio features in parallel using MViTv2 and BYOL-A encoders, respectively. A Cross-Reconstruction Attention Transformer (CRA-Trans) module learns cross-modal temporal dependencies over the fused features; a Frame Classification Module then produces frame-level real/fake predictions, while a Boundary Localization Module localizes manipulated segments.
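
To make the pipeline concrete, below is a minimal PyTorch sketch of the described architecture. It is not the authors' implementation: the MViTv2 and BYOL-A encoders are replaced by linear projections over precomputed features, the CRA-Trans module is approximated by a plain bidirectional cross-attention block, and all dimensions, module names, and output shapes are illustrative assumptions.

```python
# Minimal architectural sketch (not the authors' code). Encoder backbones are
# stand-ins, and CRA-Trans is approximated by standard cross-attention.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Stand-in for CRA-Trans: each modality attends to the other."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video, audio: (batch, frames, dim), assumed time-aligned
        v_attended, _ = self.v2a(video, audio, audio)  # video queries audio
        a_attended, _ = self.a2v(audio, video, video)  # audio queries video
        return self.norm(v_attended + a_attended)      # fused per-frame features


class DeepfakeDetectorSketch(nn.Module):
    def __init__(self, video_feat_dim: int = 768, audio_feat_dim: int = 2048,
                 dim: int = 256):
        super().__init__()
        # Placeholders for the pretrained MViTv2 / BYOL-A feature extractors.
        self.video_proj = nn.Linear(video_feat_dim, dim)
        self.audio_proj = nn.Linear(audio_feat_dim, dim)
        self.fusion = CrossAttentionFusion(dim)
        # Frame Classification Module: per-frame real/fake logit.
        self.frame_classifier = nn.Linear(dim, 1)
        # Boundary Localization Module: per-frame segment offsets (simplified).
        self.boundary_head = nn.Linear(dim, 2)

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor):
        fused = self.fusion(self.video_proj(video_feats),
                            self.audio_proj(audio_feats))
        frame_logits = self.frame_classifier(fused).squeeze(-1)  # (B, T)
        boundaries = self.boundary_head(fused)                   # (B, T, 2)
        return frame_logits, boundaries


if __name__ == "__main__":
    model = DeepfakeDetectorSketch()
    video = torch.randn(2, 100, 768)   # 100 frames of precomputed video features
    audio = torch.randn(2, 100, 2048)  # 100 time-aligned audio feature frames
    logits, bounds = model(video, audio)
    print(logits.shape, bounds.shape)  # (2, 100) and (2, 100, 2)
```

In this sketch, frame-level classification and segment-level boundary localization share the fused audio-visual representation, which mirrors the division of labor described above; the actual model's heads, losses, and feature dimensions may differ.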
Datasets
DDL-AV dataset (includes segmented and full-length video clips)
Model(s)
MViTv2 (video encoder), BYOL-A (audio encoder), Cross-Reconstruction Attention Transformer (CRA-Trans), Logistic Regression (frame classifier)
Author countries
China