A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection

Authors: Libo Lv, Tianyi Wang, Mengxiao Huang, Ruixia Liu, Yinglong Wang

Published: 2025-08-28 05:55:28+00:00

AI Summary

The paper introduces SFMFNet, a lightweight deepfake detection network designed for real-time applications. SFMFNet achieves a balance between accuracy and efficiency by using a spatial-frequency hybrid aware module and a token-selective cross-attention mechanism.
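
The token-selective cross-attention idea mentioned above can be illustrated with a minimal sketch. This summary does not describe how SFMFNet scores or selects tokens, so the L2-norm saliency rule, the absence of learned projections, and every function name here are assumptions made for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def select_tokens(tokens, k):
    """Keep the k tokens with the largest squared L2 norm.
    ASSUMPTION: this saliency rule is hypothetical, not the paper's."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: dot(tokens[i], tokens[i]),
                    reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]  # preserve original order

def cross_attention(queries, context):
    """Scaled dot-product cross-attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(d) for c in context])
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(d)])
    return out

def token_selective_cross_attention(queries, context, k):
    """Attend only to the k selected context tokens, so the cost scales
    with k rather than with the full context length."""
    return cross_attention(queries, select_tokens(context, k))

queries = [[1.0, 0.0], [0.0, 1.0]]
context = [[0.1, 0.0], [3.0, 0.0], [0.0, 2.0]]
fused = token_selective_cross_attention(queries, context, k=2)
```

Restricting attention to a subset of tokens is one common way to cut the quadratic cost of attention; a trained model would normally learn both the selection score and the query, key, and value projections.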

Abstract

With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.
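
To make the downsampling step concrete, here is a minimal sketch of blur pooling (low-pass filtering before strided subsampling) with a simple residual term. The abstract does not specify how the residual enhancement is built, so the blend controlled by `alpha`, the 3x3 binomial kernel, the replicate padding, and the function names are all assumptions rather than the authors' design.

```python
def blur_pool_2d(x, stride=2):
    """Anti-aliased downsampling: 3x3 binomial blur, then strided sampling.
    Borders use replicate padding (an assumption for this sketch)."""
    k = [0.25, 0.5, 0.25]
    h, w = len(x), len(x[0])

    def at(i, j):
        # Clamp indices so out-of-range reads repeat the border pixel.
        return x[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            row.append(sum(k[a] * k[b] * at(i + a - 1, j + b - 1)
                           for a in range(3) for b in range(3)))
        out.append(row)
    return out

def residual_blur_pool_2d(x, stride=2, alpha=0.5):
    """HYPOTHETICAL residual variant: blend the blurred output with the
    plain strided sample. alpha=0 reduces to ordinary blur pooling."""
    blurred = blur_pool_2d(x, stride)
    out = []
    for bi, i in enumerate(range(0, len(x), stride)):
        cols = range(0, len(x[0]), stride)
        out.append([b + alpha * (x[i][j] - b)
                    for b, j in zip(blurred[bi], cols)])
    return out

# Vertical stripes: naive stride-2 sampling would see only zeros (aliasing),
# while blur pooling keeps evidence of the high-frequency pattern.
stripes = [[float(j % 2) for j in range(4)] for _ in range(4)]
pooled = blur_pool_2d(stripes)
pooled_residual = residual_blur_pool_2d(stripes, alpha=0.5)
```

On the striped input, plain subsampling discards the pattern entirely, whereas the blurred version retains a nonzero response, which is the motivation for filtering before downsampling.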


Key findings

SFMFNet achieves high accuracy on multiple benchmark datasets while maintaining low computational cost, outperforming many existing methods in terms of AUC. The ablation study validates the contribution of each module in improving detection performance. The model shows a favorable balance between accuracy and efficiency suitable for real-time applications.
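
Since the comparison above is reported in terms of AUC, a short sketch of how that metric can be computed may help. It uses the pairwise (Mann-Whitney) formulation: AUC is the fraction of (fake, real) pairs in which the fake sample receives the higher score, with ties counted as half. The labels and scores below are made-up illustration values, not results from the paper, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.
    labels: 1 for fake (positive), 0 for real (negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # fake correctly ranked above real
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Illustrative values only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

Because AUC depends only on the ranking of scores, it is insensitive to the decision threshold, which is why it is widely used for cross-dataset comparisons of detectors.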

Approach

SFMFNet uses a CNN backbone to extract multi-scale features. A spatial-frequency hybrid aware module fuses wavelet and spatial attention features, enhancing forgery detection. A token-selective cross-attention mechanism improves multi-level feature interaction.
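
The fusion of wavelet and spatial features through a gate can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: it uses a single-level Haar transform, treats the 2x2 local mean as the "spatial" feature and the summed magnitude of the detail sub-bands as the "frequency" feature, and applies a per-position sigmoid gate with hand-set scalar weights (`w_s`, `w_f`, `bias`) where a trained network would learn them.

```python
import math

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.
    Returns (approximation, detail_1, detail_2, detail_3); sub-band naming
    conventions vary, so neutral names are used here."""
    ll, d1, d2, d3 = [], [], [], []
    for i in range(0, len(x), 2):
        r_ll, r1, r2, r3 = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)
            r1.append((a + b - c - d) / 2)   # difference between rows
            r2.append((a - b + c - d) / 2)   # difference between columns
            r3.append((a - b - c + d) / 2)   # diagonal difference
        ll.append(r_ll); d1.append(r1); d2.append(r2); d3.append(r3)
    return ll, d1, d2, d3

def spatial_frequency_features(x):
    """ASSUMED feature definitions for this sketch only."""
    ll, d1, d2, d3 = haar_dwt2(x)
    spatial = [[v / 2 for v in row] for row in ll]  # 2x2 local mean
    freq = [[abs(p) + abs(q) + abs(r) for p, q, r in zip(a, b, c)]
            for a, b, c in zip(d1, d2, d3)]
    return spatial, freq

def gated_fusion(spatial, freq, w_s=1.0, w_f=1.0, bias=0.0):
    """Per-position gate g = sigmoid(w_s*s + w_f*f + bias);
    output = g*s + (1-g)*f. Weights are hypothetical constants."""
    fused = []
    for s_row, f_row in zip(spatial, freq):
        row = []
        for s, f in zip(s_row, f_row):
            g = 1.0 / (1.0 + math.exp(-(w_s * s + w_f * f + bias)))
            row.append(g * s + (1.0 - g) * f)
        fused.append(row)
    return fused

patch = [[1.0, 0.0], [1.0, 0.0]]  # a vertical edge inside one 2x2 block
spatial, freq = spatial_frequency_features(patch)
fused = gated_fusion(spatial, freq)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the spatial and frequency responses at that position, so the output always stays between the two inputs.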

Datasets

FaceForensics++, DeepFake Detection (DFD), Celeb-DF v2 (CDF2), DeepFake Detection Challenge Preview (DFDCP), DeepFake Detection Challenge (DFDC), and UAD Fake Video (UADFV)

Model(s)

SFMFNet (Spatial-Frequency Aware Multi-Scale Fusion Network). Baselines used for comparison include RegNet, GoogLeNet, AlexNet, ResNet18, CNN-Aug, Xception, Capsule, FWA, X-ray, FFD, UCF, F3Net, and SPSL.

Author countries

China, Singapore