Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine-Grained Localization

Authors: Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury

Published: 2025-08-11 16:14:17+00:00

AI Summary

This paper presents robust solutions for deepfake video classification and fine-grained localization in both the audio and visual domains. The proposed methods, based on ensembles of specialized networks and an ActionFormer-inspired localization paradigm, were submitted to the ACM 1M Deepfakes Detection Challenge. They achieved the best performance in the temporal localization task and a top-four ranking in the classification task on the TestA split of the evaluation dataset.

Abstract

The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in the visual domain, the audio domain, or both, these subtle modifications add challenges for detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top-four ranking in the classification task on the TestA split of the evaluation dataset.


Key findings
The proposed system achieved first place in the temporal localization task (score of 67.20%) and fourth place in the classification task (AUC of 92.49%) on the TestA set of the 2025 ACM Multimedia 1M-Deepfakes Detection Challenge. The fusion of diverse audio and visual models proved effective, with the ResNet-based model demonstrating strong segment-boundary prediction for localization.
Approach
The authors propose an ensemble of specialized networks that independently target audio and visual manipulations for both classification and localization tasks. For classification, models are fused using score-level polynomial logistic regression. For localization, they adapt existing backbones with an ActionFormer-inspired training paradigm, utilizing frame-wise classification and segment boundary regression heads, and fuse segment proposals using Soft-NMS.
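The Soft-NMS fusion step mentioned above can be illustrated for one-dimensional temporal segments. The sketch below is a generic Gaussian Soft-NMS over [start, end] proposals, not the authors' actual implementation: rather than discarding proposals that overlap a higher-scoring segment, it decays their scores by a Gaussian of the temporal IoU, so near-duplicate segments survive with reduced confidence. The function name and parameter defaults are assumptions for illustration.

```python
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=1e-3):
    """Gaussian Soft-NMS for 1-D temporal segment proposals (illustrative).

    segments: (N, 2) array of [start, end] times
    scores:   (N,) confidence scores
    Returns kept segments and their (possibly decayed) scores,
    in descending order of selection.
    """
    segments = np.asarray(segments, dtype=float).copy()
    scores = np.asarray(scores, dtype=float).copy()
    keep_segs, keep_scores = [], []
    idx = np.arange(len(scores))
    while len(idx) > 0:
        # select the current highest-scoring proposal
        top = idx[np.argmax(scores[idx])]
        keep_segs.append(segments[top])
        keep_scores.append(scores[top])
        idx = idx[idx != top]
        if len(idx) == 0:
            break
        # temporal IoU between the kept segment and the remaining ones
        inter = np.maximum(
            0.0,
            np.minimum(segments[idx, 1], segments[top, 1])
            - np.maximum(segments[idx, 0], segments[top, 0]),
        )
        union = (
            (segments[idx, 1] - segments[idx, 0])
            + (segments[top, 1] - segments[top, 0])
            - inter
        )
        iou = inter / np.maximum(union, 1e-8)
        # Gaussian decay instead of hard suppression
        scores[idx] *= np.exp(-(iou ** 2) / sigma)
        idx = idx[scores[idx] > score_thresh]
    return np.array(keep_segs), np.array(keep_scores)
```

With this decay scheme, two heavily overlapping proposals are both retained, but the weaker one's score is pushed down, which is useful when segment proposals come from several independently trained audio and visual models.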
Datasets
AV-Deepfake1M++, 2025 ACM Multimedia 1M-Deepfakes Detection Challenge
Model(s)
ResNet-152 (with RawBoost for audio), MultiReso gMLP (with Wav2Vec2 transformer for audio), LipForensics (VSR encoder + MS-TCN backend for visual), SSL+LSTM (Wav2Vec2 + LSTM for audio localization), ActionFormer (localization paradigm), Soft-NMS (localization fusion).
Author countries
USA