Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

Authors: Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva

Published: 2025-11-27 13:30:59+00:00

AI Summary

The paper introduces FauxNet, a novel network for generalizable deepfake detection that leverages pre-trained Visual Speech Recognition (VSR) features extracted from lip movements in videos. FauxNet achieves state-of-the-art performance in zero-shot detection settings and is additionally designed to attribute, i.e., classify, the specific deepfake generation technique used. The authors further contribute two new large-scale datasets, Authentica-Vox and Authentica-HDTF, featuring recent audio- and video-driven deepfake methods.

Abstract

Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network, FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context is zero-shot, i.e., generalizable, detection, which is the focus of this work. FauxNet consistently outperforms the state of the art in this setting. In addition, FauxNet is able to attribute, i.e., distinguish between, the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the fake videos created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.


Key findings
FauxNet demonstrates significant superiority in zero-shot detection, achieving an AUC of 0.9969 on Authentica-HDTF when trained solely on FF++, substantially outperforming other methods. Furthermore, FauxNet excels at deepfake attribution, classifying generation techniques with high accuracy (93.35% on Authentica-Vox and 90.99% on Authentica-HDTF). This strong generalization ability is attributed to the distinctly separable feature clusters produced by the VSR encoder for real and fake videos.
Approach
FauxNet extracts visual speech features by cropping the lip region of each video frame and passing the frame sequence through a pre-trained VSR encoder (Auto-AVSR). The resulting temporal features are average-pooled into a unified video embedding, which feeds a multi-task learning framework comprising a BinaryHead for real/fake detection and a MultiHead for classifying the deepfake generation technique; a minimal sketch of this stage follows below.
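
To make the pipeline concrete, here is a minimal PyTorch sketch of the classification stage described above, treating the pre-trained VSR encoder as a frozen black box that emits per-frame features. The BinaryHead/MultiHead names and the temporal average pooling come from the paper's description; the feature dimension, hidden-layer sizes, and the unweighted sum of the two losses are illustrative assumptions, not the paper's actual values.

```python
import torch
import torch.nn as nn

class FauxNetHeads(nn.Module):
    """Sketch of FauxNet's multi-task classification stage.

    Assumes per-frame features from a frozen VSR encoder (Auto-AVSR in the
    paper); feat_dim and the MLP widths here are illustrative guesses.
    """

    def __init__(self, feat_dim=768, num_techniques=6):
        super().__init__()
        # BinaryHead: real vs. fake (name from the paper; layer sizes assumed)
        self.binary_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2)
        )
        # MultiHead: which generation technique produced the video
        self.multi_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_techniques)
        )

    def forward(self, vsr_features):
        # vsr_features: (batch, time, feat_dim) temporal features obtained by
        # passing lip-region crops through the pre-trained VSR encoder.
        # Average-pool over time to get one unified embedding per video.
        video_embedding = vsr_features.mean(dim=1)
        return self.binary_head(video_embedding), self.multi_head(video_embedding)


def training_step(model, vsr_features, is_fake, technique_label):
    # Multi-task objective: detection loss + attribution loss.
    # Equal weighting of the two terms is an assumption.
    bin_logits, multi_logits = model(vsr_features)
    loss = nn.functional.cross_entropy(bin_logits, is_fake)
    loss = loss + nn.functional.cross_entropy(multi_logits, technique_label)
    return loss


# Usage example with random stand-in features: batch of 4 videos, 8 frames,
# assuming 768-d encoder outputs.
model = FauxNetHeads()
feats = torch.randn(4, 8, 768)
bin_logits, multi_logits = model(feats)
```

Keeping the VSR encoder frozen and training only lightweight MLP heads is consistent with the generalization argument in the key findings: the real/fake separability is attributed to the encoder's features themselves rather than to a task-specific backbone.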
Datasets
Authentica-Vox, Authentica-HDTF, FaceForensics++ (FF++)
Model(s)
FauxNet (based on a VSR encoder, specifically Auto-AVSR [46]), multi-task MLP classifier
Author countries
India, France