Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

Authors: Soumyya Kanti Datta, Shan Jia, Siwei Lyu

Published: 2025-04-02 08:24:06+00:00

AI Summary

This paper introduces LIPINC-V2, a deepfake detection framework that uses a vision temporal transformer with multihead cross-attention to identify spatiotemporal inconsistencies in the mouth region of lip-syncing deepfakes. The framework achieves state-of-the-art performance on existing benchmark datasets and on a newly created dataset, LipSyncTIMIT.

Abstract

Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that leverages a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.


Key findings
LIPINC-V2 achieves state-of-the-art performance on multiple datasets, demonstrating robustness and generalization capabilities. The model shows resilience to video compression and resolution reduction. A segment-wise localization task further enhances the model's applicability to partially manipulated videos.
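The segment-wise localization finding can be pictured as sliding a clip-level detector over short overlapping windows of a video and flagging the windows scored as fake. The sketch below is only an illustration of that idea, not the authors' pipeline; the window size, stride, and the `detector` callable are hypothetical.

```python
# Minimal illustration (assumed setup, not the paper's code): score short overlapping
# windows of mouth crops with a clip-level detector to localize manipulated segments.
import torch

def segmentwise_scores(frames, detector, win=16, stride=8):
    # frames: (T, C, H, W) mouth crops for one video
    # detector: hypothetical callable mapping a (1, win, C, H, W) clip to a fake logit
    scores = []
    for start in range(0, max(1, frames.shape[0] - win + 1), stride):
        clip = frames[start:start + win].unsqueeze(0)  # add batch dimension
        with torch.no_grad():
            scores.append((start, torch.sigmoid(detector(clip)).item()))
    return scores  # list of (segment start frame, fake probability)
```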
Approach
LIPINC-V2 detects lip-syncing deepfakes by analyzing spatiotemporal inconsistencies in the mouth region. It leverages a vision temporal transformer with multihead cross-attention to capture both short-term (adjacent frames) and long-term (similarly posed frames) variations in mouth movements. An inconsistency loss function further enhances the model's ability to identify subtle irregularities.
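A minimal PyTorch sketch of this idea is given below. It is not the authors' implementation: the layer sizes, the residual fusion layout, and the exact form of the inconsistency term are illustrative assumptions. The sketch fuses short-term (adjacent-frame) and long-term (similarly posed frame) mouth embeddings with multihead cross-attention, and adds a simple penalty that encourages adjacent mouth features of real videos to stay consistent.

```python
# Sketch (assumptions, not the authors' code): cross-attention fusion of short-term and
# long-term mouth features, plus an illustrative pairwise inconsistency loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MouthCrossAttentionFusion(nn.Module):
    """Fuses adjacent-frame and similarly-posed-frame mouth embeddings
    with multihead cross-attention and scores the clip as real or fake."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, short_term, long_term):
        # short_term: (B, T, dim) embeddings of adjacent mouth crops
        # long_term:  (B, K, dim) embeddings of similarly posed frames across the video
        fused, _ = self.cross_attn(query=short_term, key=long_term, value=long_term)
        fused = self.norm(fused + short_term)        # residual connection
        logits = self.classifier(fused.mean(dim=1))  # video-level real/fake score
        return logits.squeeze(-1)

def inconsistency_loss(frame_feats, labels):
    """Illustrative inconsistency term: adjacent mouth features of real videos
    (label 0) are pushed to be similar; fake videos (label 1) are not penalized."""
    diffs = 1 - F.cosine_similarity(frame_feats[:, 1:], frame_feats[:, :-1], dim=-1)
    return (diffs.mean(dim=1) * (1 - labels.float())).mean()

if __name__ == "__main__":
    model = MouthCrossAttentionFusion()
    short = torch.randn(2, 8, 256)   # 8 adjacent mouth crops per clip
    longr = torch.randn(2, 5, 256)   # 5 similarly posed frames per clip
    labels = torch.tensor([0, 1])
    logits = model(short, longr)
    loss = F.binary_cross_entropy_with_logits(logits, labels.float()) \
           + 0.1 * inconsistency_loss(short, labels)
    print(logits.shape, loss.item())
```

Using the adjacent-frame stream as the query means each local mouth pattern is compared against similarly posed frames elsewhere in the video, which is one plausible way to surface the short-term and long-term inconsistencies the paper targets.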
Datasets
FakeAVCeleb, LipSyncTIMIT (newly created), KODF
Model(s)
Vision Temporal Transformer with multihead cross-attention
Author countries
USA, China