Exposing Lip-syncing Deepfakes from Mouth Inconsistencies

Authors: Soumyya Kanti Datta, Shan Jia, Siwei Lyu

Published: 2024-01-18 16:35:37+00:00

AI Summary

This paper introduces LIPINC, a novel approach for lip-syncing deepfake detection that identifies temporal inconsistencies in the mouth region of videos. LIPINC outperforms state-of-the-art methods on benchmark deepfake datasets by focusing on inconsistencies in adjacent and globally similar mouth poses.

Abstract

A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are convincingly generated by AI models to match altered or entirely new audio. Lip-syncing deepfakes are a dangerous type of deepfake because the artifacts are limited to the lip region and are therefore more difficult to discern. In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency (LIPINC), for lip-syncing deepfake detection by identifying temporal inconsistencies in the mouth region. These inconsistencies appear both between adjacent frames and across the video as a whole. Our model successfully captures these irregularities and outperforms state-of-the-art methods on several benchmark deepfake datasets. Code is available at https://github.com/skrantidatta/LIPINC

Key findings

LIPINC achieves state-of-the-art performance on the FakeAVCeleb dataset. It also generalizes well to other datasets (KODF, LSR+W2L), outperforming existing methods on several metrics (Precision, Accuracy, AP, AUC). The ablation study highlights the importance of analyzing both local and global mouth frames, as well as the contribution of the proposed inconsistency loss function.

Approach

LIPINC detects lip-syncing deepfakes by analyzing spatiotemporal inconsistencies in the mouth region. It extracts both locally adjacent and globally similar mouth frames, uses a 3D-CNN and cross-attention to learn features that represent these inconsistencies, and then classifies the video as real or fake.
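
The frame-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_mouth_frames`, the window sizes, and the similarity measure (cosine similarity on flattened pixels of pre-cropped mouth images) are all assumptions made for illustration, and the actual LIPINC code may measure mouth-pose similarity differently.

```python
import numpy as np

def select_mouth_frames(mouth_crops, start, n_local=5, n_global=5):
    """Pick locally adjacent and globally similar mouth frames (illustrative).

    mouth_crops: array of shape (T, H, W, C) holding pre-cropped mouth images.
    start:       index of the reference frame that opens the local window.
    Returns (local_idx, global_idx) as sorted integer index arrays.
    """
    T = mouth_crops.shape[0]
    if start < 0 or start + n_local > T:
        raise ValueError("local window falls outside the video")
    if T - n_local < n_global:
        raise ValueError("not enough frames outside the local window")

    # Local branch: consecutive frames starting at the reference frame.
    local_idx = np.arange(start, start + n_local)

    # Global branch (assumed similarity measure): cosine similarity between
    # the flattened reference crop and every other crop in the video.
    flat = mouth_crops.reshape(T, -1).astype(np.float64)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12
    sim = flat @ flat[start]

    # Exclude the local window so the two branches do not overlap.
    sim[local_idx] = -np.inf
    global_idx = np.sort(np.argsort(-sim)[:n_global])
    return local_idx, global_idx
```

In a pipeline like the one summarized here, the two index sets would be used to gather two stacks of mouth crops, one per branch, before feature extraction.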

Datasets

FakeAVCeleb, KODF, LSR+W2L

Model(s)

3D-CNN, cross-attention
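
As a rough illustration of the cross-attention component listed above, the following sketch implements generic single-head scaled dot-product cross-attention in NumPy. The summary does not state which features act as queries and which as context in LIPINC, so the argument names and projection matrices (`w_q`, `w_k`, `w_v`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention.

    queries: (N_q, D) features from one branch (assumed, e.g. local frames).
    context: (N_c, D) features from the other branch (assumed, e.g. global).
    w_q, w_k, w_v: (D, D_h) projection matrices.
    Returns (attended, weights) with shapes (N_q, D_h) and (N_q, N_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights
```

Each output row is a weighted mix of the context features, so a query frame can draw on whichever context frames it matches most closely.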

Author countries

USA
