Exposing Lip-syncing Deepfakes from Mouth Inconsistencies

Authors: Soumyya Kanti Datta, Shan Jia, Siwei Lyu

Published: 2024-01-18 16:35:37+00:00

AI Summary

This paper introduces LIPINC, a novel method for detecting lip-syncing deepfakes by identifying temporal inconsistencies within the mouth region of videos. The approach focuses on irregularities in mouth shape, coloration, and dental structure across adjacent and similar-pose frames. LIPINC successfully captures these subtle artifacts, outperforming state-of-the-art methods on multiple benchmark deepfake datasets.

Abstract

A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are created convincingly using AI models to match altered or entirely new audio. Lip-syncing deepfakes are a dangerous type of deepfakes as the artifacts are limited to the lip region and more difficult to discern. In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency (LIPINC), for lip-syncing deepfake detection by identifying temporal inconsistencies in the mouth region. These inconsistencies are seen in the adjacent frames and throughout the video. Our model can successfully capture these irregularities and outperforms the state-of-the-art methods on several benchmark deepfake datasets. Code is available at https://github.com/skrantidatta/LIPINC


Key findings

LIPINC performs strongly on in-domain lip-syncing deepfakes in FakeAVCeleb, often outperforming state-of-the-art methods. It also generalizes to cross-domain lip-syncing detection on KODF-LS and LSR+W2L, with AUC scores above 87%. Ablation studies confirm that local and global mouth frames, color and structure features, and the proposed inconsistency loss are all crucial for the model's effectiveness and generalization.
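
To make the AUC figures above concrete, here is a minimal sketch of how a ROC AUC can be computed from detector scores. It uses the pairwise-ranking definition: the probability that a randomly chosen fake video receives a higher score than a randomly chosen real one. The function name `roc_auc`, the labels, and the scores are made up for illustration and are not taken from the paper or its code.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise ranking; ties between a fake and a real count as half.

    labels: 1 for fake, 0 for real (hypothetical convention for this sketch).
    scores: higher means "more likely fake".
    """
    fake_scores = [s for l, s in zip(labels, scores) if l == 1]
    real_scores = [s for l, s in zip(labels, scores) if l == 0]
    if not fake_scores or not real_scores:
        raise ValueError("need at least one fake and one real sample")

    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0   # fake ranked above real: correct ordering
            elif f == r:
                wins += 0.5   # tie: counted as half
    return wins / (len(fake_scores) * len(real_scores))


# Illustrative values only: three fake and three real videos.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
example_auc = roc_auc(example_labels, example_scores)
```

In this toy case one fake video (score 0.4) is ranked below one real video (score 0.5), so 8 of the 9 fake/real pairs are ordered correctly. An AUC above 87% means that a comparable share of fake/real pairs is ranked correctly on the evaluated dataset.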

Approach

The LIPINC model uses a Local and Global Mouth Frame Extractor to select two sets of open-mouth frames from a video: adjacent frames (local) and frames with similar poses drawn from across the video (global). The extracted frames are processed by a Mouth Spatial-Temporal Inconsistency Extractor (MSTIE), which uses 3D-CNNs and a cross-attention module to encode color and structure features. Detection is guided by a novel inconsistency loss based on the Structural Similarity Index (SSIM), alongside a standard classification loss.
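
The idea of measuring mouth inconsistency with SSIM can be sketched as follows. This is not the authors' inconsistency loss, whose exact form is not given in this summary. It is a simplified stand-in that computes a single-window (global) SSIM between consecutive mouth crops and averages 1 - SSIM over the sequence. The function names, the flattened grayscale frame representation, and the toy pixel values are all assumptions made for illustration.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two flattened grayscale frames of equal size.

    Uses the standard SSIM constants C1 = (0.01 * L)^2 and C2 = (0.03 * L)^2,
    where L is the pixel value range. Real SSIM implementations usually
    average over local windows; this sketch uses one window for brevity.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def adjacent_inconsistency(frames):
    """Mean of (1 - SSIM) over consecutive frame pairs.

    Close to 0 when neighboring mouth crops look alike; larger when the
    mouth appearance changes abruptly between frames.
    """
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        raise ValueError("need at least two frames")
    return sum(1.0 - global_ssim(a, b) for a, b in pairs) / len(pairs)


# Toy 2x2 "mouth crops", flattened, with pixel values in [0, 1].
crop_a = [0.2, 0.4, 0.6, 0.8]
crop_b = [0.8, 0.6, 0.4, 0.2]  # same pixels, reversed structure

steady_score = adjacent_inconsistency([crop_a, crop_a, crop_a])
flicker_score = adjacent_inconsistency([crop_a, crop_b, crop_a])
```

Here `steady_score` is essentially zero because every frame matches its neighbor, while `flicker_score` is much larger because the structure flips between frames. In a trained detector, a differentiable version of such a term would be combined with the classification loss rather than used as a standalone score.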

Datasets

FakeAVCeleb (FakeAV-LS, FakeAV-FS), KODF (KODF-LS, KODF-FSGAN, KODF-DFL), LSR+W2L (generated using LRS2 as source)

Model(s)

Dlib (for face/mouth landmark extraction), 3D-CNN, Cross-attention, Adam optimizer

Author countries

United States