PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Authors: Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu

Published: 2025-10-16 02:51:42+00:00

AI Summary

This paper introduces PIA (Phoneme-Temporal and Identity-Dynamic Analysis), a novel multimodal audio-visual framework for deepfake detection. It addresses limitations of conventional methods by integrating language, dynamic face motion, and facial identification cues to detect subtle temporal discrepancies. PIA leverages phoneme sequences, lip geometry data, and facial identity embeddings to identify inconsistencies across multiple complementary modalities, significantly improving deepfake detection.

Abstract

The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, basic frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA


Key findings
PIA achieves state-of-the-art performance on the FakeAVCeleb dataset, with ACC and AUC scores of 98.7% and 99.8%, respectively, and demonstrates robust generalization in cross-manipulation settings. It also attains a high AUC of 98.06% on the high-resolution DeepSpeak v2.0 dataset, showing its effectiveness against diverse, high-quality deepfakes. Ablation studies confirm that each multimodal component contributes to the model's performance, with viseme image cues contributing most.
Approach
The PIA framework integrates multimodal features, including lip geometry, viseme image crops, and ArcFace identity embeddings, extracted from frames pre-filtered by WhisperX/wav2vec2 phoneme alignment. These features are processed by dedicated encoders (a 3D CNN with an EfficientNet-B0 backbone for visual features, MLPs for the others) and fused with a multi-head attention mechanism. An auxiliary ArcFace Temporal Consistency Loss penalizes abrupt identity shifts between consecutive frames, alongside a cross-entropy loss for classification.
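The fusion-and-loss design described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, MLP depths, head count, and the exact form of the temporal consistency loss (here, pushing cosine similarity between consecutive-frame ArcFace embeddings toward 1) are all assumptions; only the overall structure (per-modality encoders, multi-head attention fusion, an identity-smoothness penalty) follows the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFusion(nn.Module):
    """Hypothetical PIA-style fusion head.

    Each modality (lip geometry, viseme-crop features, ArcFace identity
    embedding) is projected to a shared dimension by a small MLP, the
    resulting tokens are fused with multi-head attention, and the pooled
    representation is classified as real vs. fake. Dimensions are
    illustrative placeholders, not the paper's actual values.
    """

    def __init__(self, dims=(40, 1280, 512), d_model=256, n_heads=4):
        super().__init__()
        # One projection MLP per modality stream.
        self.proj = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for d in dims
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls = nn.Linear(d_model, 2)  # logits for {real, fake}

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, one per modality.
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # (batch, n_mods, d_model)
        return self.cls(fused.mean(dim=1))            # pooled logits


def identity_temporal_consistency_loss(id_embs):
    """Penalize abrupt identity shifts across frames.

    id_embs: (batch, T, 512) per-frame ArcFace embeddings. Consecutive
    frames of a genuine face should keep a near-constant identity, so we
    penalize 1 - cos(e_t, e_{t+1}) averaged over time. This is one
    plausible reading of the auxiliary loss, not its exact formulation.
    """
    sim = F.cosine_similarity(id_embs[:, 1:], id_embs[:, :-1], dim=-1)
    return (1.0 - sim).mean()
```

In training, this auxiliary term would be added to the cross-entropy classification loss with some weighting coefficient; for identical embeddings in every frame the penalty is zero, and it grows as the identity drifts between frames.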
Datasets
FakeAVCeleb, DeepSpeak v2.0
Model(s)
3D Convolutional Network (with EfficientNet-B0 backbone), Multi-head Attention, Multilayer Perceptrons, WhisperX, wav2vec2, MediaPipe FaceMesh, ArcFace (from InsightFace)
Author countries
USA