PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Authors: Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu

Published: 2025-10-16 02:51:42+00:00

AI Summary

PIA is a novel multimodal audio-visual framework for deepfake detection designed to overcome the limitations of traditional detectors against advanced generative models. It integrates phoneme sequences, lip geometry data, and facial identity embeddings to identify subtle temporal and cross-modal inconsistencies, exploiting discrepancies in language, facial motion, and identity dynamics for robust detection.

Abstract

The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA
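To make the phoneme-conditioned filtering described in the abstract concrete, the minimal sketch below (not the released PIA code; the WhisperX model size, frame rate, and helper name are assumptions) obtains word-level alignment from WhisperX and keeps only the video frames that overlap aligned speech.

```python
# Sketch: speech-aligned frame selection (assumed helper; not the PIA implementation).
import whisperx

def speech_aligned_frame_indices(audio_path, fps=25.0, device="cpu"):
    """Return indices of video frames that overlap aligned speech segments."""
    model = whisperx.load_model("small", device, compute_type="int8")  # model size is an assumption
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio)

    # Forced alignment yields per-word start/end times (WhisperX aligns with a
    # wav2vec2-based phoneme model internally).
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

    keep = set()
    for segment in aligned["segments"]:
        for word in segment.get("words", []):
            start, end = word.get("start"), word.get("end")
            if start is None or end is None:
                continue
            first, last = int(start * fps), int(end * fps)
            keep.update(range(first, last + 1))
    return sorted(keep)
```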


Key findings
PIA achieves state-of-the-art performance, reaching 99.8% AUC on FakeAVCeleb and 98.06% AUC on the high-resolution DeepSpeak v2.0 dataset. Ablation studies confirm the critical role of integrating visual, geometric, and identity cues, demonstrating robustness across lip-sync, face-swap, and avatar manipulations. The method shows superior generalization capabilities in cross-manipulation settings compared to baseline models.
Approach
PIA uses a three-stream architecture that encodes viseme image crops (via a 3D CNN and EfficientNet-B0), mouth aspect ratio (MAR) lip-geometry descriptors, and ArcFace identity embeddings, with frames filtered according to WhisperX phoneme alignment. The three streams are fused with a multi-head attention mechanism for the final classification. An auxiliary ArcFace Temporal Consistency Loss penalizes abrupt identity shifts, improving detection of temporal inconsistencies introduced by manipulations such as face-swaps.
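The block below is a rough sketch of this fusion head and auxiliary penalty; the layer sizes, feature dimensions, and loss weighting are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of a three-stream fusion head with an identity temporal-consistency
# penalty; dimensions and the 0.1 loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, viseme_dim=1280, geom_dim=16, id_dim=512, d_model=256, heads=4):
        super().__init__()
        # Project each stream (viseme crops, MAR lip geometry, ArcFace identity)
        # into a shared embedding space.
        self.proj_v = nn.Linear(viseme_dim, d_model)
        self.proj_g = nn.Linear(geom_dim, d_model)
        self.proj_i = nn.Linear(id_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cls = nn.Linear(d_model, 2)  # real vs. fake

    def forward(self, viseme, geom, ident):
        # Each input: (batch, time, feature_dim); the three projected streams are
        # concatenated along the time axis and fused by self-attention.
        tokens = torch.cat(
            [self.proj_v(viseme), self.proj_g(geom), self.proj_i(ident)], dim=1
        )
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.cls(fused.mean(dim=1))

def identity_temporal_consistency(ident):
    """Penalize abrupt frame-to-frame shifts in L2-normalized identity embeddings."""
    ident = F.normalize(ident, dim=-1)
    # 1 - cosine similarity between consecutive frames, averaged over time and batch.
    cos = (ident[:, 1:] * ident[:, :-1]).sum(dim=-1)
    return (1.0 - cos).mean()

# Usage sketch: total loss = classification loss + lambda * consistency penalty.
if __name__ == "__main__":
    head = FusionHead()
    v = torch.randn(2, 30, 1280)   # EfficientNet-B0-style viseme features (assumed dim)
    g = torch.randn(2, 30, 16)     # lip-geometry (MAR) descriptors (assumed dim)
    i = torch.randn(2, 30, 512)    # ArcFace identity embeddings
    labels = torch.tensor([0, 1])
    logits = head(v, g, i)
    loss = F.cross_entropy(logits, labels) + 0.1 * identity_temporal_consistency(i)
    print(loss.item())
```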
Datasets
FakeAVCeleb, DeepSpeak v2.0
Model(s)
3D Convolutional Network, EfficientNet-B0, Multi-head Attention, ArcFace, WhisperX, wav2vec2, MediaPipe
Author countries
USA