Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection

Authors: Hao Chen, Junnan Xu

Published: 2026-04-18 03:32:40+00:00

Comment: 8 pages, 4 figures. Keywords: deepfake detection, lip-sync forgery, biomechanical constraints, temporal kinematics, cross-lingual generalization, privacy-preserving detection, geometric features

AI Summary

This paper introduces BioLip, a lightweight framework for language-agnostic lip-sync deepfake detection. It operates by identifying violations of biomechanical constraints in synthetic videos, specifically an elevated temporal lip variance termed 'temporal lip jitter', which is consistent across language, ethnicity, and recording conditions. The framework processes 64 perioral landmark coordinates to detect these physics-grounded anomalies.

Abstract

Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, failing to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance -- a signal we term temporal lip jitter -- that is empirically consistent across the speaker's language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.
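
As a rough illustration (not the authors' exact formulation), temporal lip jitter can be approximated as the variance of frame-to-frame vertical displacement of perioral landmarks; the snippet below assumes landmark extraction and normalization have already been done.

    import numpy as np

    def temporal_lip_jitter(lip_y: np.ndarray) -> float:
        # lip_y: (T, L) array of normalized y-coordinates for L perioral
        # landmarks over T frames; returns a scalar jitter score.
        displacement = np.diff(lip_y, axis=0)          # frame-to-frame vertical motion
        return float(displacement.var(axis=0).mean())  # per-landmark variance, averaged

A markedly higher score than is typical for authentic speech would indicate the biomechanically implausible lip trajectories the paper attributes to generative models.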


Key findings
BioLip demonstrated strong zero-shot generalization, achieving AUCs of 0.905 on English, 0.779 on Mandarin Chinese, and 0.843 on the seven-language PolyGlotFake dataset (which includes a generator unseen during training), outperforming prior baselines. The study found that physics-grounded temporal kinematic features transfer consistently across languages, whereas spectral features encode language-dependent phonological patterns and degrade cross-lingual transfer. This confirms that biomechanical constraint violations, manifested as temporal lip jitter, are a reliable and universal signal for lip-sync deepfake detection.
Approach
BioLip extracts 64 perioral landmark coordinates from each video frame using MediaPipe, normalizes them, and computes four temporal kinematic statistics (displacement, velocity, acceleration, and jerk) from the y-coordinates over 25-frame sliding windows. The resulting 256-dimensional feature vectors, which serve as implicit physics priors, are fed into a lightweight Multilayer Perceptron (MLP) for binary real-versus-fake classification.
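
A minimal sketch of this pipeline under stated assumptions: each of 64 perioral landmarks contributes one y-coordinate per frame, windows span 25 frames, and each of the four kinematic orders is summarized per landmark (64 x 4 = 256 features). The normalization, the aggregation statistic (mean absolute value here), and the MLP layout are illustrative guesses, not the paper's exact choices.

    import numpy as np
    import torch
    import torch.nn as nn

    def kinematic_features(lip_y: np.ndarray) -> np.ndarray:
        # lip_y: (25, 64) window of normalized landmark y-coordinates.
        disp = lip_y - lip_y.mean(axis=0, keepdims=True)  # displacement from window mean (assumption)
        vel  = np.diff(lip_y, n=1, axis=0)                # velocity: first difference
        acc  = np.diff(lip_y, n=2, axis=0)                # acceleration: second difference
        jerk = np.diff(lip_y, n=3, axis=0)                # jerk: third difference
        # One summary statistic per landmark and kinematic order -> 64 * 4 = 256 dims.
        return np.concatenate([np.abs(x).mean(axis=0) for x in (disp, vel, acc, jerk)])

    # Hypothetical lightweight classifier; the paper's exact 107,777-parameter
    # architecture is not specified in this summary.
    classifier = nn.Sequential(
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 2),
    )

    window = np.random.rand(25, 64).astype(np.float32)   # placeholder landmark window
    logits = classifier(torch.from_numpy(kinematic_features(window)).unsqueeze(0))

Sliding the window over a clip and averaging (or voting over) the per-window logits would yield a video-level real/fake decision.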
Datasets
AVLips, CMLR, FakeAVCeleb, PolyGlotFake
Model(s)
Multilayer Perceptron (MLP) (107,777 parameters)
Author countries
UNKNOWN