Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Authors: Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

Published: 2025-10-03 22:37:03+00:00

AI Summary

AI-based videoconferencing systems are highly vulnerable to puppeteering attacks where an attacker hijacks a victim's identity by manipulating the transmitted pose-expression latent embedding. This paper introduces the first biometric leakage defense that operates entirely in the latent domain, exploiting identity cues inadvertently contained within these embeddings. By using a pose-conditioned, large-margin contrastive encoder, the method successfully isolates persistent identity cues from transient pose and expression, enabling real-time detection of illicit identity swaps.

Abstract

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
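As a rough illustration of the detection step described above, the Python sketch below flags frames whose disentangled identity embedding drifts away from a reference embedding established at the start of the call. The function names and the threshold are hypothetical; the paper's actual operating point would be chosen from validation data.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two identity-embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def flag_puppeteering(reference_emb, live_embs, threshold=0.5):
        """
        Compare each live frame's disentangled embedding against the call's
        initial reference embedding; low similarity suggests an identity swap.
        `threshold` is illustrative only.
        """
        scores = [cosine_similarity(reference_emb, e) for e in live_embs]
        flags = [s < threshold for s in scores]
        return flags, scores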


Key findings
The proposed method achieves state-of-the-art puppeteering detection with an average AUC exceeding 0.97 across diverse generator/dataset combinations, significantly outperforming deepfake detectors and previous puppeteering defenses. It demonstrates strong generalization capabilities to out-of-domain scenarios, maintaining an average AUC of 0.925 when tested on unseen datasets. The defense operates efficiently in real-time, achieving 75 FPS, making it practical for deployment in bandwidth-constrained videoconferencing systems.
Approach
The raw pose-expression latent vectors are re-encoded into a compact Enhanced Biometric Leakage (EBL) space using lightweight MLP projection heads. This re-encoding is optimized with a novel Pose-Conditioned Large-Margin Cosine Loss (PC-LMCL) that maximizes identity separability while actively suppressing pose variance. Puppeteering is flagged in real time by measuring the cosine similarity between the live EBL embedding and a reference embedding captured at the start of the call, with the resulting per-frame scores stabilized by a temporal LSTM aggregator (see the sketches below).
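The exact form of PC-LMCL is not given in this summary. The PyTorch sketch below combines a conventional large-margin cosine (CosFace-style) identity term with an illustrative pose-suppression penalty; the pose-conditioning term here is an assumption added for illustration, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def lmcl_loss(embeddings, labels, class_weights, s=30.0, m=0.35):
        """
        Large-margin cosine loss on L2-normalized embeddings: the margin m
        pulls same-identity embeddings together and pushes different
        identities apart on the unit hypersphere.
        """
        emb = F.normalize(embeddings, dim=1)        # (B, D)
        w = F.normalize(class_weights, dim=1)       # (C, D)
        cos = emb @ w.t()                           # (B, C) cosine logits
        one_hot = F.one_hot(labels, num_classes=w.size(0)).float()
        logits = s * (cos - m * one_hot)            # margin on the true class only
        return F.cross_entropy(logits, labels)

    def pose_conditioned_lmcl(embeddings, labels, class_weights, pose, lambda_pose=0.1):
        """
        Illustrative PC-LMCL variant: identity term plus a hypothetical
        penalty on the cross-covariance between embeddings and pose features,
        discouraging the EBL space from encoding pose.
        """
        id_term = lmcl_loss(embeddings, labels, class_weights)
        emb_c = embeddings - embeddings.mean(dim=0, keepdim=True)
        pose_c = pose - pose.mean(dim=0, keepdim=True)
        cross_cov = emb_c.t() @ pose_c / embeddings.size(0)   # (D, P)
        pose_term = cross_cov.pow(2).mean()
        return id_term + lambda_pose * pose_term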
Datasets
NVFAIR dataset pool, including NVIDIA VC (NVC), RAVDESS (RAV), and CREMA-D (CRD). Videos generated using 3DFaceShop, MCNET, EMOPortraits, SDFR, and LivePortrait.
Model(s)
Multi-Layer Perceptrons (MLPs) as projection heads, trained with a Pose-Conditioned Large-Margin Cosine Loss (PC-LMCL), combined with a two-layer Temporal LSTM for score aggregation.
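A minimal sketch of the temporal aggregation stage, assuming the two-layer LSTM consumes per-frame cosine-similarity scores and outputs a windowed puppeteering probability; the input features and output head used in the paper may differ.

    import torch
    import torch.nn as nn

    class TemporalScoreAggregator(nn.Module):
        """Two-layer LSTM that smooths noisy per-frame similarity scores."""

        def __init__(self, input_dim=1, hidden_dim=32):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, frame_scores):
            # frame_scores: (batch, time, 1) per-frame cosine similarities
            out, _ = self.lstm(frame_scores)
            # Probability that the current window is puppeteered.
            return torch.sigmoid(self.head(out[:, -1]))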
Author countries
USA