MARLIN: Masked Autoencoder for facial video Representation LearnINg

Authors: Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, Munawar Hayat

Published: 2022-11-12 10:29:05+00:00

Comment: CVPR 2023

AI Summary

This paper proposes MARLIN, a self-supervised facial video masked autoencoder, to learn universal facial representations from abundantly available non-annotated web-crawled videos. MARLIN reconstructs spatio-temporal facial details from densely masked regions, capturing local and global aspects for generic and transferable features. It demonstrates excellent performance across various facial analysis tasks including Facial Attribute Recognition, Facial Expression Recognition, DeepFake Detection, and Lip Synchronization.

Abstract

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions, which mainly include eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN .


Key findings
MARLIN consistently outperforms benchmarks on various downstream tasks: a 1.13% gain for Facial Attribute Recognition (over the supervised benchmark), a 2.64% gain for Facial Expression Recognition (over the unsupervised benchmark), a 1.86% gain for DeepFake Detection (over the unsupervised benchmark), and a 29.36% gain in Frechet Inception Distance for Lip Synchronization. The results indicate that it learns robust, generic, and transferable facial representations, and that it also performs well in the low-data regime.
Approach
MARLIN is a self-supervised facial video masked autoencoder that learns robust and generic facial embeddings. Its auxiliary task is to reconstruct the spatio-temporal details of the face from densely masked regions. Masking follows a facial-region-guided tube masking strategy (Fasking), and reconstruction is trained adversarially. This process helps capture both local and global facial information, leading to transferable features.
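The idea of facial-region-guided tube masking can be illustrated with a small sketch. This is a simplified illustration, not the authors' Fasking algorithm. The function name `facial_tube_mask`, the `region_grid` input (a per-patch label map where 0 means background and positive values mean facial regions such as eyes or lips), the default `mask_ratio` of 0.9, and the "facial patches first" selection rule are all assumptions made for the example.

```python
import random


def facial_tube_mask(region_grid, mask_ratio=0.9, num_frames=8, seed=0):
    """Build a tube mask that prefers facial patches (illustrative only).

    region_grid: 2D list of per-patch labels; 0 = background,
                 >0 = some facial region (hypothetical label scheme).
    Returns mask[t][row][col], where True means the patch is masked.
    The same spatial mask is repeated for every frame, forming "tubes".
    """
    rng = random.Random(seed)
    rows, cols = len(region_grid), len(region_grid[0])
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    n_mask = round(mask_ratio * len(patches))

    # Split patches into facial and background groups.
    face = [p for p in patches if region_grid[p[0]][p[1]] > 0]
    background = [p for p in patches if region_grid[p[0]][p[1]] == 0]
    rng.shuffle(face)
    rng.shuffle(background)

    # Assumption: mask facial patches first, then fill with background.
    chosen = set((face + background)[:n_mask])
    spatial = [[(r, c) in chosen for c in range(cols)] for r in range(rows)]

    # Repeat the spatial mask along the time axis (tube masking).
    return [[row[:] for row in spatial] for _ in range(num_frames)]


# Toy 4x4 patch grid with a 2x2 "face" in the centre.
grid = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 3, 0],
    [0, 0, 0, 0],
]
mask = facial_tube_mask(grid, mask_ratio=0.5, num_frames=4)
```

Because the spatial mask is shared across frames, a masked patch cannot be recovered by copying it from a neighbouring frame, which is the usual motivation for tube masking in video masked autoencoders.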
Datasets
YouTube Faces (YTF), CelebV-HQ, CMU-MOSEI, FaceForensics++ (FF++), LRS2
Model(s)
MARLIN (a facial video masked autoencoder with a Vision Transformer (ViT) backbone) and a discriminator for adversarial training.
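One way an autoencoder and a discriminator can be combined during training is sketched below. This is a generic illustration under stated assumptions, not the objective reported in the paper: it uses a mean squared error over masked positions plus the standard non-saturating GAN generator term. The names `masked_mse` and `generator_loss`, the scalar `disc_prob_real` (a hypothetical discriminator output giving the probability that the reconstruction is real), and the weight `adv_weight=0.01` are all invented for the example.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error computed over masked positions only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)


def generator_loss(pred, target, mask, disc_prob_real, adv_weight=0.01):
    """Reconstruction loss plus a weighted adversarial term.

    disc_prob_real: hypothetical discriminator output in (0, 1], the
                    probability it assigns to the reconstruction being real.
    adv_weight:     assumed trade-off weight, not taken from the paper.
    """
    rec = masked_mse(pred, target, mask)
    # Non-saturating GAN generator term: small when the discriminator is fooled.
    adv = -math.log(disc_prob_real)
    return rec + adv_weight * adv


# Toy flattened patch values; only the last two positions are masked.
pred = [1.0, 2.0, 3.0]
target = [1.0, 0.0, 5.0]
mask = [False, True, True]
loss = generator_loss(pred, target, mask, disc_prob_real=0.5)
```

In a real setup the predictions and discriminator scores would be tensors produced by the ViT-based autoencoder and the discriminator, and the discriminator would be updated with its own loss in alternation.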
Author countries
Australia, India