Out-of-the-Box Age Estimation Through Facial Imagery: A Comprehensive Benchmark of Vision-Language Models vs. Traditional Architectures

Authors: Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang

Published: 2026-02-08 04:44:31+00:00

AI Summary

This paper presents the first large-scale benchmark comparing 34 models, including 22 specialized architectures and 12 general-purpose Vision-Language Models (VLMs), for facial age estimation across eight standard datasets. It reveals that zero-shot VLMs significantly outperform most specialized models, achieving a 43% lower average Mean Absolute Error (MAE) and challenging the necessity of task-specific architectures. The study also highlights VLMs' superior performance in age verification tasks at the 18-year threshold.
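The headline comparison rests on Mean Absolute Error, the average gap in years between predicted and true age. A minimal sketch of the metric and of the 43% relative-improvement figure derived from the reported averages (5.65 vs. 9.88 years):

```python
# Mean Absolute Error (MAE) for age estimation: average absolute gap,
# in years, between predicted and ground-truth ages.
def mae(predicted, true):
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)

# Toy example: two faces, predictions off by 2 and 5 years.
print(mae([20, 30], [18, 35]))  # → 3.5

# Relative improvement behind the "43% lower MAE" claim, from the
# paper's reported averages (VLMs: 5.65 years; non-LLM models: 9.88).
vlm_mae, non_llm_mae = 5.65, 9.88
improvement = (non_llm_mae - vlm_mae) / non_llm_mae
print(f"{improvement:.0%}")  # → 43%
```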

Abstract

Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs - across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) surpasses the strongest non-LLM model (MiVOLO, MAE 5.10) by 15%. MiVOLO - unique in combining face and body features using Vision Transformers - is the only specialized model that remains competitive with VLMs. We further analyze age verification at the 18-year threshold and find that most non-LLM models exhibit false adult rates between 39% and 100% for minors, whereas VLMs reduce this to 16%-29%. Additionally, coarse age binning (8-9 classes) consistently increases MAE beyond 13 years. Stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (under 5 and over 65). Overall, these findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation and suggest that future work should focus on distilling VLM capabilities into efficient specialized models.


Key findings
Zero-shot Vision-Language Models (VLMs) dramatically outperform most specialized age estimation models, achieving an average MAE of 5.65 years compared to 9.88 years for non-LLM models. VLMs also show significantly lower false adult rates (16-29%) for minors at the 18-year threshold, crucial for age verification. Only MiVOLO, which uniquely combines face and body features using Vision Transformers, remains competitive among specialized models.
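The "false adult rate" quoted above can be made concrete: among subjects whose true age is under 18, it is the fraction a model's age prediction pushes over the threshold. A minimal sketch (the sample ages below are hypothetical, not taken from the benchmark):

```python
def false_adult_rate(pred_ages, true_ages, threshold=18):
    """Fraction of true minors (< threshold) whose predicted age
    lands at or above the threshold, i.e. minors misread as adults."""
    minor_preds = [p for p, t in zip(pred_ages, true_ages) if t < threshold]
    if not minor_preds:
        return 0.0
    return sum(p >= threshold for p in minor_preds) / len(minor_preds)

# Hypothetical predictions for four subjects with true ages 15, 16, 30, 17:
# the 30-year-old is ignored; one of the three minors (predicted 19) is
# misclassified as an adult.
print(false_adult_rate([19, 16, 25, 17], [15, 16, 30, 17]))  # → 0.3333...
```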
Approach
The researchers conducted a comprehensive cross-paradigm benchmark to evaluate 34 existing models (22 specialized architectures and 12 zero-shot Vision-Language Models) on their ability to estimate age from facial imagery. They measured performance primarily using Mean Absolute Error (MAE) and further analyzed age verification rates at the 18-year threshold, as well as stratified performance across different age groups.
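The stratified analysis groups test samples by true age and computes MAE per bucket, which is how the weakness at extreme ages (under 5 and over 65) surfaces. A minimal sketch of that bookkeeping, using illustrative 5-year buckets rather than the paper's exact 14-group boundaries:

```python
from collections import defaultdict

def stratified_mae(pred, true, bucket_size=5):
    """Per-bucket MAE keyed by the lower edge of each true-age bucket.
    Bucket width is illustrative; the paper uses 14 age groups."""
    errors = defaultdict(list)
    for p, t in zip(pred, true):
        errors[t // bucket_size * bucket_size].append(abs(p - t))
    return {lo: sum(errs) / len(errs) for lo, errs in sorted(errors.items())}

# Toy example: errors of 1, 1, and 4 years for true ages 3, 31, and 66
# show up in the [0, 5), [30, 35), and [65, 70) buckets respectively.
print(stratified_mae([4, 30, 70], [3, 31, 66]))
```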
Datasets
UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB
Model(s)
Vision-Language Models: Gemini 3 Flash Preview, Gemini 2.5 Flash, GPT-5 Nano, GPT-5.2, Qwen3-VL 235B, Seed-1.6, Kimi-K2.5, Claude Sonnet 4.5, Mistral Small 3.2, Llama 4 Maverick, Grok 4.1 Fast, Claude Haiku 4.5. Specialized Architectures: MiVOLO (ViT + YOLOv8), Herosan-Age, Mivialab-Age, DEX (VGG-16), BoyuanJiang (TF), Yu4u (PyTorch), Gitliber, Muno-AI (HF), Py-Agender, TUT-Live-Age, DeepFace, Cetinsamet, MWR (VGG-16 + k-NN), SSR-Net, ManelBadri (HF), UniFace, FaceAge (Harvard), InsightFace, ChienThan, FaceXFormer, AtulSingh, Nixrajput.
Author countries
USA