FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Authors: Zhilin Tu, Kemou Li, Fengpeng Li, Jianwei Fei, Jiamin Zhang, Haiwei Wu

Published: 2026-03-23 12:55:31+00:00

Comment: 6th place (6/507) technical report at the NTIRE 2026: Robust AI-Generated Image Detection in the Wild Challenge

AI Summary

The paper introduces FeatDistill, a framework for robust AI-generated image detection designed for the NTIRE Challenge. It combines a multi-expert ensemble with feature distillation and extensive data degradation modeling to address issues like degradation interference, insufficient feature representation, and limited generalization in real-world scenarios. The framework achieves strong robustness and generalization, offering a practical solution for deepfake image detection under diverse "in-the-wild" conditions.

Abstract

The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse "in-the-wild" conditions, offering an effective and practical solution for real-world deepfake image detection.


Key findings
The multi-expert ensemble of CLIP-L/14 and SigLIP-400M models achieved the best generalization, with a Robust ROC AUC of 0.856 on the challenging Online Test (Hard) set. Dense feature-level distillation significantly improved performance, boosting CLIP-L/14's Robust ROC AUC from 0.8926 to 0.934 on the Online Validation set. Combining dense supervision, multi-source external data, and extended degradation strategies proved crucial for substantial improvements in detection robustness and generalization across diverse "in-the-wild" conditions.
Approach
FeatDistill employs a multi-expert ensemble of four Vision Transformer (ViT) backbones (CLIP and SigLIP variants) to capture complementary forensic cues. It uses a two-stage training paradigm, first with standard binary classification, then refined by dense feature-level self-distillation for representation alignment. Data coverage is enhanced by expanding the training set with external samples and introducing comprehensive degradation modeling.
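The second training stage aligns dense features between the fine-tuned model and a reference copy of itself. The paper's exact objective is not given in this summary, so the following is only an illustrative sketch (in NumPy, with hypothetical shapes): a mean-squared error between L2-normalized per-token features, a common form for dense feature-level distillation.

```python
import numpy as np

def dense_feature_distill_loss(student_feats: np.ndarray,
                               teacher_feats: np.ndarray) -> float:
    """Illustrative dense feature-level distillation loss.

    Both inputs have shape (num_tokens, dim), e.g. the per-patch
    token features of a ViT backbone. Each token feature is
    L2-normalized, then the loss is the mean squared difference.
    NOTE: this is a hedged sketch, not FeatDistill's exact objective.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

# Toy usage: 4 tokens of dimension 8
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))
teacher = rng.normal(size=(4, 8))
loss = dense_feature_distill_loss(student, teacher)
```

Normalizing before the MSE makes the loss depend only on feature direction, which matches the summary's emphasis on "representation alignment" rather than matching raw activation magnitudes.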
Datasets
NTIRE 2026 Robust AI-Generated Image Detection in the Wild competition data (including Toy, Training, 1st Validation, Hard Validation, Public Test, Private Test splits), DiTFake, DiffFace, De-Factify, Deepfake-60K.
Model(s)
Vision Transformer (ViT) backbones, specifically CLIP ViT-L/14 and SigLIP So400M variants. The ensemble consists of two CLIP ViT-L/14 models and two SigLIP So400M models.
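At inference time, the abstract states that the final decision is the average of the probabilities from the four independently trained experts. A minimal sketch of that ensembling step (function name and inputs are illustrative):

```python
import numpy as np

def ensemble_predict(expert_probs: list[float]) -> float:
    """Average per-expert probabilities for one image.

    In FeatDistill the list would hold four scores, one from each
    expert (2x CLIP ViT-L/14, 2x SigLIP So400M); this sketch simply
    takes the unweighted mean, as described in the abstract.
    """
    return float(np.mean(expert_probs))

# Hypothetical per-expert "fake" probabilities for one image
score = ensemble_predict([0.91, 0.87, 0.95, 0.89])  # -> 0.905
```

Unweighted averaging keeps inference simple and, per the abstract, yields stable decisions across unseen generators while adding no learned fusion parameters.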
Author countries
China, Macau, Saudi Arabia, Italy