Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

Authors: Izaldein Al-Zyoud, Abdulmotaleb El Saddik

Published: 2026-05-07 15:32:47+00:00

AI Summary

This paper introduces Representational Consistency (RC), a concept for assessing whether a frozen vision foundation model represents a single input coherently across its semantic subregions. It proposes Dimensional Coactivation (DCA), a per-dimension instrument that avoids operations common to standard similarity measures, such as centering and L2 normalization, in order to preserve intra-sample signal. DCA is validated on cross-dataset deepfake detection, where synthetic faces break this representational coherence and can therefore be detected.

Abstract

Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.


Key findings
DCA achieved 0.9106 AUC on CelebDF-v2 and 0.9289 AUC on DFD under FF++ c23 cross-dataset transfer, outperforming single-region and scalar-reduced baselines. Ablations show that each avoided operation costs performance on CelebDF-v2: reintroducing centering collapses AUC to 0.459, cross-dimension coupling collapses it to 0.478, and L2 normalization reduces it to 0.862. DCA also depends on the stability of the backbone's coordinate system: replacing DINOv3 with FaRL collapsed CelebDF-v2 AUC to 0.582.
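The ablation result can be made more concrete with a toy illustration of what centering and L2 normalization discard. The sketch below is not the paper's ablation code. The vectors, the function names, and the use of a per-dimension product as the coactivation are all assumptions chosen for illustration. It shows only that a uniform shift in mean activation disappears under centering, and that a change in overall magnitude disappears under L2 normalization, while the raw per-dimension products keep both.

```python
import math

def coactivation(u, v):
    """Per-dimension product of two region vectors (no coupling across dimensions)."""
    return [a * b for a, b in zip(u, v)]

def center(v):
    """Subtract the mean activation, as centered similarity measures do."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def l2_normalize(v):
    """Rescale to unit length, discarding overall magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical 3-dimensional region vectors (toy values, not real features).
eyes = [2.0, 4.0, 6.0]
mouth = [1.0, 2.0, 3.0]
mouth_shifted = [x + 10.0 for x in mouth]  # same shape, higher mean activation
mouth_scaled = [x * 2.0 for x in mouth]    # same direction, double magnitude

# Raw per-dimension products distinguish all three mouth variants.
raw = coactivation(eyes, mouth)
raw_shifted = coactivation(eyes, mouth_shifted)
raw_scaled = coactivation(eyes, mouth_scaled)

# Centering maps the shifted variant onto the original.
centered_same = center(mouth) == center(mouth_shifted)

# L2 normalization maps the scaled variant onto the original (up to rounding).
normalized_same = all(
    math.isclose(a, b)
    for a, b in zip(l2_normalize(mouth), l2_normalize(mouth_scaled))
)
```

Here `raw`, `raw_shifted`, and `raw_scaled` all differ, while `centered_same` and `normalized_same` are both true. This is the sense in which the two operations remove signal that the intra-sample setting may need.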
Approach
The authors propose Dimensional Coactivation (DCA) to measure representational consistency within a frozen vision foundation model. DCA compares semantic regions of an input (e.g., eyes, mouth, nose) by asking whether the same feature dimensions coactivate across them, generating a per-sample, per-dimension, magnitude-preserving, and regional fingerprint. Crucially, DCA avoids centering, L2 normalization, and full Gram coupling to preserve raw magnitude and mean activation signals specific to intra-sample coherence.
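A minimal sketch of how such a fingerprint could be assembled is given below. The exact DCA formula is not stated here, so the sketch rests on assumptions: each region is summarized by the mean of its patch-token features, coactivation between two regions is taken as the per-dimension product of those means, and the fingerprint concatenates the products over all region pairs. The names `region_mean` and `dca_fingerprint` and the toy 3-dimensional features are hypothetical.

```python
from itertools import combinations

def region_mean(tokens):
    """Average the patch-token features of one region, dimension by dimension."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def dca_fingerprint(regions):
    """Concatenate per-dimension products for every pair of regions.

    regions: dict mapping region name -> list of D-dimensional token vectors.
    Assumed form only: no centering, no L2 normalization, and no
    cross-dimension (Gram) terms are applied anywhere.
    """
    # Sort by name so the fingerprint layout is the same for every sample.
    means = {name: region_mean(toks) for name, toks in sorted(regions.items())}
    fingerprint = []
    for a, b in combinations(means, 2):
        fingerprint.extend(x * y for x, y in zip(means[a], means[b]))
    return fingerprint

# Toy input: three regions with hypothetical 3-dimensional token features.
toy = {
    "eyes": [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    "mouth": [[2.0, 1.0, 0.0]],
    "nose": [[1.0, 1.0, 1.0]],
}
fp = dca_fingerprint(toy)
# fp has one entry per (region pair, dimension): 3 pairs * 3 dims = 9 values.
```

With three regions and D feature dimensions the fingerprint has 3·D entries, each tied to one dimension of the backbone's coordinate system. A linear probe such as logistic regression could then be trained on these vectors.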
Datasets
FaceForensics++ (FF++ c23), CelebDF-v2, DeepFakeDetection (DFD)
Model(s)
DINOv3 ViT-L/16 (frozen backbone), FaRL ViT-B/16 (for region assignment), RetinaFace (face detection), Logistic Regression (linear probe classifier)
Author countries
Canada