ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

Published: 2026-03-12 17:30:49+00:00

AI Summary

ForensicZip introduces a training-free framework for accelerating interpretable multimedia forensics in Multimodal Large Language Models (MLLMs) by reformulating visual token compression from a forgery-driven perspective. It addresses a key limitation of semantic-driven pruning, which often discards crucial manipulation traces, by modeling temporal token evolution as a Birth-Death Optimal Transport problem and integrating transport-based novelty with high-frequency priors. At 10% token retention, this approach achieves a 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities that indicate transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.


Key findings
ForensicZip achieved 2.97x speedup and over 90% FLOPs reduction at 10% visual token retention, maintaining state-of-the-art detection performance on deepfake and AIGC benchmarks. It demonstrated superior robustness compared to semantic-driven methods, which suffer catastrophic performance collapse at high compression ratios. The framework also reduced peak GPU memory consumption and surprisingly improved accuracy on object hallucination benchmarks by filtering out redundant semantic noise.
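As a back-of-the-envelope check on the FLOPs figure (our own estimate, not a calculation from the paper): in a transformer, self-attention cost grows quadratically and feed-forward cost linearly in the number of visual tokens $n$, so retaining a fraction $r$ of tokens scales the cost roughly as

$$\frac{\mathrm{FLOPs}(rn)}{\mathrm{FLOPs}(n)} \approx \frac{c_{\mathrm{attn}}\, r^2 n^2 + c_{\mathrm{ffn}}\, r n}{c_{\mathrm{attn}}\, n^2 + c_{\mathrm{ffn}}\, n}.$$

At $r = 0.1$ the quadratic term shrinks by 99% and the linear term by 90%, so an overall reduction above 90% is consistent with the reported number.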
Approach
ForensicZip is a training-free framework that reformulates visual token compression from a forgery-driven perspective. It uses Transport Novelty Estimation (TNE) to model temporal token evolution as an augmented Birth-Death Optimal Transport problem, quantifying physical discontinuities. This is integrated with Forensic Scoring (FS) which combines transport-based novelty with high-frequency priors to selectively preserve forensic evidence while discarding semantic content under aggressive compression.
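The paper's pipeline is not reproduced here; the following is a minimal, training-free sketch of the two scoring stages under our own simplifying assumptions. The Birth-Death Optimal Transport step is approximated by a linear assignment augmented with dummy "birth" rows (standing in for the paper's slack dummy node), and the high-frequency prior by DCT energy outside the low-frequency corner of each patch. All function names (`transport_novelty`, `highfreq_prior`, `forensic_keep`) and parameters (`birth_cost`, `alpha`) are our inventions, not the paper's.

```python
import numpy as np
from scipy.fft import dct
from scipy.optimize import linear_sum_assignment


def transport_novelty(prev_tokens, curr_tokens, birth_cost=0.5):
    """Birth-augmented assignment between consecutive frames' tokens.

    Each current token is matched either to a previous token (cosine
    distance cost) or to a dummy "birth" row with fixed cost
    `birth_cost`; the matched cost is its novelty. Tokens with no
    cheap temporal predecessor (transient artifacts) score highest.
    """
    p = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    c = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    cost = 1.0 - p @ c.T                           # (n_prev, n_curr)
    n_curr = c.shape[0]
    dummy = np.full((n_curr, n_curr), birth_cost)  # one birth slot per token
    aug = np.vstack([cost, dummy])
    rows, cols = linear_sum_assignment(aug)        # every column gets matched
    novelty = np.empty(n_curr)
    novelty[cols] = aug[rows, cols]
    return novelty


def highfreq_prior(patches):
    """High-frequency energy ratio of each image patch via 2-D DCT,
    excluding the low-frequency 2x2 corner (DC and near-DC terms)."""
    scores = []
    for patch in patches:
        coef = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
        total = np.sum(coef ** 2)
        low = np.sum(coef[:2, :2] ** 2)
        scores.append((total - low) / (total + 1e-8))
    return np.array(scores)


def forensic_keep(novelty, hf, keep_ratio=0.1, alpha=0.5):
    """Blend min-max-normalized novelty and high-frequency scores and
    return indices of the top `keep_ratio` fraction of tokens."""
    def mm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = alpha * mm(novelty) + (1.0 - alpha) * mm(hf)
    k = max(1, int(round(len(score) * keep_ratio)))
    return np.argsort(score)[-k:]
```

With `keep_ratio=0.1` this mirrors the paper's 10% retention operating point; the dummy rows let a token be "born" rather than force-matched to an unrelated predecessor, which is the intuition behind scoring temporal discontinuities.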
Datasets
FakeVLM (FakeClue, LOKI), DMimage, FakeShield-AIGC, SIDA, DD-VQA, FakeShield-DeepFake, DFFD, FakeShield-PhotoShop (aggregating CASIA), POPE
Model(s)
LLaVA-OneVision-7B, SIDA-7B/13B, FakeVLM, FakeShield
Author countries
China