Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Authors: Zeqian Long, Ozgur Kara, Haotian Xue, Yongxin Chen, James M. Rehg

Published: 2026-04-12 22:13:02+00:00

AI Summary

This paper introduces Immune2V, a novel framework designed to protect static images from unauthorized Image-to-Video (I2V) deepfake generation. It systematically analyzes why existing image-level defenses fail against modern dual-stream I2V models, identifying temporal perturbation attenuation and continuous text-conditioned guidance as key vulnerabilities. Immune2V addresses these by enforcing temporally balanced latent divergence and aligning intermediate generative representations with a precomputed collapse-inducing trajectory.

Abstract

Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.


Key findings
Immune2V substantially outperforms adapted image-level baselines, producing stronger and more persistent degradation in generated videos while keeping the perturbation on the immunized image imperceptible. Quantitative metrics and VLM-as-Judge evaluations confirm its effectiveness in disrupting video structural quality, subject consistency, motion smoothness, and text alignment. Ablation studies validate that both the spatial-temporal and semantic attack components are necessary for consistent structural collapse across the video.
Approach
The authors analyze dual-stream I2V models and identify two failure modes of naive image-level attacks: rapid dilution of the adversarial noise in the spatial-temporal stream, and text-conditioned guidance in the semantic stream that overrides the perturbation's disruptive effect. Immune2V counters both through a joint optimization framework combining a Spatial-Temporal Attack (a temporally balanced VAE-level loss that prevents the attack signal from fading across frames) and a Semantic Attack (redirecting denoising trajectories toward collapse-inducing dynamics by manipulating DiT representations).
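The paper's exact loss is not reproduced in this summary. As an illustrative sketch only, one way to read "temporally balanced latent divergence" is an objective that rewards large average divergence between adversarial and clean video latents while penalizing imbalance across frames, so the perturbation cannot concentrate on the first frame and dilute over time. The function name, tensor layout, and the mean-minus-deviation formulation below are all assumptions, not the authors' method:

```python
import numpy as np

def temporally_balanced_divergence(z_adv, z_clean, lam=1.0):
    """Sketch of a temporally balanced divergence objective (assumed form).

    z_adv, z_clean: arrays of shape (T, ...) holding per-frame VAE latents
    of the adversarial and clean videos (layout is an assumption).
    Returns a scalar to MAXIMIZE: large mean per-frame divergence,
    penalized by its spread across frames so every frame stays perturbed.
    """
    T = z_adv.shape[0]
    # per-frame squared L2 divergence, shape (T,)
    d = ((z_adv - z_clean) ** 2).reshape(T, -1).sum(axis=1)
    # reward average divergence, penalize temporal imbalance
    return d.mean() - lam * d.std()
```

Under this toy objective, a perturbation spread evenly over all frames scores higher than one of equal total magnitude concentrated on the first frame, which matches the paper's stated goal of preventing signal dilution across future frames.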
Datasets
DAVIS dataset
Model(s)
UNKNOWN
Author countries
United States