EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

Yihan Hu1          Xuelin Chen2         Xiaodong Cun1,*
1 GVC Lab, Great Bay University       2 Adobe Research
TL;DR: EasyOmnimatte splits videos into layers (together with their associated effects) using a dual-expert diffusion model, without any post-hoc optimization.

Abstract

Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture the associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn two complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a Quality Expert, where LoRA is applied to all blocks, which learns to refine the alpha matte. During sampling, the Effect Expert denoises at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state of the art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
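The expert construction above can be sketched in a few lines. This is a minimal, hypothetical illustration: the block count, the set of effect-sensitive block indices, and the `LoRAAdapter` class are placeholders for illustration only, not values or APIs from the paper.

```python
# Hypothetical sketch of the two LoRA placements described in the abstract:
# the Effect Expert adapts only effect-sensitive DiT blocks, while the
# Quality Expert adapts every block. Indices and classes are illustrative.

class LoRAAdapter:
    """Stand-in for a low-rank adapter attached to one DiT block."""
    def __init__(self, rank=4):
        self.rank = rank

def build_experts(num_blocks=28, effect_sensitive=(0, 1, 2, 3)):
    # Effect Expert: LoRA only on the (assumed) effect-sensitive blocks.
    effect_expert = {i: LoRAAdapter() for i in effect_sensitive}
    # Quality Expert: LoRA on all blocks, as in the naive finetuning setup.
    quality_expert = {i: LoRAAdapter() for i in range(num_blocks)}
    return effect_expert, quality_expert
```

The pretrained inpainting weights are never modified; each expert is just a different set of adapters layered on top of the frozen base model.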

Method

a) We branch LoRA-finetuned blocks off the original inpainting DiT blocks to predict the alpha matte jointly with the pretrained model. In each Branch DiT Block, LoRA is applied only to the duplicated set of input tokens, leaving the original inpainting branch unaffected. b) During sampling, the Effect Expert is employed only at early, high-noise stages to generate coarse, effect-aware omnimatte predictions, while the Quality Expert refines the alpha matte only at later, low-noise stages. This alternating strategy achieves high-quality results at greatly reduced compute cost, compared to sampling each expert over the full trajectory.
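The alternating schedule in b) can be written as a simple loop over the denoising timesteps. The sketch below is a pure-Python illustration with dummy denoisers; the 50% switch point and all function names are assumptions, not values reported by the method.

```python
# Minimal sketch of the Dual-Expert sampling schedule: the Effect Expert
# handles early, high-noise steps and the Quality Expert handles later,
# low-noise steps. The switch ratio is an assumed placeholder.

def dual_expert_sample(x, timesteps, effect_expert, quality_expert,
                       switch_ratio=0.5):
    switch_idx = int(len(timesteps) * switch_ratio)
    used = []
    for i, t in enumerate(timesteps):  # ordered from high to low noise
        expert = effect_expert if i < switch_idx else quality_expert
        x = expert(x, t)  # one denoising step under the chosen expert
        used.append(expert.__name__)
    return x, used

# Dummy denoisers standing in for the two LoRA-finetuned models.
def effect_expert(x, t):   # coarse, effect-aware prediction
    return x - 0.1 * t

def quality_expert(x, t):  # alpha-matte refinement
    return x - 0.01 * t
```

Because each step runs exactly one expert, the full trajectory costs the same as a single diffusion pass, rather than one pass per expert.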

Visual Comparisons


Applications

We showcase several applications enabled by video layering, e.g., editing a specific video layer.

Ablation Study

Effectiveness of different attention masking strategies.

Ablation of Branch DiT Placement & Dual Expert Strategy.

Citation