We compare our method with existing omnimatte methods (Omnimatte, Omnimatte3D, OmnimatteRF, and FactorMatte). Existing methods rely on restrictive motion assumptions, such as a stationary background, so dynamic background elements become entangled with the foreground object layers. Omnimatte3D and OmnimatteRF may also produce blurry background layers (e.g., horses) because their 3D-aware background representations are sensitive to the quality of camera pose estimation. Furthermore, these methods lack a generative and semantic prior for completing occluded pixels and accurately associating effects with their corresponding objects.
We compare our object-effect-removal model, Casper, with existing methods for object removal. Video inpainting models (ProPainter and Lumiere-Inpainting) fail to remove soft shadows and reflections outside the input masks. ObjectDrop is an image-based model and therefore processes each video frame independently, inpainting regions without global context or temporal consistency. We use the same mask dilation ratio for all methods.
Given an input video and binary object masks, we first apply our object-effect-removal model, Casper, to generate a clean-plate background video and a set of single-object (solo) videos by applying different trimask conditions. The trimasks specify regions to preserve (white), regions to remove (black), and regions that may contain uncertain object effects (gray). In Stage 2, a test-time optimization reconstructs the omnimatte layers O_i from pairs of solo and background videos.
We use different trimask conditions for an input video to obtain a set of single-object (solo) videos and a clean-plate background video (bottom row). Note that we do not cherry-pick the random seeds for the Casper model; we use the same random seed (0) for all input videos.
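For concreteness, a minimal sketch of this two-stage procedure is shown below. The helper names (build_trimasks, casper_model, optimize_omnimatte) are hypothetical placeholders rather than our actual interface; only the trimask convention, the fixed random seed, and the two-stage structure follow the description above.

```python
import torch

def run_pipeline(video, object_masks, casper_model, optimize_omnimatte, build_trimasks):
    """Hypothetical driver for the two-stage pipeline (all helper names are placeholders).

    Stage 1: run Casper once per trimask condition to obtain solo videos and a
    clean-plate background. Stage 2: optimize one omnimatte layer O_i per
    (solo video, background) pair.
    """
    # One solo trimask per object (preserve it, remove the others),
    # plus a final clean-plate trimask (remove all objects).
    trimasks = build_trimasks(object_masks)

    outputs = []
    for trimask in trimasks:
        torch.manual_seed(0)                      # same random seed for every input
        outputs.append(casper_model(video, trimask))

    *solo_videos, clean_plate = outputs           # last output is the clean plate
    layers = [optimize_omnimatte(solo, clean_plate) for solo in solo_videos]
    return layers, clean_plate
```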
(i) Omnimatte: We collect omnimatte results from existing omnimatte methods (Omnimatte, Omnimatte3D, and OmnimatteRF) to provide examples of cause-and-effect relationships in real videos.
(ii) Tripod: The Tripod dataset consists of videos captured with stationary cameras, providing pseudo-examples of more complex real-world scenarios, such as water effects and dynamic backgrounds.
(iii) Kubric: We use Kubric to synthesize multi-object scenes with diverse reflections and shadows. We observe that many real-world scenarios contain multiple instances of the same object type, such as dogs, pedestrians, or vehicles.
Therefore, we generate scenes with duplicated objects to train the model to handle multiple similar objects.
(iv) Object-Paste: We segment objects from real videos and paste them onto target real videos to strengthen the model’s inpainting capabilities and background preservation.
For the synthesized Kubric and Object-Paste data, we randomly swap the gray and white labels to encourage the model to learn both background preservation and inpainting for gray-labeled regions.
The training data is augmented through horizontal and temporal flipping, as well as random cropping.
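For concreteness, a minimal sketch of these augmentations is shown below, assuming trimasks are stored as uint8 arrays with black = 0, gray = 128, and white = 255; the label encoding, tensor layout, and crop size are assumptions for illustration rather than our exact implementation.

```python
import numpy as np

BLACK, GRAY, WHITE = 0, 128, 255  # assumed trimask label encoding

def augment(frames, trimask, is_synthetic, rng):
    """frames: (T, H, W, C) uint8, trimask: (T, H, W) uint8 (assumed layout)."""
    # Randomly swap gray and white labels for Kubric / Object-Paste samples, so
    # gray-labeled regions are sometimes preserved and sometimes inpainted.
    if is_synthetic and rng.random() < 0.5:
        gray, white = trimask == GRAY, trimask == WHITE
        trimask = trimask.copy()
        trimask[gray], trimask[white] = WHITE, GRAY

    # Horizontal flip.
    if rng.random() < 0.5:
        frames, trimask = frames[:, :, ::-1], trimask[:, :, ::-1]

    # Temporal flip.
    if rng.random() < 0.5:
        frames, trimask = frames[::-1], trimask[::-1]

    # Random spatial crop (80% crop size is an assumption).
    T, H, W, _ = frames.shape
    ch, cw = int(0.8 * H), int(0.8 * W)
    y, x = rng.integers(0, H - ch + 1), rng.integers(0, W - cw + 1)
    frames = frames[:, y:y + ch, x:x + cw]
    trimask = trimask[:, y:y + ch, x:x + cw]
    return frames, trimask
```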
We assess the individual contribution of each dataset category to our model's performance by incrementally adding each category to the training set. While the Omnimatte data provides basic examples of shadows in real-world videos, it primarily features static backgrounds and single objects. The Tripod data adds real-world scenarios that help the model better handle water effects, such as reflections and boat wakes. Our Kubric synthetic data strengthens the model's ability to handle multi-object scenes. Finally, the Object-Paste data reduces undesired background changes and improves inpainting quality.
Our proposed trimask explicitly defines the regions to be removed or preserved, thereby enabling more accurate handling of multi-object scenarios. In contrast, the model trained on binary masks is susceptible to ambiguity, potentially leading to undesired removal of objects meant to be preserved.
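As an illustration of this convention, the sketch below assembles a trimask from per-object binary masks: objects to preserve become white, objects to remove become black, and all remaining pixels become gray, since they may contain effects of either kind. The uint8 encoding is an assumption; only the label semantics follow the description above.

```python
import numpy as np

BLACK, GRAY, WHITE = 0, 128, 255  # assumed trimask label encoding

def make_trimask(object_masks, keep_ids):
    """Build a trimask from per-object binary masks of shape (T, H, W), dtype bool.

    Objects in `keep_ids` are marked for preservation (white), all other objects
    for removal (black), and the remaining pixels are gray, since they may
    contain effects of either the preserved or the removed objects.
    """
    trimask = np.full(object_masks[0].shape, GRAY, dtype=np.uint8)
    for i, mask in enumerate(object_masks):
        trimask[mask] = WHITE if i in keep_ids else BLACK
    return trimask

# Solo trimask for object 0 and the clean-plate trimask:
# solo_0      = make_trimask(masks, keep_ids={0})
# clean_plate = make_trimask(masks, keep_ids=set())
```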
While our method addresses the limitations of existing approaches, it still has certain limitations.
(i) Omnimatte layers handle color blending but are not designed to capture shape deformations as object effects.
Because of this limitation, we focused on training Casper to remove color-related effects (i.e., shadows and reflections),
so it may struggle to remove physical interactions with the environment, as observed in the trampoline and dog-agility examples.
(ii) The removal model may not always produce the desired outcome, particularly in challenging multi-object cases (e.g., the five-beagles and bowling examples).
(iii) In the dog-crosswalk example, the smaller dog is initially invisible, which makes it challenging for our Casper model to complete the dog without reference to earlier frames. The reflection effects on the crosswalk and road further complicate the removal process: Casper removes the person's reflection but leaves the larger dog's reflection on the road.
We observe some cases where Casper associates unrelated dynamic background effects with a foreground layer, such as the waves in the example below. To mitigate this, our system allows the user to modify the trimask by specifying a coarse preservation region to better preserve the background waves.
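A minimal sketch of such an edit, assuming the same uint8 trimask encoding as above: gray (uncertain) pixels inside a coarse user-drawn region are relabeled as white so that Casper preserves them, while existing removal labels stay untouched.

```python
import numpy as np

GRAY, WHITE = 128, 255  # assumed trimask label encoding

def add_preservation_region(trimask, user_region):
    """Relabel gray pixels inside a coarse user-drawn region as 'preserve'.

    trimask:     (T, H, W) uint8 trimask.
    user_region: (T, H, W) bool mask, e.g., a rough box over the waves.
    Only gray (uncertain) pixels are upgraded to white, so black (removal)
    labels are left untouched.
    """
    trimask = trimask.copy()
    trimask[user_region & (trimask == GRAY)] = WHITE
    return trimask
```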
To investigate the inherent understanding of object-effect associations in the text-to-video (T2V) Lumiere generation model,
we analyze its self-attention patterns during the denoising process using SDEdit.
We hypothesize that the T2V model possesses an intrinsic understanding of effect associations, allowing us to train an effective object-effect-removal model with a relatively small dataset.
We further compare the attention behaviors of the original T2V model, the Lumiere-Inpainting model, and our Casper model, which is sequentially fine-tuned from the T2V model.
To ensure accurate attention measurement, we do not dilate the input mask conditions for either the Inpainting or the Casper model.
The visualized value of each pixel indicates the strength of association between its query token and the key tokens in the target object mask region.
We visualize the first, middle, and final attention blocks of the U-Net architecture at the sampling step t=0.125.
For a detailed description of the attention visualization metric, please refer to Section 3.3 of our main paper.
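As a rough illustration, such a map could be computed from a block's self-attention weights as sketched below; averaging over heads and summing the attention mass over the masked key tokens are assumptions here, and the exact metric is the one defined in Section 3.3 of the main paper.

```python
import torch

def object_attention_map(attn, object_mask, spatial_shape):
    """Per-query attention mass directed at the target object region.

    attn:          (heads, Q, K) softmaxed self-attention weights of one block.
    object_mask:   (K,) bool mask of key tokens inside the target object.
    spatial_shape: (H, W) token-grid resolution of this block, with Q = H * W.
    Head averaging and summing over masked keys are assumptions, not the
    exact metric of the paper.
    """
    attn = attn.mean(dim=0)                    # average over attention heads
    scores = attn[:, object_mask].sum(dim=-1)  # attention mass into the mask region
    return scores.reshape(spatial_shape)       # per-pixel map for display
```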
We observe that the T2V model's object query tokens exhibit a strong focus on the object itself, as its primary task is to generate the object and its effects.
This tendency may also be present in the Inpainting model when it attempts to fill the mask region with another object to justify shadows.
In contrast, Casper's object query tokens show less self-attention and more attention to the background region, suggesting a focus on background completion rather than object and effect generation.
In multi-object scenarios (boys-beach, five-beagles), the T2V and Inpainting models may associate other, similar objects with the target object.
Our Casper model, however, demonstrates a lower attention response (darker) to similar objects, indicating a stronger ability to isolate individual objects.
We also analyze the attention patterns of the failure case, five-beagles, where our Casper model does not completely remove the corresponding shadow.
We hypothesize that the effect association is already weak in the T2V model, and our Casper model, inheriting knowledge from the pretrained models, struggles to handle such challenging cases.
@article{generative-omnimatte,
author = {Lee, Yao-Chih and Lu, Erika and Rumbley, Sarah and Geyer, Michal and Huang, Jia-Bin and Dekel, Tali and Cole, Forrester},
title = {Generative Omnimatte: Learning to Decompose Video into Layers},
journal = {arXiv preprint arXiv:2411.16683},
year = {2024},
}