We compare our method with existing omnimatte methods (Omnimatte, Omnimatte3D, OmnimatteRF, and FactorMatte). Existing methods rely on restrictive motion assumptions, such as a stationary background, so dynamic background elements become entangled with the foreground object layers. Omnimatte3D and OmnimatteRF may also produce blurry background layers (e.g., horses) because their 3D-aware background representations are sensitive to the quality of camera pose estimation. Furthermore, these methods lack a generative and semantic prior for completing occluded pixels and accurately associating effects with their corresponding objects.
We compare our object-effect-removal model, Casper, with existing methods for object removal. Video inpainting models (ProPainter and Lumiere-Inpainting) fail to remove soft shadows and reflections outside the input masks. ObjectDrop is an image-based model and therefore processes each video frame independently, inpainting regions without global context or temporal consistency. We use the same mask dilation ratio for all methods.
Given an input video and binary object masks, we first apply our object-effect-removal model, Casper, to generate a clean-plate background video and a set of single-object (solo) videos by applying different trimask conditions. The trimasks specify regions to preserve (white), regions to remove (black), and regions that may contain uncertain object effects (gray). In Stage 2, a test-time optimization reconstructs the omnimatte layers O_i from pairs of solo and background videos.
We use different trimask conditions for an input video to obtain a set of single-object (solo) videos and a clean-plate background video (bottom row). Note that we do not cherry-pick the random seeds for the Casper model; we use the same random seed (0) for all input videos.
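For concreteness, a minimal sketch of this two-stage procedure is shown below. The helper names (build_trimasks, casper_model, optimize_omnimatte) are hypothetical placeholders rather than our actual interface; only the trimask convention, the fixed random seed, and the two-stage structure follow the description above.

```python
import torch

def run_pipeline(video, object_masks, casper_model, optimize_omnimatte, build_trimasks):
    """Hypothetical driver for the two-stage pipeline (all helper names are placeholders).

    Stage 1: run Casper once per trimask condition to obtain solo videos and a
    clean-plate background. Stage 2: optimize one omnimatte layer O_i per
    (solo video, background) pair.
    """
    # One solo trimask per object (preserve it, remove the others),
    # plus a final clean-plate trimask (remove all objects).
    trimasks = build_trimasks(object_masks)

    outputs = []
    for trimask in trimasks:
        torch.manual_seed(0)                      # same random seed for every input
        outputs.append(casper_model(video, trimask))

    *solo_videos, clean_plate = outputs           # last output is the clean plate
    layers = [optimize_omnimatte(solo, clean_plate) for solo in solo_videos]
    return layers, clean_plate
```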
(i) Omnimatte: We collect omnimatte results from existing omnimatte methods (Omnimatte, Omnimatte3D, and OmnimatteRF) to provide examples of cause-and-effect relationships in real videos.
(ii) Tripod: The Tripod dataset consists of videos captured with stationary cameras, providing pseudo-examples of more complex real-world scenarios, such as water effects and dynamic backgrounds.
(iii) Kubric: We use Kubric to synthesize multi-object scenes with diverse reflections and shadows. We observe that many real-world scenarios contain multiple instances of the same object type, such as dogs, pedestrians, or vehicles.
Therefore, we generate scenes with duplicated objects to train the model to handle multiple similar objects.
(iv) Object-Paste: We segment objects from real videos and paste them onto target real videos to strengthen the model’s inpainting capabilities and background preservation.
For the synthesized Kubric and Object-Paste data, we randomly swap the gray and white labels to encourage the model to learn both background preservation and inpainting for gray-labeled regions.
The training data is augmented through horizontal and temporal flipping, as well as random cropping.
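For concreteness, a minimal sketch of these augmentations is shown below, assuming trimasks are stored as uint8 arrays with black = 0, gray = 128, and white = 255; the label encoding, tensor layout, and crop size are assumptions for illustration rather than our exact implementation.

```python
import numpy as np

BLACK, GRAY, WHITE = 0, 128, 255  # assumed trimask label encoding

def augment(frames, trimask, is_synthetic, rng):
    """frames: (T, H, W, C) uint8, trimask: (T, H, W) uint8 (assumed layout)."""
    # Randomly swap gray and white labels for Kubric / Object-Paste samples, so
    # gray-labeled regions are sometimes preserved and sometimes inpainted.
    if is_synthetic and rng.random() < 0.5:
        gray, white = trimask == GRAY, trimask == WHITE
        trimask = trimask.copy()
        trimask[gray], trimask[white] = WHITE, GRAY

    # Horizontal flip.
    if rng.random() < 0.5:
        frames, trimask = frames[:, :, ::-1], trimask[:, :, ::-1]

    # Temporal flip.
    if rng.random() < 0.5:
        frames, trimask = frames[::-1], trimask[::-1]

    # Random spatial crop (80% crop size is an assumption).
    T, H, W, _ = frames.shape
    ch, cw = int(0.8 * H), int(0.8 * W)
    y, x = rng.integers(0, H - ch + 1), rng.integers(0, W - cw + 1)
    frames = frames[:, y:y + ch, x:x + cw]
    trimask = trimask[:, y:y + ch, x:x + cw]
    return frames, trimask
```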
We assess the individual contribution of each dataset category to our model's performance by incrementally adding each category to the training set. While the Omnimatte data provides basic examples of shadows in real-world videos, it primarily features static backgrounds and single objects. The Tripod data adds real-world scenarios that help the model better handle water effects, such as reflections and boat wakes. Our Kubric synthetic data strengthens the model's ability to handle multi-object scenes. Finally, the Object-Paste data reduces undesired background changes and improves inpainting quality.
Our proposed trimask explicitly defines the regions to be removed or preserved, thereby enabling more accurate handling of multi-object scenarios. In contrast, the model trained on binary masks is susceptible to ambiguity, potentially leading to undesired removal of objects meant to be preserved.
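As an illustration of this convention, the sketch below assembles a trimask from per-object binary masks: objects to preserve become white, objects to remove become black, and all remaining pixels become gray, since they may contain effects of either kind. The uint8 encoding is an assumption; only the label semantics follow the description above.

```python
import numpy as np

BLACK, GRAY, WHITE = 0, 128, 255  # assumed trimask label encoding

def make_trimask(object_masks, keep_ids):
    """Build a trimask from per-object binary masks of shape (T, H, W), dtype bool.

    Objects in `keep_ids` are marked for preservation (white), all other objects
    for removal (black), and the remaining pixels are gray, since they may
    contain effects of either the preserved or the removed objects.
    """
    trimask = np.full(object_masks[0].shape, GRAY, dtype=np.uint8)
    for i, mask in enumerate(object_masks):
        trimask[mask] = WHITE if i in keep_ids else BLACK
    return trimask

# Solo trimask for object 0 and the clean-plate trimask:
# solo_0      = make_trimask(masks, keep_ids={0})
# clean_plate = make_trimask(masks, keep_ids=set())
```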
While our method addresses the limitations of existing approaches, it still has certain limitations.
(i) Omnimatte layers handle color blending but are not designed to capture shape deformations as object effects.
Because of this limitation, we focused on training Casper to remove color-related effects (i.e., shadows and reflections),
so it may struggle to remove physical interactions with the environment, as observed in the trampoline and dog-agility examples.
(ii) The removal model may not always produce the desired outcome, particularly in challenging multi-object cases (e.g., the five-beagles and bowling examples).
(iii) In the dog-crosswalk example, the smaller dog is initially invisible, which makes it challenging for our Casper model to complete the dog without reference to earlier frames. The reflection effects on the crosswalk and road further complicate the removal process: Casper removes the person's reflection but leaves the larger dog's reflection on the road.
We observe some cases where Casper associates unrelated dynamic background effects with a foreground layer, such as the waves in the example below. To mitigate this, our system allows the user to modify the trimask by specifying a coarse preservation region to better preserve the background waves.
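A minimal sketch of such an edit, assuming the same uint8 trimask encoding as above: gray (uncertain) pixels inside a coarse user-drawn region are relabeled as white so that Casper preserves them, while existing removal labels stay untouched.

```python
import numpy as np

GRAY, WHITE = 128, 255  # assumed trimask label encoding

def add_preservation_region(trimask, user_region):
    """Relabel gray pixels inside a coarse user-drawn region as 'preserve'.

    trimask:     (T, H, W) uint8 trimask.
    user_region: (T, H, W) bool mask, e.g., a rough box over the waves.
    Only gray (uncertain) pixels are upgraded to white, so black (removal)
    labels are left untouched.
    """
    trimask = trimask.copy()
    trimask[user_region & (trimask == GRAY)] = WHITE
    return trimask
```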
To investigate the inherent understanding of object-effect associations in the text-to-video (T2V) Lumiere generation model,
we analyze its self-attention patterns during the denoising process using SDEdit.
We hypothesize that the T2V model possesses an intrinsic understanding of effect associations, allowing us to train an effective object-effect-removal model with a relatively small dataset.
We further compare the attention behaviors of the original T2V model, the Lumiere-Inpainting model, and our Casper model, which is sequentially fine-tuned from the T2V model.
To ensure accurate attention measurement, we do not dilate the input mask conditions for either the Inpainting or the Casper model.
The visualized value of each pixel indicates the strength of association between its query token and the key tokens in the target object mask region.
We visualize the first, middle, and final attention blocks of the U-Net architecture at the sampling step t=0.125.
For a detailed description of the attention visualization metric, please refer to Section 3.3 of our main paper.
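As a rough illustration, such a map could be computed from a block's self-attention weights as sketched below; averaging over heads and summing the attention mass over the masked key tokens are assumptions here, and the exact metric is the one defined in Section 3.3 of the main paper.

```python
import torch

def object_attention_map(attn, object_mask, spatial_shape):
    """Per-query attention mass directed at the target object region.

    attn:          (heads, Q, K) softmaxed self-attention weights of one block.
    object_mask:   (K,) bool mask of key tokens inside the target object.
    spatial_shape: (H, W) token-grid resolution of this block, with Q = H * W.
    Head averaging and summing over masked keys are assumptions, not the
    exact metric of the paper.
    """
    attn = attn.mean(dim=0)                    # average over attention heads
    scores = attn[:, object_mask].sum(dim=-1)  # attention mass into the mask region
    return scores.reshape(spatial_shape)       # per-pixel map for display
```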
We observe that the T2V model's object query tokens exhibit a strong focus on the object itself, as its primary task is to generate the object and its effects.
This tendency may also be present in the Inpainting model when it attempts to fill the mask region with another object to justify shadows.
In contrast, Casper's object query tokens show less self-attention and more attention to the background region, suggesting a focus on background completion rather than object and effect generation.
In multi-object scenarios (boys-beach, five-beagles), the T2V and Inpainting models may associate other, similar objects with the target object.
Our Casper model, however, demonstrates a lower attention response (darker) to similar objects, indicating a stronger ability to isolate individual objects.
We also analyze the attention patterns of the failure case, five-beagles, where our Casper model does not completely remove the corresponding shadow.
We hypothesize that the effect association is already weak in the T2V model, and our Casper model, inheriting knowledge from the pretrained models, struggles to handle such challenging cases.
@article{generative-omnimatte,
author = {Lee, Yao-Chih and Lu, Erika and Rumbley, Sarah and Geyer, Michal and Huang, Jia-Bin and Dekel, Tali and Cole, Forrester},
title = {Generative Omnimatte: Learning to Decompose Video into Layers},
journal = {arXiv preprint arXiv:2411.16683},
year = {2024},
}