Abstract

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as “particles” in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.

4K+ Resolution Image Editing

CSD-Edit facilitates high-quality editing of images with a resolution beyond 4K (3991x4395) in accordance with the provided prompts.

For a detailed view of the edits, hover your cursor over the image to zoom in.

"Re-imagine people are in galaxy"

Compositional Editing of Panorama Images

CSD-Edit demonstrates consistent and coherent editing across patches in panorama images. Thus, it provides the unique ability to manipulate each patch according to different instructions, while maintaining overall structure of the source image coherently and ensuring smooth transition between patches with different instructions.

Original image

Object Editing of Panorama Images

CSD-Edit demonstrates consistent and coherent editing of objects in the source panorama image.

Original image

We compare our method, CSD-Edit, with existing zero-shot video editing schemes that employ text-to-image diffusion models, including FateZero, Pix2Video. Additionally, we compare with Gen-1, a video diffusion model trained using a large video dataset.

Compared to these baselines, CSD-Edit offers three distinct advantages: 1) It facilitates the consistent editing of source videos, ensuring a smooth transition between frames by providing coherent content throughout. Also, 2) it showcases high-quality video edits that properly reflect the given instructions. Furthermore, 3) CSD-Edit effectively preserves the background (or irrelevant regions to the given instruction), thereby maintaining the overall context of the video.

Comparison to Baselines

"Make it spring"

"Give him a yellow T-shirt""

3D Scene Editing Comparison to Baselines

We compare our method, CSD-Edit, with the existing 3D scene editing scheme, Instruct-NeRF2NeRF. As CSD-Edit enables consistent editing of multi-view images, it delivers clear, high-quality edits without creating blurred artifacts (e.g., graphics and anime). Moreover, our method allows for precise control over source 3D scenes (e.g., smile) without causing substantial changes to other parts of the image.

Source

Instruct-NeRF2NeRF

CSD-Edit (Ours)

"Re-imagine him as a glowing colorful vaporwave 3D low-poly graphic object"

Text-to-3D Synthesis Comparison to Baselines

We compare our method, CSD, with Score Distillation Sampling (SDS), introduced in DreamFusion. Considering the scores of multi-view samples, CSD provides three notable advantages: 1) As shown in the first row, CSD excels at capturing coherent geometry, outperforming SDS in this aspect. 2) As illustrated in the second row, CSD allows for the learning of finer details compared to SDS. 3) Building on the second advantage, and as displayed in the third row, CSD can produce diverse, high-quality samples without changing random seeds

Note: If some visuals are not displaying correctly, please try refreshing the page

Best viewed on a monitor and with a Chrome browser

TL; DR: Consistent zero-shot visual editing across various and complex visual modalities

Abstract

Press the button located below each visual to view the changes and enjoy the visuals

Spatial Consistency: Image Editing Beyond 512x512

4K+ Resolution Image Editing

For a detailed view of the edits, hover your cursor over the image to zoom in.

Compositional Editing of Panorama Images

Object Editing of Panorama Images

Temporal Consistency: Video Editing

Comparison to Baselines

View Consisteny: 3D Scene Synthesis

3D Scene Editing Comparison to Baselines

Text-to-3D Synthesis Comparison to Baselines

Acknowledgements

Note: If some visuals are not displaying correctly, please try refreshing the page Best viewed on a monitor and with a Chrome browser

TL; DR: Consistent zero-shot visual editing across various and complex visual modalities

Abstract

Press the button located below each visual to view the changes and enjoy the visuals

Spatial Consistency: Image Editing Beyond 512x512

4K+ Resolution Image Editing

For a detailed view of the edits, hover your cursor over the image to zoom in.

Compositional Editing of Panorama Images

Object Editing of Panorama Images

Temporal Consistency: Video Editing

Comparison to Baselines

View Consisteny: 3D Scene Synthesis

3D Scene Editing Comparison to Baselines

Text-to-3D Synthesis Comparison to Baselines

Acknowledgements

Note: If some visuals are not displaying correctly, please try refreshing the page

Best viewed on a monitor and with a Chrome browser