TL;DR: Consistent zero-shot visual editing across diverse and complex visual modalities
Generative priors of large-scale text-to-image diffusion models enable a wide range of
new generation and editing applications on diverse visual modalities.
However, when adapting these priors to complex visual modalities, often represented as
multiple images (e.g., video), achieving consistency across a set of images is challenging.
In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD).
CSD is based on Stein Variational Gradient Descent (SVGD).
Specifically, we propose to consider multiple samples as “particles” in the SVGD update
and combine their score functions to distill generative priors over a set of images synchronously.
Thus, CSD facilitates seamless integration of information across 2D images,
leading to a consistent visual synthesis across multiple samples.
We show the effectiveness of CSD in a variety of tasks,
encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method
for enhancing inter-sample consistency,
thereby broadening the applicability of text-to-image diffusion models.
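To make the "particles" view above more concrete, here is a minimal sketch of a plain SVGD step in Python. It is not the CSD implementation: a toy Gaussian score stands in for the score a diffusion model would provide, and the names `rbf_kernel`, `gaussian_score`, and `svgd_step`, the RBF kernel, and the bandwidth and step-size values are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / bandwidth) (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / bandwidth)


def gaussian_score(x):
    """Toy stand-in score: grad log p(x) = -x for a standard normal target."""
    return [-a for a in x]


def svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0):
    """Apply one SVGD update to every particle and return the new particles.

    Each particle moves along the kernel-weighted average of all particles'
    scores (shared information) plus a kernel-gradient term that pushes
    particles apart so they do not collapse onto one point.
    """
    n = len(particles)
    dim = len(particles[0])
    scores = [score_fn(p) for p in particles]
    updated = []
    for x_i in particles:
        phi = [0.0] * dim
        for x_j, s_j in zip(particles, scores):
            k = rbf_kernel(x_j, x_i, bandwidth)
            for d in range(dim):
                attract = k * s_j[d]  # score of x_j, weighted by similarity to x_i
                repulse = (2.0 / bandwidth) * (x_i[d] - x_j[d]) * k  # grad of k w.r.t. x_j
                phi[d] += (attract + repulse) / n
        updated.append([x + step_size * p for x, p in zip(x_i, phi)])
    return updated
```

Calling `svgd_step` repeatedly on a few nearby starting points moves them toward the high-density region of the toy target while keeping them spread out. In the setting described above, each particle would correspond to one image in the set, and the combined update is what lets the samples share information.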
"Re-imagine people are in galaxy"
(Left) Instruct-Pix2Pix, when applied to an image downscaled to 512x512, produces low-quality results and loses many details after editing.
(Center) Our method, CSD-Edit, provides consistent image editing between patches, achieving the best fidelity to the given instruction.
(Right) Instruct-Pix2Pix, when used on cropped patches, results in inconsistent image editing across patches.
Spatial Consistency: Image Editing Beyond 512x512
4K+ Resolution Image Editing
CSD-Edit facilitates high-quality editing of images with a resolution beyond 4K (3991x4395) in
accordance with the provided prompts.
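Editing an image of this size with a model that operates at 512x512 means working on patches. The sketch below shows one way to enumerate patch boxes that cover a large image. The page does not say how patches are chosen, so the overlapping stride, the edge-snapping rule, and the names `axis_starts` and `patch_boxes` are assumptions for illustration only.

```python
def axis_starts(length, patch, stride):
    """Start offsets along one axis so that patches cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # snap the last patch to the image border
    return starts


def patch_boxes(width, height, patch=512, stride=384):
    """Return (left, top, right, bottom) boxes tiling a width x height image.

    stride < patch gives overlapping patches; stride must not exceed patch,
    otherwise gaps would appear between neighbouring patches.
    """
    if not 0 < stride <= patch:
        raise ValueError("stride must be in the range 1..patch")
    boxes = []
    for top in axis_starts(height, patch, stride):
        for left in axis_starts(width, patch, stride):
            boxes.append((left, top, left + patch, top + patch))
    return boxes


# Example with the resolution quoted above (width/height order is assumed).
boxes = patch_boxes(3991, 4395)
```

Each box can be cropped from the source image and treated as one sample in the set. Editing such patches independently is what produces the inconsistencies shown in the (Right) comparison, which is why the patches need to be edited jointly.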
"Re-imagine people are in galaxy"
Compositional Editing of Panorama Images
CSD-Edit demonstrates consistent and coherent editing across patches in panorama images.
Thus, it provides the unique ability to manipulate each patch according to different instructions
while coherently maintaining the overall structure of the source image
and ensuring smooth transitions between patches with different instructions.
Object Editing of Panorama Images
CSD-Edit demonstrates consistent and coherent editing of objects in the source panorama image.
Temporal Consistency: Video Editing
We compare our method, CSD-Edit, with existing zero-shot video editing schemes that employ
text-to-image diffusion models.
Additionally, we compare with Gen-1,
a video diffusion model trained using a large video dataset.
Compared to these baselines, CSD-Edit offers three distinct advantages:
1) It edits source videos consistently,
ensuring smooth transitions between frames by keeping content coherent throughout.
2) It produces high-quality video edits that properly reflect the given instructions.
3) It effectively preserves the background (and other regions irrelevant to the given instruction),
thereby maintaining the overall context of the video.
Comparison to Baselines
"Make it spring"
"Give him a yellow T-shirt"
View Consistency: 3D Scene Synthesis
3D Scene Editing Comparison to Baselines
We compare our method, CSD-Edit, with the existing 3D scene editing scheme,
As CSD-Edit enables consistent editing of multi-view images, it delivers clear,
high-quality edits without creating blurred artifacts (e.g., graphics and anime).
Moreover, our method allows for precise control over source 3D scenes (e.g., smile)
without causing substantial changes to other parts of the image.
"Re-imagine him as a glowing colorful vaporwave 3D low-poly graphic object"
Text-to-3D Synthesis Comparison to Baselines
We compare our method, CSD, with Score Distillation Sampling (SDS), introduced in
Considering the scores of multi-view samples, CSD provides three notable advantages:
1) As shown in the first row, CSD excels at capturing coherent geometry, outperforming SDS in this respect.
2) As illustrated in the second row, CSD allows for the learning of finer details compared to SDS.
3) Building on the second advantage, and as displayed in the third row, CSD can produce diverse,
high-quality samples without changing random seeds.