1 KAIST   2 Google Research   * Denotes equal contribution
Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as “particles” in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
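To make the particle-based update concrete, below is a minimal PyTorch sketch of an SVGD-style step in which each image (or its parameters) is treated as a particle and a diffusion-model score is pooled across particles through an RBF kernel. The function name `csd_step`, the fixed bandwidth `gamma`, and the assumption that per-particle score estimates (e.g., an SDS-style noise residual) are computed elsewhere are illustrative choices for this sketch, not the paper's exact implementation.

```python
import torch


def csd_step(particles: torch.Tensor, scores: torch.Tensor, gamma: float = 1e-4) -> torch.Tensor:
    """One SVGD-style Collaborative Score Distillation update (sketch).

    particles: (N, D) flattened images or parameters treated as SVGD particles.
    scores:    (N, D) per-particle score estimates from the diffusion prior,
               e.g. an SDS-style residual eps_model(x_t; y, t) - eps
               (the wrapper that produces it is assumed, not shown here).
    Returns:   (N, D) update directions that pool scores across all particles.
    """
    n = particles.shape[0]

    # Pairwise differences and an RBF kernel k(x_j, x_i) = exp(-gamma * ||x_j - x_i||^2).
    diffs = particles.unsqueeze(1) - particles.unsqueeze(0)   # (N, N, D), diffs[j, i] = x_j - x_i
    kernel = torch.exp(-gamma * (diffs ** 2).sum(-1))         # (N, N), symmetric

    # Attraction: each particle follows a kernel-weighted average of all particles'
    # scores, which couples the samples and encourages consistency across them.
    attraction = kernel @ scores                              # (N, D)

    # Repulsion: analytic gradient of the RBF kernel w.r.t. x_j, which keeps
    # the particles from collapsing onto each other and preserves diversity.
    repulsion = (-2.0 * gamma * kernel.unsqueeze(-1) * diffs).sum(dim=0)  # (N, D)

    return (attraction + repulsion) / n
```

In this sketch, setting gamma to zero for the off-diagonal terms would reduce each particle to an independent SDS-style update; the kernel-weighted pooling is what lets information flow between patches, frames, or views during optimization.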
(Left) Instruct-Pix2Pix, when applied to an image downscaled to 512 x 512, produces low-quality results and loses many details after editing. (Center) Our method, CSD-Edit, provides consistent image editing across patches, achieving the best fidelity to the given instruction. (Right) Instruct-Pix2Pix, when applied to cropped patches, results in inconsistent edits across patches.
CSD-Edit facilitates high-quality editing of images at resolutions beyond 4K (3991 x 4395) in accordance with the provided prompts.
CSD-Edit demonstrates consistent and coherent editing across patches in panorama images. This provides the unique ability to manipulate each patch according to a different instruction, while coherently maintaining the overall structure of the source image and ensuring smooth transitions between patches edited with different instructions.
CSD-Edit demonstrates consistent and coherent editing of objects in the source panorama image. |
We compare our method, CSD-Edit, with existing zero-shot video editing schemes that employ text-to-image diffusion models, including FateZero and Pix2Video. Additionally, we compare with Gen-1, a video diffusion model trained on a large video dataset.
Compared to these baselines, CSD-Edit offers three distinct advantages: 1) It facilitates consistent editing of source videos, providing coherent content throughout and ensuring smooth transitions between frames. 2) It produces high-quality video edits that properly reflect the given instructions. 3) It effectively preserves the background (i.e., regions irrelevant to the given instruction), thereby maintaining the overall context of the video.
We compare our method, CSD-Edit, with the existing 3D scene editing scheme Instruct-NeRF2NeRF. As CSD-Edit enables consistent editing of multi-view images, it delivers clear, high-quality edits without introducing blurred artifacts (e.g., in graphics- and anime-style edits). Moreover, our method allows for precise control over source 3D scenes (e.g., a smile) without causing substantial changes to other parts of the scene.
We compare our method, CSD, with Score Distillation Sampling (SDS), introduced in DreamFusion. By considering the scores of multi-view samples jointly, CSD provides three notable advantages: 1) As shown in the first row, CSD excels at capturing coherent geometry, outperforming SDS in this aspect. 2) As illustrated in the second row, CSD learns finer details than SDS. 3) Building on the second advantage, and as displayed in the third row, CSD can produce diverse, high-quality samples without changing random seeds.
This template was originally made by Subin Kim and Sihyun Yu for an NVP project.