DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation (Paper Reading Notes)
date
Dec 27, 2022
Last edited time
Mar 27, 2023 08:39 AM
status
Published
slug
DiffusionCLIP论文阅读
tags
DL
CV
summary
type
Post
Field
Plat
Abstract
- Problem
Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their application to diverse real images is still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulties reconstructing images with novel poses, views, and highly variable content compared to the training data, altering object identity, or producing unwanted image artifacts.
- Method
We propose DiffusionCLIP, a CLIP-guided robust image manipulation method based on diffusion models.
Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation.
Background
CLIP Guidance for Image Manipulation
To effectively extract knowledge from CLIP, two different losses have been proposed: a global target loss and a local directional loss.
- Target Loss
The global CLIP loss tries to minimize the cosine distance in the CLIP space between the generated image and a given target text:

$$\mathcal{L}_{\text{global}}(x_{\text{gen}}, y_{\text{tar}}) = D_{\text{CLIP}}(x_{\text{gen}}, y_{\text{tar}})$$

where $y_{\text{tar}}$ is the target text, $x_{\text{gen}}$ is the generated image, and $D_{\text{CLIP}}$ is the cosine distance between the CLIP embeddings of its two arguments.
- Local Directional Loss
Local directional loss is designed to alleviate the issues of the global CLIP loss, such as low diversity and susceptibility to adversarial attacks. It aligns the direction between the embeddings of the reference and generated images with the direction between the embeddings of the source and target texts:

$$\mathcal{L}_{\text{direction}}(x_{\text{gen}}, y_{\text{tar}}; x_{\text{ref}}, y_{\text{ref}}) := 1 - \frac{\langle \Delta I, \Delta T \rangle}{\|\Delta I\|\,\|\Delta T\|}$$

where

$$\Delta T = E_T(y_{\text{tar}}) - E_T(y_{\text{ref}}), \qquad \Delta I = E_I(x_{\text{gen}}) - E_I(x_{\text{ref}})$$

Here, $E_I$ and $E_T$ are CLIP's image and text encoders, respectively, and $y_{\text{ref}}$, $x_{\text{ref}}$ are the source domain text and image, respectively.
Manipulated images guided by the directional CLIP loss are known to be robust to mode collapse, since distinct reference images yield distinct image directions and therefore distinct outputs. The directional loss is also more robust to adversarial attacks.
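For concreteness, here is a minimal PyTorch sketch of the two losses using OpenAI's clip package. The helper names (`global_clip_loss`, `directional_clip_loss`), the bicubic resizing shortcut, and the input-range assumption are my own choices, not anything prescribed by the paper.

```python
# Minimal sketch of the global and directional CLIP losses (PyTorch + OpenAI's clip package).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_image(x):
    # x: (B, 3, H, W) in [-1, 1]; resize and renormalize to CLIP's input statistics
    x = F.interpolate((x + 1) / 2, size=(224, 224), mode="bicubic", align_corners=False)
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=x.device).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=x.device).view(1, 3, 1, 1)
    return model.encode_image((x - mean) / std)

def encode_text(prompt):
    return model.encode_text(clip.tokenize([prompt]).to(device))

def global_clip_loss(x_gen, target_text):
    # cosine distance between the generated image and the target text in CLIP space
    img = F.normalize(encode_image(x_gen), dim=-1)
    txt = F.normalize(encode_text(target_text), dim=-1)
    return 1 - (img * txt).sum(dim=-1).mean()

def directional_clip_loss(x_gen, x_ref, target_text, ref_text):
    # align the image-space edit direction with the text-space edit direction
    delta_i = encode_image(x_gen) - encode_image(x_ref)
    delta_t = encode_text(target_text) - encode_text(ref_text)
    return 1 - F.cosine_similarity(delta_i, delta_t, dim=-1).mean()
```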
Method
The input image $x_0$ is first converted to the latent $x_{t_0}(x_0)$ via deterministic forward diffusion using a pretrained diffusion model $\epsilon_\theta$; the reverse model is then fine-tuned so that samples generated from this latent match the target text.
DiffusionCLIP Fine-tuning
To fine-tune the reverse diffusion model $\epsilon_{\hat\theta}$, we use the following objective composed of the directional CLIP loss $\mathcal{L}_{\text{direction}}$ and the identity loss $\mathcal{L}_{\text{id}}$:

$$\mathcal{L}_{\text{direction}}\big(\hat{x}_0(\hat\theta), y_{\text{tar}}; x_0, y_{\text{ref}}\big) + \mathcal{L}_{\text{id}}\big(\hat{x}_0(\hat\theta), x_0\big)$$

where $x_0$ is the original image, $\hat{x}_0(\hat\theta)$ is the image generated from the latent $x_{t_0}$ with the fine-tuned parameters $\hat\theta$, and $y_{\text{ref}}$, $y_{\text{tar}}$ are the source and target text prompts. The identity loss combines an $\ell_1$ pixel loss with a face identity loss:

$$\mathcal{L}_{\text{id}}\big(\hat{x}_0(\hat\theta), x_0\big) = \lambda_{\text{L1}}\,\big\|x_0 - \hat{x}_0(\hat\theta)\big\|_1 + \lambda_{\text{face}}\,\mathcal{L}_{\text{face}}\big(\hat{x}_0(\hat\theta), x_0\big)$$
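A hedged sketch of one fine-tuning iteration under this objective, dropping the face term for brevity. `ddim_forward`/`ddim_reverse` stand in for the deterministic inversion and generation described in the next subsection, `directional_clip_loss` is from the sketch above, and `pretrained_eps`, `loader`, the learning rate, `t0`, and the loss weight are illustrative placeholders rather than the paper's settings.

```python
# Hedged sketch of a DiffusionCLIP fine-tuning step (all names are placeholders).
import copy
import torch

model_ft = copy.deepcopy(pretrained_eps)       # eps_theta_hat: trainable copy of the model
opt = torch.optim.Adam(model_ft.parameters(), lr=2e-6)
lambda_l1, t0 = 0.3, 400                       # illustrative identity weight / return step

for x0 in loader:                              # batches of source-domain images
    with torch.no_grad():                      # invert with the frozen pretrained model
        x_t0 = ddim_forward(pretrained_eps, x0, t0)
    x0_hat = ddim_reverse(model_ft, x_t0, t0)  # regenerate with the trainable copy
    loss = directional_clip_loss(x0_hat, x0, "smiling face", "face") \
           + lambda_l1 * (x0 - x0_hat).abs().mean()   # L1 identity term
    opt.zero_grad()
    loss.backward()
    opt.step()
```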
Forward Diffusion and Generative Process
To fully leverage the image synthesis performance of diffusion models for image manipulation, both the forward and reverse processes of the pretrained model must be deterministic, so that the latent faithfully reconstructs the input when no guidance is applied.
We adopt the deterministic reverse DDIM process as the generative process and the ODE approximation of its reversal as the forward diffusion process. The deterministic forward DDIM process to obtain the latent is:

$$x_{t+1} = \sqrt{\alpha_{t+1}}\, f_{\theta}(x_t, t) + \sqrt{1-\alpha_{t+1}}\, \epsilon_{\theta}(x_t, t)$$

and the deterministic reverse DDIM process to generate a sample from the obtained latent becomes:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\, f_{\theta}(x_t, t) + \sqrt{1-\alpha_{t-1}}\, \epsilon_{\theta}(x_t, t)$$

where $f_{\theta}(x_t, t)$ is the prediction of $x_0$ at $t$ given $x_t$ and $\epsilon_{\theta}(x_t, t)$:

$$f_{\theta}(x_t, t) = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_{\theta}(x_t, t)}{\sqrt{\alpha_t}}$$
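Below is a minimal PyTorch sketch of one deterministic DDIM step implementing the three equations above. The helper name `ddim_step`, the `alphas` table of cumulative products $\bar\alpha_t$ (with `alphas[0]` close to 1), the noise-predictor signature `eps_model(x, t)`, and the return step `t0 = 400` are my own assumptions, not the paper's code.

```python
# Minimal sketch of a deterministic DDIM step (assumed names, not the paper's code).
# alphas: 1-D tensor of cumulative products alpha_bar_t, with alphas[0] ~ 1.
# eps_model(x, t): noise predictor taking a batch of images and integer timesteps.
import torch

def ddim_step(eps_model, x_t, t, t_next, alphas):
    # t_next > t runs the forward (inversion) direction, t_next < t the reverse one
    ts = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, ts)
    f = (x_t - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()  # f_theta: predicted x_0
    return alphas[t_next].sqrt() * f + (1 - alphas[t_next]).sqrt() * eps

def invert_and_reconstruct(eps_model, x0, alphas, t0=400):
    # deterministic inversion to x_{t0}, then deterministic regeneration back to x_0
    x = x0
    for t in range(0, t0):          # forward: x_0 -> x_{t0}
        x = ddim_step(eps_model, x, t, t + 1, alphas)
    for t in range(t0, 0, -1):      # reverse: x_{t0} -> x_0
        x = ddim_step(eps_model, x, t, t - 1, alphas)
    return x
```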
Image Translation between Unseen Domains
We can perform image translation from an unseen domain to another unseen domain.
A key idea for addressing this difficult problem is to bridge the two domains through a diffusion model trained on a dataset that is relatively easy to collect, as sketched below.
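The sketch below illustrates the bridging idea, reusing the hypothetical `ddim_step` helper from the DDIM sketch above; `pretrained_eps`, `finetuned_eps`, `x_source`, `t0`, and `alphas` are placeholders. Partial deterministic forward diffusion pushes the unseen source image toward the pretrained domain, and the fine-tuned model then generates in the unseen target domain.

```python
# Hedged sketch of unseen-to-unseen translation via a bridging pretrained domain.
# pretrained_eps: model trained on an easy-to-collect dataset (e.g. faces);
# finetuned_eps: the same model fine-tuned toward the unseen target domain.
import torch

@torch.no_grad()
def translate_unseen(x_source, pretrained_eps, finetuned_eps, alphas, t0=500):
    x = x_source
    for t in range(0, t0):          # 1) invert the unseen source with the pretrained model
        x = ddim_step(pretrained_eps, x, t, t + 1, alphas)
    for t in range(t0, 0, -1):      # 2) generate with the fine-tuned model
        x = ddim_step(finetuned_eps, x, t, t - 1, alphas)
    return x
```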
Noise Combination
We discover that when the noises predicted by multiple fine-tuned models are combined during sampling, multiple attributes can be changed in a single sampling process, as described in Fig. 4(d).
In detail, we first invert the image with the original pretrained diffusion model and then sample using the multiple fine-tuned diffusion models $\epsilon_{\hat\theta_i}$ with the following sampling rule:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\alpha_t}\,\tilde\epsilon(x_t, t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\,\tilde\epsilon(x_t, t), \qquad \tilde\epsilon(x_t, t) = \sum_{i=1}^{M} \gamma_i(t)\,\epsilon_{\hat\theta_i}(x_t, t)$$

where $\{\gamma_i(t)\}_{i=1}^{M}$ is the sequence of weights of each fine-tuned model satisfying $\sum_{i=1}^{M}\gamma_i(t) = 1$.
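A sketch of one such combined sampling step, in the same style as the DDIM sketch above. `models` (a list of the fine-tuned noise predictors) and `gammas(t)` (returning per-model weights that sum to 1) are assumed names of my own.

```python
# Sketch of one multi-attribute sampling step: the noises from several fine-tuned
# models are mixed before the deterministic DDIM update.
import torch

def combined_ddim_step(models, x_t, t, t_next, alphas, gammas):
    ts = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    weights = gammas(t)                                   # gamma_i(t), sum_i gamma_i(t) = 1
    eps = sum(w * m(x_t, ts) for w, m in zip(weights, models))
    f = (x_t - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()  # predicted x_0
    return alphas[t_next].sqrt() * f + (1 - alphas[t_next]).sqrt() * eps
```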
Continuous transition
We can also apply the above noise combination method to control the degree of change during single-attribute manipulation. By mixing the noise from the original pretrained model and the fine-tuned model according to a degree of change $\gamma \in [0, 1]$, we can smoothly interpolate between the original image and the manipulated image.
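Continuous transition is then just a two-model special case of `combined_ddim_step` above; `pretrained_eps`, `finetuned_eps`, `x_t0`, `t0`, and `alphas` remain placeholders from the earlier sketches.

```python
# Continuous transition: mix pretrained and fine-tuned noise with a fixed degree gamma.
gamma = 0.5                          # 0.0 keeps the original image, 1.0 applies the full edit
x = x_t0                             # latent from deterministic inversion (placeholder)
for t in range(t0, 0, -1):
    x = combined_ddim_step([pretrained_eps, finetuned_eps], x, t, t - 1,
                           alphas, lambda _t: [1.0 - gamma, gamma])
```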