DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation (Paper Reading Notes)

date: Dec 27, 2022
Last edited time: Mar 27, 2023 08:39 AM
status: Published
slug: DiffusionCLIP论文阅读
tags: DL, CV
type: Post

Abstract

  • Problem
    • Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) have enabled zero-shot image manipulation guided by text prompts. However, applying them to diverse real images remains difficult due to limited GAN inversion capability. Specifically, these approaches often struggle to reconstruct images with novel poses, views, and highly variable content relative to the training data, and may alter object identity or produce unwanted image artifacts.
  • Method
    • We propose DiffusionCLIP, a CLIP-guided robust image manipulation method based on diffusion models.
      Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation.

Background

CLIP Guidance for Image Manipulation

To effectively extract knowledge from CLIP, two different losses have been proposed: a global target loss and a local directional loss.
  • Target Loss
    • The global CLIP loss tries to minimize the cosine distance in the CLIP space between the generated image and a given target text:
      $\mathcal{L}_{global}(x_{gen}, y_{tar}) = D_{CLIP}(x_{gen}, y_{tar})$
      where $y_{tar}$ is the text description of the target, $x_{gen}$ is the generated image, and $D_{CLIP}$ denotes the cosine distance in the CLIP space.
  • Local Directional Loss
    • The local directional loss is designed to alleviate the issues of the global CLIP loss, such as low diversity and susceptibility to adversarial attacks. It aligns the direction between the embeddings of the reference and generated images with the direction between the embeddings of a pair of reference and target texts in the CLIP space:
      $\mathcal{L}_{direction}(x_{gen}, y_{tar}; x_{ref}, y_{ref}) := 1 - \frac{\langle \Delta I, \Delta T \rangle}{\|\Delta I\| \|\Delta T\|}$
      where
      $\Delta T = E_T(y_{tar}) - E_T(y_{ref}), \quad \Delta I = E_I(x_{gen}) - E_I(x_{ref})$
      Here, $E_I$ and $E_T$ are CLIP's image and text encoders, respectively, and $y_{ref}$, $x_{ref}$ are the source domain text and image, respectively.
      Manipulated images guided by the directional CLIP loss are known to be robust to mode collapse and more resistant to adversarial attacks.
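As a sketch of the directional loss, the computation reduces to a cosine similarity between two difference vectors in CLIP space. Here the encoder outputs are stand-in lists of floats rather than real CLIP embeddings:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_clip_loss(img_gen_emb, img_ref_emb, txt_tar_emb, txt_ref_emb):
    # Delta I: how the image moved in CLIP space; Delta T: how the text moved.
    delta_i = [g - r for g, r in zip(img_gen_emb, img_ref_emb)]
    delta_t = [t - r for t, r in zip(txt_tar_emb, txt_ref_emb)]
    # Loss is zero when the image change is parallel to the text change.
    return 1.0 - cosine_similarity(delta_i, delta_t)
```

When the image-embedding shift is parallel to the text-embedding shift the loss is 0; when orthogonal, it is 1.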

Method

The input image $x_0$ is first converted to the latent $x_{t_0}$ via the deterministic forward process using a pretrained diffusion model $\epsilon_\theta$.

DiffusionCLIP Fine-tuning

To fine-tune the reverse diffusion model $\epsilon_{\hat{\theta}}$, we use the following objective composed of the directional CLIP loss $\mathcal{L}_{direction}$ and the identity loss $\mathcal{L}_{id}$:
$\mathcal{L}_{direction}(\hat{x}_0(\hat{\theta}), y_{tar}; x_0, y_{ref}) + \mathcal{L}_{id}(\hat{x}_0(\hat{\theta}), x_0)$
where $x_0$ is the original image, $\hat{x}_0(\hat{\theta})$ is the image generated from the latent with the fine-tuned parameters $\hat{\theta}$, and the identity loss (an $\ell_1$ pixel loss, optionally with a face identity loss) preserves the object's identity.
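A minimal sketch of how the total objective could be assembled, assuming the directional CLIP loss and face loss are already computed as scalars; the $\lambda$ weights below are illustrative placeholders, not the paper's settings:

```python
def l1_distance(x, y):
    # Mean pixel-wise L1 distance between two flattened images.
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def diffusionclip_objective(l_direction, x0, x0_hat, l_face=0.0,
                            lambda_l1=1.0, lambda_id=0.0):
    # Total fine-tuning loss: directional CLIP term plus
    # identity-preservation terms (L1 reconstruction, optional face loss).
    return l_direction + lambda_l1 * l1_distance(x0, x0_hat) + lambda_id * l_face
```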

Forward Diffusion and Generative Process

To fully leverage the image synthesis performance of diffusion models for image manipulation, we require a deterministic process in both the forward and reverse directions with pretrained diffusion models.
We adopt the deterministic reverse DDIM process as the generative process, and the ODE approximation of its reversal as the forward diffusion process. The deterministic forward DDIM process to obtain the latent is:
$x_{t+1} = \sqrt{\alpha_{t+1}}\, f_\theta(x_t, t) + \sqrt{1 - \alpha_{t+1}}\, \epsilon_\theta(x_t, t)$
and the deterministic reverse DDIM process to generate a sample from the obtained latent becomes:
$x_{t-1} = \sqrt{\alpha_{t-1}}\, f_\theta(x_t, t) + \sqrt{1 - \alpha_{t-1}}\, \epsilon_\theta(x_t, t)$
where $f_\theta(x_t, t)$ is the prediction of $x_0$ at $t$ given $x_t$ and $\epsilon_\theta(x_t, t)$:
$f_\theta(x_t, t) = \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}$
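A scalar sketch of the deterministic DDIM update, assuming the model's noise prediction `eps` is supplied externally; the same formula serves both directions, with `alpha_next` being the destination step (forward inversion or reverse generation):

```python
import math

def f_theta(x_t, eps, alpha_t):
    # Prediction of x_0 from x_t and the predicted noise eps.
    return (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)

def ddim_step(x_t, eps, alpha_t, alpha_next):
    # Deterministic DDIM transition from the step with alpha_t
    # to the adjacent step with alpha_next (no stochastic noise term).
    x0_pred = f_theta(x_t, eps, alpha_t)
    return math.sqrt(alpha_next) * x0_pred + math.sqrt(1.0 - alpha_next) * eps
```

Because the update is deterministic, stepping forward and then back with the same noise prediction recovers the original value exactly, which is what makes near-perfect inversion and reconstruction possible.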
 

Image Translation between Unseen Domains

We can perform image translation from an unseen domain to another unseen domain.
A key idea for addressing this difficult problem is to bridge the two domains through diffusion models trained on a dataset that is relatively easy to collect.

Noise Combination

We discover that when the noises predicted from multiple fine-tuned models are combined during sampling, multiple attributes can be changed through a single sampling process, as described in Fig. 4(d).
In detail, we first invert the image with the original pretrained diffusion model and then apply the multiple fine-tuned diffusion models $\{\epsilon_{\hat{\theta}_i}\}$ with the following sampling rule:
$x_{t-1} = \sqrt{\alpha_{t-1}} \sum_i \gamma_i(t)\, f_{\hat{\theta}_i}(x_t, t) + \sqrt{1 - \alpha_{t-1}} \sum_i \gamma_i(t)\, \epsilon_{\hat{\theta}_i}(x_t, t)$
where $\{\gamma_i(t)\}$ is the sequence of weights of each fine-tuned model satisfying $\sum_i \gamma_i(t) = 1$.
Continuous transition
We can also apply the above noise combination method to control the degree of change during single-attribute manipulation. By mixing the noise from the original pretrained model and the fine-tuned model according to a degree of change $\gamma \in [0, 1]$, we can interpolate smoothly between the original image and the manipulated image.
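A sketch of the weighted combination step, assuming each model's $x_0$-prediction and noise at step $t$ are given as scalars. Setting the weights to `[1 - gamma, gamma]` over the original and one fine-tuned model recovers the continuous transition described above:

```python
import math

def combined_ddim_step(x0_preds, eps_preds, gammas, alpha_prev):
    # One reverse DDIM step where the x0-predictions and noise predictions
    # of several models are mixed with weights gamma_i that sum to 1.
    assert abs(sum(gammas) - 1.0) < 1e-8
    f_mix = sum(g * f for g, f in zip(gammas, x0_preds))
    eps_mix = sum(g * e for g, e in zip(gammas, eps_preds))
    return math.sqrt(alpha_prev) * f_mix + math.sqrt(1.0 - alpha_prev) * eps_mix
```

With a single model and weight 1 this reduces to the ordinary deterministic reverse DDIM step, so multi-attribute manipulation costs no extra sampling passes.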

Experiments


Attributes Manipulation


© Lazurite 2021 - 2024