DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation (Paper Reading Notes)
date
Dec 27, 2022
Last edited time
Mar 27, 2023 08:39 AM
status
Published
slug
DiffusionCLIP论文阅读
tags
DL
CV
summary
type
Post
Field
Plat
Abstract
- Problem
Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their application to diverse real images is still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulties reconstructing images with novel poses, views, and highly variable content compared to the training data, altering object identity, or producing unwanted image artifacts.
- Method
We propose DiffusionCLIP, a CLIP-guided robust image manipulation method based on diffusion models.
Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation.
Background
CLIP Guidance for Image Manipulation
To effectively extract knowledge from CLIP, two different losses have been proposed: a global target loss and a local directional loss.
- Target Loss
The global CLIP loss tries to minimize the cosine distance in the CLIP space between the generated image and a given target text:

$$\mathcal{L}_{\text{global}}(x_{\text{gen}}, y_{\text{tar}}) = D_{\text{CLIP}}(x_{\text{gen}}, y_{\text{tar}})$$

where $y_{\text{tar}}$ is the target text, $x_{\text{gen}}$ is the generated image, and $D_{\text{CLIP}}$ is the cosine distance between the CLIP embeddings of its two arguments.
- Local Directional Loss
Local directional loss is designed to alleviate the issues of the global CLIP loss, such as low diversity and susceptibility to adversarial attacks. It aligns the direction between the embeddings of the reference and generated images with the direction between the embeddings of the source and target texts:

$$\mathcal{L}_{\text{direction}}(x_{\text{gen}}, y_{\text{tar}}; x_{\text{ref}}, y_{\text{ref}}) := 1 - \frac{\langle \Delta I, \Delta T \rangle}{\|\Delta I\|\,\|\Delta T\|}$$

where

$$\Delta T = E_T(y_{\text{tar}}) - E_T(y_{\text{ref}}), \qquad \Delta I = E_I(x_{\text{gen}}) - E_I(x_{\text{ref}})$$

Here, $E_I$ and $E_T$ are CLIP's image and text encoders, respectively, and $y_{\text{ref}}$, $x_{\text{ref}}$ are the source domain text and image, respectively.
Manipulated images guided by the directional CLIP loss are known to be robust to mode collapse, since distinct reference images yield distinct image directions and therefore distinct outputs. The directional loss is also more robust to adversarial attacks.
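For concreteness, here is a minimal PyTorch sketch of the two losses using OpenAI's clip package. The helper names (`global_clip_loss`, `directional_clip_loss`), the bicubic resizing shortcut, and the input-range assumption are my own choices, not anything prescribed by the paper.

```python
# Minimal sketch of the global and directional CLIP losses (PyTorch + OpenAI's clip package).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_image(x):
    # x: (B, 3, H, W) in [-1, 1]; resize and renormalize to CLIP's input statistics
    x = F.interpolate((x + 1) / 2, size=(224, 224), mode="bicubic", align_corners=False)
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=x.device).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=x.device).view(1, 3, 1, 1)
    return model.encode_image((x - mean) / std)

def encode_text(prompt):
    return model.encode_text(clip.tokenize([prompt]).to(device))

def global_clip_loss(x_gen, target_text):
    # cosine distance between the generated image and the target text in CLIP space
    img = F.normalize(encode_image(x_gen), dim=-1)
    txt = F.normalize(encode_text(target_text), dim=-1)
    return 1 - (img * txt).sum(dim=-1).mean()

def directional_clip_loss(x_gen, x_ref, target_text, ref_text):
    # align the image-space edit direction with the text-space edit direction
    delta_i = encode_image(x_gen) - encode_image(x_ref)
    delta_t = encode_text(target_text) - encode_text(ref_text)
    return 1 - F.cosine_similarity(delta_i, delta_t, dim=-1).mean()
```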
Method
The input image $x_0$ is first converted to the latent $x_{t_0}(x_0)$ via deterministic forward diffusion using a pretrained diffusion model $\epsilon_\theta$; the reverse model is then fine-tuned so that samples generated from this latent match the target text.
DiffusionCLIP Fine-tuning
To fine-tune the reverse diffusion model $\epsilon_{\hat\theta}$, we use the following objective composed of the directional CLIP loss $\mathcal{L}_{\text{direction}}$ and the identity loss $\mathcal{L}_{\text{id}}$:

$$\mathcal{L}_{\text{direction}}\big(\hat{x}_0(\hat\theta), y_{\text{tar}}; x_0, y_{\text{ref}}\big) + \mathcal{L}_{\text{id}}\big(\hat{x}_0(\hat\theta), x_0\big)$$

where $x_0$ is the original image, $\hat{x}_0(\hat\theta)$ is the image generated from the latent $x_{t_0}$ with the fine-tuned parameters $\hat\theta$, and $y_{\text{ref}}$, $y_{\text{tar}}$ are the source and target text prompts. The identity loss combines an $\ell_1$ pixel loss with a face identity loss:

$$\mathcal{L}_{\text{id}}\big(\hat{x}_0(\hat\theta), x_0\big) = \lambda_{\text{L1}}\,\big\|x_0 - \hat{x}_0(\hat\theta)\big\|_1 + \lambda_{\text{face}}\,\mathcal{L}_{\text{face}}\big(\hat{x}_0(\hat\theta), x_0\big)$$
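A hedged sketch of one fine-tuning iteration under this objective, dropping the face term for brevity. `ddim_forward`/`ddim_reverse` stand in for the deterministic inversion and generation described in the next subsection, `directional_clip_loss` is from the sketch above, and `pretrained_eps`, `loader`, the learning rate, `t0`, and the loss weight are illustrative placeholders rather than the paper's settings.

```python
# Hedged sketch of a DiffusionCLIP fine-tuning step (all names are placeholders).
import copy
import torch

model_ft = copy.deepcopy(pretrained_eps)       # eps_theta_hat: trainable copy of the model
opt = torch.optim.Adam(model_ft.parameters(), lr=2e-6)
lambda_l1, t0 = 0.3, 400                       # illustrative identity weight / return step

for x0 in loader:                              # batches of source-domain images
    with torch.no_grad():                      # invert with the frozen pretrained model
        x_t0 = ddim_forward(pretrained_eps, x0, t0)
    x0_hat = ddim_reverse(model_ft, x_t0, t0)  # regenerate with the trainable copy
    loss = directional_clip_loss(x0_hat, x0, "smiling face", "face") \
           + lambda_l1 * (x0 - x0_hat).abs().mean()   # L1 identity term
    opt.zero_grad()
    loss.backward()
    opt.step()
```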
Forward Diffusion and Generative Process
To fully leverage the image synthesis performance of diffusion models for image manipulation, both the forward and reverse processes of the pretrained model must be deterministic, so that the latent faithfully reconstructs the input when no guidance is applied.
We adopt the deterministic reverse DDIM process as the generative process and the ODE approximation of its reversal as the forward diffusion process. The deterministic forward DDIM process to obtain the latent is:

$$x_{t+1} = \sqrt{\alpha_{t+1}}\, f_{\theta}(x_t, t) + \sqrt{1-\alpha_{t+1}}\, \epsilon_{\theta}(x_t, t)$$

and the deterministic reverse DDIM process to generate a sample from the obtained latent becomes:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\, f_{\theta}(x_t, t) + \sqrt{1-\alpha_{t-1}}\, \epsilon_{\theta}(x_t, t)$$

where $f_{\theta}(x_t, t)$ is the prediction of $x_0$ at $t$ given $x_t$ and $\epsilon_{\theta}(x_t, t)$:

$$f_{\theta}(x_t, t) = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_{\theta}(x_t, t)}{\sqrt{\alpha_t}}$$
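Below is a minimal PyTorch sketch of one deterministic DDIM step implementing the three equations above. The helper name `ddim_step`, the `alphas` table of cumulative products $\bar\alpha_t$ (with `alphas[0]` close to 1), the noise-predictor signature `eps_model(x, t)`, and the return step `t0 = 400` are my own assumptions, not the paper's code.

```python
# Minimal sketch of a deterministic DDIM step (assumed names, not the paper's code).
# alphas: 1-D tensor of cumulative products alpha_bar_t, with alphas[0] ~ 1.
# eps_model(x, t): noise predictor taking a batch of images and integer timesteps.
import torch

def ddim_step(eps_model, x_t, t, t_next, alphas):
    # t_next > t runs the forward (inversion) direction, t_next < t the reverse one
    ts = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, ts)
    f = (x_t - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()  # f_theta: predicted x_0
    return alphas[t_next].sqrt() * f + (1 - alphas[t_next]).sqrt() * eps

def invert_and_reconstruct(eps_model, x0, alphas, t0=400):
    # deterministic inversion to x_{t0}, then deterministic regeneration back to x_0
    x = x0
    for t in range(0, t0):          # forward: x_0 -> x_{t0}
        x = ddim_step(eps_model, x, t, t + 1, alphas)
    for t in range(t0, 0, -1):      # reverse: x_{t0} -> x_0
        x = ddim_step(eps_model, x, t, t - 1, alphas)
    return x
```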
Image Translation between Unseen Domains
We can perform image translation from an unseen domain to another unseen domain.
A key idea for addressing this difficult problem is to bridge the two domains through a diffusion model trained on a dataset that is relatively easy to collect, as sketched below.
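The sketch below illustrates the bridging idea, reusing the hypothetical `ddim_step` helper from the DDIM sketch above; `pretrained_eps`, `finetuned_eps`, `x_source`, `t0`, and `alphas` are placeholders. Partial deterministic forward diffusion pushes the unseen source image toward the pretrained domain, and the fine-tuned model then generates in the unseen target domain.

```python
# Hedged sketch of unseen-to-unseen translation via a bridging pretrained domain.
# pretrained_eps: model trained on an easy-to-collect dataset (e.g. faces);
# finetuned_eps: the same model fine-tuned toward the unseen target domain.
import torch

@torch.no_grad()
def translate_unseen(x_source, pretrained_eps, finetuned_eps, alphas, t0=500):
    x = x_source
    for t in range(0, t0):          # 1) invert the unseen source with the pretrained model
        x = ddim_step(pretrained_eps, x, t, t + 1, alphas)
    for t in range(t0, 0, -1):      # 2) generate with the fine-tuned model
        x = ddim_step(finetuned_eps, x, t, t - 1, alphas)
    return x
```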
Noise Combination
We discover that when the noises predicted by multiple fine-tuned models are combined during sampling, multiple attributes can be changed in a single sampling process, as described in Fig. 4(d).
In detail, we first invert the image with the original pretrained diffusion model and then sample using the multiple fine-tuned diffusion models $\epsilon_{\hat\theta_i}$ with the following sampling rule:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\alpha_t}\,\tilde\epsilon(x_t, t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\,\tilde\epsilon(x_t, t), \qquad \tilde\epsilon(x_t, t) = \sum_{i=1}^{M} \gamma_i(t)\,\epsilon_{\hat\theta_i}(x_t, t)$$

where $\{\gamma_i(t)\}_{i=1}^{M}$ is the sequence of weights of each fine-tuned model satisfying $\sum_{i=1}^{M}\gamma_i(t) = 1$.
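A sketch of one such combined sampling step, in the same style as the DDIM sketch above. `models` (a list of the fine-tuned noise predictors) and `gammas(t)` (returning per-model weights that sum to 1) are assumed names of my own.

```python
# Sketch of one multi-attribute sampling step: the noises from several fine-tuned
# models are mixed before the deterministic DDIM update.
import torch

def combined_ddim_step(models, x_t, t, t_next, alphas, gammas):
    ts = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    weights = gammas(t)                                   # gamma_i(t), sum_i gamma_i(t) = 1
    eps = sum(w * m(x_t, ts) for w, m in zip(weights, models))
    f = (x_t - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()  # predicted x_0
    return alphas[t_next].sqrt() * f + (1 - alphas[t_next]).sqrt() * eps
```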
Continuous transition
We can also apply the above noise combination method to control the degree of change during single-attribute manipulation. By mixing the noise from the original pretrained model and the fine-tuned model according to a degree of change $\gamma \in [0, 1]$, we can smoothly interpolate between the original image and the manipulated image.
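Continuous transition is then just a two-model special case of `combined_ddim_step` above; `pretrained_eps`, `finetuned_eps`, `x_t0`, `t0`, and `alphas` remain placeholders from the earlier sketches.

```python
# Continuous transition: mix pretrained and fine-tuned noise with a fixed degree gamma.
gamma = 0.5                          # 0.0 keeps the original image, 1.0 applies the full edit
x = x_t0                             # latent from deterministic inversion (placeholder)
for t in range(t0, 0, -1):
    x = combined_ddim_step([pretrained_eps, finetuned_eps], x, t, t - 1,
                           alphas, lambda _t: [1.0 - gamma, gamma])
```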