DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery (Paper Notes)

date: May 25, 2023
Last edited time: May 25, 2023 03:37 PM
status: Published
slug: DiffusionSeg论文阅读
tags: DL, CV, DDPM
summary: A pity there is no code
type: Post
Field: Plat
💡
Generates Image-Mask pairs from a pretrained diffusion generative model, and the results look quite good. A shame it is not open-sourced. Waiting for ICCV to see whether the code appears.

Abstract

Current foundation models fall into two categories: discriminative and generative. Discriminative models (e.g., MoCo, DINO, and CLIP) are trained by aligning images of the same category or under corresponding captions, so they excel at high-level semantic tasks such as classification and retrieval. Generative models (e.g., MAE and Diffusion) capture both low-level visual knowledge and high-level semantic relations, making them better suited to pixel-level tasks such as reconstruction and segmentation. However, generative pretraining, despite holding both low-level and high-level visual knowledge, so far remains confined to a limited set of low-level applications such as image generation, colorization, and inpainting.

Motivation

Is generative pretraining also valuable for mainstream discriminative tasks?
This paper proposes DiffusionSeg, which exploits a pretrained diffusion model for unsupervised object discovery. The method automatically discovers objects in images and produces pixel-level object masks. DiffusionSeg designs a novel synthesis-exploitation framework: it explicitly extracts image-mask pairs through data synthesis, and exploits the model's internal features through diffusion inversion.

Method

This paper aims to exploit the pixel-level visual knowledge of a pretrained diffusion generative model for downstream discriminative tasks, such as object discovery (OD).

Image Generation

We take a pretrained text-to-image Stable Diffusion model, freeze it, and generate images by feeding in random Gaussian noise together with a class-name text prompt. The class names are sampled from ImageNet.
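Since no code is released, the generation step can only be sketched. Below is a minimal sketch assuming the `diffusers` library; the prompt template and the function names (`build_prompts`, `generate_images`) are our own assumptions, as the paper only states that class names are sampled from ImageNet:

```python
import random

# Hypothetical prompt template: the paper does not specify the exact wording.
PROMPT_TEMPLATE = "a photograph of a {name}"

def build_prompts(class_names, n, seed=0):
    """Sample class names and turn them into text prompts."""
    rng = random.Random(seed)
    return [PROMPT_TEMPLATE.format(name=rng.choice(class_names)) for _ in range(n)]

def generate_images(prompts):
    """Generate images with a frozen, pretrained Stable Diffusion model.
    Requires the `diffusers` package and a large model download, so the
    imports are kept local to the function."""
    import torch
    from diffusers import StableDiffusionPipeline
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    with torch.no_grad():  # the diffusion model stays frozen
        return pipe(prompts).images

prompts = build_prompts(["tabby cat", "golden retriever"], n=2)
print(prompts)
```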

Mask Generation

We generate high-quality masks by exploiting the attention inside the pretrained diffusion model as cues, based on two key observations:
  1. Cross-attention captures the locality between the conditioning text and the noisy image, and thus roughly delineates objectness.
  2. Self-attention captures the pairwise semantic similarity between pixels, and thus roughly delineates the coherence of the image.
Motivated by this, we propose AttentionCut, a training-free strategy that generates masks guided by the attention maps.

AttentionCut

Preparations.
We first extract the cross-attention maps $\mathcal{A}^{crs}_{l,t}$ and self-attention maps $\mathcal{A}^{slf}_{l,t}$ at the position of the category token in the prompt sentence, then aggregate different resolutions and timesteps, considering multi-scale objects and avoiding focus shift during diffusion. Formally,
$$\mathcal{A}^{crs} = \frac{1}{|K|}\sum_{(l,t)\in K}\mathcal{A}^{crs}_{l,t}, \qquad \mathcal{A}^{slf} = \frac{1}{|L||T|}\sum_{l\in L}\sum_{t\in T}\mathcal{A}^{slf}_{l,t},$$
where $t \in T$ indexes each reverse step and $l \in L$ the intermediate layers. $\mathcal{A}^{crs}$ is averaged among the top-$k$ maps (the set $K$) ranked by standard deviation over all $\mathcal{A}^{crs}_{l,t}$, while $\mathcal{A}^{slf}$ is averaged over all layers and time steps.
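The aggregation above can be sketched in NumPy. The array layout (maps stacked along a leading layer/timestep axis) and the function name `aggregate_attention` are our own assumptions:

```python
import numpy as np

def aggregate_attention(crs_maps, slf_maps, k=3):
    """Aggregate attention maps over layers and timesteps: the cross-attention
    is averaged over the top-k maps ranked by standard deviation (sharper maps
    localize the object better), while the self-attention is simply averaged
    over all layers and timesteps.
    crs_maps: (N, H, W) stacked cross-attention maps (all layers x steps).
    slf_maps: (N, HW, HW) stacked self-attention maps."""
    stds = crs_maps.reshape(len(crs_maps), -1).std(axis=1)
    top = np.argsort(stds)[-k:]          # indices of the k sharpest maps
    crs = crs_maps[top].mean(axis=0)     # (H, W)
    slf = slf_maps.mean(axis=0)          # (HW, HW)
    return crs, slf

rng = np.random.default_rng(0)
crs, slf = aggregate_attention(rng.random((8, 16, 16)), rng.random((8, 256, 256)))
print(crs.shape, slf.shape)  # → (16, 16) (256, 256)
```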
Objectness.
Intuitively, the pixel-level cross-attention $\mathcal{A}^{crs}$ under a specific category can roughly be seen as a segmentation mask, as it indicates how likely a pixel is to belong to that category. However, in practice we found $\mathcal{A}^{crs}$ is sparse and inattentive near the boundary, which can seriously damage segmentation results. To handle this issue, we improve $\mathcal{A}^{crs}$ by strengthening the edge area with the self-attention $\mathcal{A}^{slf}$, which indicates semantic connectivity, i.e., how strongly two pixels semantically belong to one group. Specifically, we first randomly select a set of initial seeds $\mathcal{S}$ from the boundary of the binarized mask of $\mathcal{A}^{crs}$. Each selected seed $s \in \mathcal{S}$ then expands into a confidence map $\mathcal{A}^{slf}_{s}$, the self-attention between $s$ and all other pixels, indicating the weights of the boundary area. We assume $\mathcal{A}^{slf}_{s,i} = \mathcal{A}^{slf}_{i,s}$, as self-attention is theoretically symmetric. For pixel $i$, these maps are averaged into a refined map $\mathcal{R}$ to reinforce the boundary pixels:
$$\mathcal{R}_i = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} \mathcal{A}^{slf}_{s,i}.$$
Combining the cross-attention and the refined map with a balance weight $\lambda$, the pixel-level objectness is:
$$\mathcal{O}_i = \mathcal{A}^{crs}_i + \lambda\,\mathcal{R}_i,$$
where $\mathcal{A}^{crs}_i$ is the cross-attention at pixel $i$.
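A minimal NumPy sketch of this refinement step, assuming our own notation throughout; the threshold `tau`, weight `lam`, and seed count `n_seeds` are hypothetical hyperparameters not given in the text:

```python
import numpy as np

def objectness(crs, slf, tau=0.5, lam=0.5, n_seeds=8, seed=0):
    """Sketch of AttentionCut's objectness: binarize the cross-attention,
    sample seed pixels on the mask boundary, average their self-attention
    rows into a refined map R, and combine O = crs + lam * R.
    crs: (H, W) cross-attention; slf: (HW, HW) self-attention."""
    h, w = crs.shape
    mask = crs > tau
    # boundary = mask pixels with at least one background 4-neighbour
    pad = np.pad(mask, 1)
    neigh_bg = (~pad[:-2, 1:-1] | ~pad[2:, 1:-1] |
                ~pad[1:-1, :-2] | ~pad[1:-1, 2:])
    boundary = np.flatnonzero(mask & neigh_bg)
    if boundary.size == 0:
        return crs
    rng = np.random.default_rng(seed)
    seeds = rng.choice(boundary, size=min(n_seeds, boundary.size), replace=False)
    refined = slf[seeds].mean(axis=0).reshape(h, w)  # averaged confidence maps
    return crs + lam * refined

crs = np.zeros((4, 4)); crs[1:3, 1:3] = 0.9   # toy 2x2 "object"
slf = np.full((16, 16), 0.2)                   # toy uniform self-attention
print(objectness(crs, slf).shape)  # → (4, 4)
```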
💡
Use the self-attention $\mathcal{A}^{slf}$ to complement the cross-attention $\mathcal{A}^{crs}$.
Inner Coherence.
With objectness alone, we found that the masks tend to lose local information, for example irregular corners, mis-segmented holes, or jagged contours. This can be solved by taking local consistency into account, i.e., how likely two neighboring pixels are to belong to one group. Here we design an inner coherence term that enforces continuity, proximity, and smoothness of segments belonging to the same object, and penalizes those that deviate.
The proposed inner coherence consists of two parts: semantic and spatial. As mentioned above, $\mathcal{A}^{slf}$ can indicate semantic coherence, as self-attention is calculated in the semantic feature space. Spatial coherence is designed to indicate the pairwise distance between pixels in both RGB and Euclidean space. It is obtained by adopting the form of a geodesic distance on the surface of the image intensity, followed by a negative exponential transformation. The inner coherence can be formalized as:
$$\mathcal{C}_{i,j} = \mathcal{A}^{slf}_{i,j} + \exp\!\big(-\mathcal{D}_{geo}(i,j)\big), \qquad \mathcal{D}_{geo}(i,j) = \min_{P_{i,j}} \int_0^1 \left|\nabla I\big(P_{i,j}(s)\big) \cdot \dot{P}_{i,j}(s)\right| ds,$$
where, for pixels $i$ and $j$, $\mathcal{A}^{slf}_{i,j}$ is the self-attention and $\mathcal{D}_{geo}(i,j)$ is the geodesic distance; $P_{i,j}$ is an arbitrary path from $i$ to $j$ parameterized by $s$; $\dot{P}_{i,j}(s)$ denotes the unit vector tangent to the path direction; $I$ is the image RGB intensity.
💡
The geodesic distance is the minimum accumulated image gradient along a spatial path between two semantically identical points: the smaller it is, the higher the structural similarity.
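The path integral has a natural discrete approximation: Dijkstra's shortest path over the 4-connected pixel grid, where stepping between adjacent pixels costs the intensity difference. A sketch under that assumption (the function name and grayscale input are ours):

```python
import heapq
import numpy as np

def geodesic_distance(intensity, src, dst):
    """Discrete geodesic distance on the image-intensity surface: Dijkstra
    over the 4-connected grid, where stepping from pixel p to a neighbour q
    costs |I(q) - I(p)|, i.e., the accumulated image gradient along the path."""
    h, w = intensity.shape
    dist = np.full((h, w), np.inf)
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == dst:
            return d
        if d > dist[y, x]:
            continue  # stale heap entry
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + abs(float(intensity[ny, nx]) - float(intensity[y, x]))
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, (ny, nx)))
    return dist[dst]

img = np.zeros((4, 4)); img[:, 2:] = 1.0   # a vertical intensity edge
print(geodesic_distance(img, (0, 0), (0, 3)))  # → 1.0 (crosses the edge once)
```

Staying inside a flat region costs nothing, while crossing the intensity edge accumulates gradient, which is exactly the structural-similarity intuition of the callout above.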

Calculating Mask

Uses EMBS.
The binary mask $\mathcal{M}$ is generated by minimizing an energy that combines the unary objectness $\mathcal{O}$ and the pairwise inner coherence $\mathcal{C}$, i.e., by finding the minimum cut in the image graph with the Ford-Fulkerson algorithm.
💡
Presumably quite slow.
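For illustration, a compact Edmonds-Karp implementation (the BFS-based variant of Ford-Fulkerson) that returns the source side of the minimum cut. This is a generic sketch on a toy graph, not the paper's actual graph construction: in AttentionCut's setting the terminals `s`/`t` would be foreground/background, with unary objectness giving terminal edge weights and pairwise coherence the neighbour edges.

```python
from collections import deque

def min_cut(capacity, s, t):
    """Max-flow via Edmonds-Karp; returns the source-side node set of the
    minimum cut. `capacity` is a dict of dicts: capacity[u][v] = capacity."""
    # Build residual graph, adding zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    while True:
        # BFS for an augmenting path in the residual graph.
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        # Find the bottleneck and push flow along the path.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
    # Source side of the cut = nodes reachable in the final residual graph.
    side, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v, c in res[u].items():
            if c > 0 and v not in side:
                side.add(v)
                q.append(v)
    return side

caps = {"s": {"a": 10, "b": 10}, "a": {"t": 1}, "b": {"t": 1}, "t": {}}
print(sorted(min_cut(caps, "s", "t")))  # → ['a', 'b', 's']
```

On a pixel graph the number of BFS passes grows with image size, which is why the speed concern in the callout above seems justified; practical implementations use optimized solvers such as the Boykov-Kolmogorov algorithm.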

Experiments


© Lazurite 2021 - 2024