A Generalist Framework for Panoptic Segmentation of Images and Videos 论文阅读

date

Dec 1, 2022

Last edited time

Mar 28, 2023 08:05 AM

status

Published

slug

A_Generalist_Framework_for_Panoptic_Segmentation_of_Images_and_Videos论文阅读

Abstract

Problem formulate

we formulate the panoptic segmentation task as a conditional discrete data generation problem. Panoptic segmentation masks can be expressed with two channels, .The first represents the category/class label. The second is the instance ID.

However generative modeling for panoptic segmentation is very challenging as the panoptic masks are discrete/categorical and can be very large. To generate a 512×1024 panoptic mask, for example, the model has to produce more than 1M discrete tokens (of semantic and instance labels). This is expensive for auto-regressive models as they are inherently sequential, scaling poorly with the size of data input.

Diffusion Model

Diffusion models are better at handling high dimension data but they are most commonly applied to continuous rather than discrete domains.

💡

For the instance ID generation.

By representing discrete data with analog bits we show that one can train a diffusion model on large panoptic masks directly.

Method

We intentionally separate the network into two components: 1) an image encoder; and 2) a mask decoder. The former maps raw pixel data into high-level representation vectors, and then the mask decoder iteratively reads out the panoptic mask.

💡

Then we can extract image feature only once during inference.

Pixel/image Encoder

The encoder is a network that maps a raw image into a feature map in where and are the height and width of the panoptic mask.

Mask Decoder

It takes as input the concatenation of image feature map from encoder and a noisy mask (randomly initialized or from previous step), and outputs a refined prediction of the mask.

Analog Bits with Input Scaling

The analog bits are real numbers converted from the integers of panoptic masks. When constructing the analog bits, we can shift and scale them into .

Typically, b is set to be 1 but we find that adjusting this scaling factor has an significant effect on the performance of the model. This scaling factor effectively allows one to adjust the signal-to-noise ratio of the diffusion process (or the noise schedule), as visualized in Fig. 3. With , we see that even with a large time step (with ), the signal-to-noise ratio is still relatively high, so the masks are visible to naked eye and the model can easily recover the mask without using the encoded image features. With , however, the denoising task becomes significantly harder as the signal-to-noise ratio is reduced. In our study, we find works substantially better than the default of .