A Generalist Framework for Panoptic Segmentation of Images and Videos 论文阅读

date
Dec 1, 2022
Last edited time
Mar 28, 2023 08:05 AM
status
Published
slug
A_Generalist_Framework_for_Panoptic_Segmentation_of_Images_and_Videos论文阅读
tags
DL
DDPM
summary
type
Post
Field
Plat

Abstract

  • Problem formulate
    • notion image
      we formulate the panoptic segmentation task as a conditional discrete data generation problem. Panoptic segmentation masks can be expressed with two channels, .The first represents the category/class label. The second is the instance ID.
      However generative modeling for panoptic segmentation is very challenging as the panoptic masks are discrete/categorical and can be very large. To generate a 512×1024 panoptic mask, for example, the model has to produce more than 1M discrete tokens (of semantic and instance labels). This is expensive for auto-regressive models as they are inherently sequential, scaling poorly with the size of data input.
  • Diffusion Model
    • Diffusion models are better at handling high dimension data but they are most commonly applied to continuous rather than discrete domains.
      💡
      For the instance ID generation.
      By representing discrete data with analog bits we show that one can train a diffusion model on large panoptic masks directly.

Method

notion image
We intentionally separate the network into two components: 1) an image encoder; and 2) a mask decoder. The former maps raw pixel data into high-level representation vectors, and then the mask decoder iteratively reads out the panoptic mask.
💡
Then we can extract image feature only once during inference.
notion image
Pixel/image Encoder
The encoder is a network that maps a raw image into a feature map in where and are the height and width of the panoptic mask.
Mask Decoder
It takes as input the concatenation of image feature map from encoder and a noisy mask (randomly initialized or from previous step), and outputs a refined prediction of the mask.

Analog Bits with Input Scaling

The analog bits are real numbers converted from the integers of panoptic masks. When constructing the analog bits, we can shift and scale them into .
Typically, b is set to be 1 but we find that adjusting this scaling factor has an significant effect on the performance of the model. This scaling factor effectively allows one to adjust the signal-to-noise ratio of the diffusion process (or the noise schedule), as visualized in Fig. 3. With , we see that even with a large time step (with ), the signal-to-noise ratio is still relatively high, so the masks are visible to naked eye and the model can easily recover the mask without using the encoded image features. With , however, the denoising task becomes significantly harder as the signal-to-noise ratio is reduced. In our study, we find works substantially better than the default of .
notion image
notion image

Softmax Cross Entropy Loss

We also discovered that a loss based on softmax cross entropy yields better performance than denoising loss.
notion image

Loss Weighting

For panoptic segmentation, with a loss defined over pixels, this means that large objects will have more influence than small objects.
We use an adaptive loss to improve the segmentation of small instances by giving higher weights to mask tokens associated with small objects.
notion image

Experiments

notion image
notion image
notion image
 

© Lazurite 2021 - 2023