Derivation of formulas in Score-Based Models

date
May 22, 2023
Last edited time
Jul 8, 2023 08:36 AM
status
Published
slug
Derivation-of-formulas-in-Score-Based-Models
tags
DL
type
Post

Introduction

Score and Score-Based Models

Given a probability density function $p(\mathbf{x})$, we define the score as
$$\nabla_\mathbf{x} \log p(\mathbf{x}).$$
As you might guess, score-based generative models are trained to estimate $\nabla_\mathbf{x} \log p(\mathbf{x})$. Unlike likelihood-based models such as flow models or autoregressive models, score-based models do not have to be normalized and are easier to parameterize.
For example, consider a non-normalized statistical model $p_\theta(\mathbf{x}) = \frac{e^{-E_\theta(\mathbf{x})}}{Z_\theta}$, where $E_\theta(\mathbf{x}) \in \mathbb{R}$ is called the energy function and $Z_\theta$ is an unknown normalizing constant that makes $p_\theta(\mathbf{x})$ a proper probability density function. The energy function is typically parameterized by a flexible neural network. When training it as a likelihood model, we need to know the normalizing constant $Z_\theta$ by computing complex high-dimensional integrals, which is typically intractable. In contrast, when computing its score, we obtain
$$\nabla_\mathbf{x} \log p_\theta(\mathbf{x}) = -\nabla_\mathbf{x} E_\theta(\mathbf{x}),$$
which does not require computing the normalizing constant $Z_\theta$.
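To make this concrete, here is a minimal PyTorch sketch (the two-layer `energy` network and the input dimensionality are illustrative assumptions, not anything prescribed above) showing that the score of an energy-based model can be computed by automatic differentiation, with $Z_\theta$ never appearing:

```python
import torch
import torch.nn as nn

# Hypothetical energy function E_theta: R^2 -> R, parameterized by a small MLP.
energy = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

x = torch.randn(16, 2, requires_grad=True)  # batch of 16 two-dimensional inputs
E = energy(x).sum()                          # summing over the batch yields per-sample gradients
score = -torch.autograd.grad(E, x)[0]        # score = -grad_x E_theta(x); Z_theta never appears
print(score.shape)                           # torch.Size([16, 2]) -- same shape as the input
```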
In fact, any neural network that maps an input vector to an output vector can be used as a score-based model, as long as the output and input have the same dimensionality. This yields huge flexibility in choosing model architectures.
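For instance, a plain MLP that conditions on time by concatenating $t$ to its input already qualifies as a time-dependent score model. The sketch below is one such hypothetical architecture; the hidden width and the simple concatenation-based time conditioning are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """A hypothetical score network s_theta(x, t): input and output share dimensionality."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),  # output has the same dimensionality as x
        )

    def forward(self, x, t):
        # Condition on time by simply concatenating t to the input.
        return self.net(torch.cat([x, t[:, None]], dim=-1))

# Usage: a batch of 4 samples with their time variables.
out = ScoreNet()(torch.randn(4, 2), torch.rand(4))
print(out.shape)  # torch.Size([4, 2])
```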

Perturbing Data with a Diffusion Process

In order to generate samples with score-based models, we need to consider a diffusion process that slowly corrupts data into random noise. Scores will arise when we reverse this diffusion process for sample generation, as you will see later in this post.
A diffusion process is a stochastic process similar to Brownian motion. Its paths are like the trajectory of a particle submerged in a flowing fluid, which moves randomly due to unpredictable collisions with other particles. Let $\{\mathbf{x}(t)\}_{t \in [0, T]}$ be a diffusion process, indexed by the continuous time variable $t \in [0, T]$. A diffusion process is governed by a stochastic differential equation (SDE) of the following form:
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w},$$

💡
The perturbation kernels of this SDE have the general form (assuming an affine drift $\mathbf{f}(\mathbf{x}, t) = f(t)\,\mathbf{x}$):
$$p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) = \mathcal{N}\big(\mathbf{x}(t);\, s(t)\,\mathbf{x}(0),\, s(t)^2\sigma(t)^2\mathbf{I}\big), \quad s(t) = \exp\!\left(\int_0^t f(\xi)\,d\xi\right), \quad \sigma(t) = \sqrt{\int_0^t \frac{g(\xi)^2}{s(\xi)^2}\,d\xi},$$
where $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$ denotes the probability density function of $\mathbf{x}(t)$ given $\mathbf{x}(0)$, evaluated at $\mathbf{x}(t)$.
The marginal distribution $p_t(\mathbf{x})$ is obtained by integrating the perturbation kernels over $p_0$:
$$p_t(\mathbf{x}) = \int p_{0t}(\mathbf{x} \mid \mathbf{x}_0)\, p_0(\mathbf{x}_0)\, d\mathbf{x}_0.$$
(From Appendix B.1 of the EDM supplementary material.)
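As a quick sanity check on these formulas, here is a minimal Python sketch that evaluates $s(t)$ and $\sigma(t)$ by numerical quadrature; the coefficient functions `f` and `g` are hypothetical examples, not anything prescribed by the text:

```python
import numpy as np
from scipy.integrate import quad

# Hypothetical affine-drift SDE coefficients: dx = f(t) x dt + g(t) dw.
f = lambda t: -0.5 * t          # example drift coefficient
g = lambda t: np.sqrt(t)        # example diffusion coefficient

def s(t):
    # s(t) = exp(integral_0^t f(xi) dxi)
    return np.exp(quad(f, 0.0, t)[0])

def sigma(t):
    # sigma(t) = sqrt(integral_0^t g(xi)^2 / s(xi)^2 dxi)
    return np.sqrt(quad(lambda xi: g(xi) ** 2 / s(xi) ** 2, 0.0, t)[0])

# Perturbation kernel at time t: N(x(t); s(t) x(0), s(t)^2 sigma(t)^2 I)
print(s(1.0), sigma(1.0))
```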

where $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is called the drift coefficient of the SDE, $g(t) \in \mathbb{R}$ is called the diffusion coefficient, and $\mathbf{w}$ represents the standard Brownian motion.
💡
$\mathbf{f}(\mathbf{x}, t)\,dt$ is the drift term; $g(t)\,d\mathbf{w}$ is the diffusion term.
You can understand an SDE as a stochastic generalization of ordinary differential equations (ODEs). Particles moving according to an SDE not only follow the deterministic drift $\mathbf{f}(\mathbf{x}, t)$, but are also affected by the random noise coming from $g(t)\, d\mathbf{w}$. From now on, we use $p_t(\mathbf{x})$ to denote the distribution of $\mathbf{x}(t)$.
For score-based generative modeling, we will choose a diffusion process such that $p_0 = p_{\text{data}}$ and $p_T = \pi$. Here $p_0$ is the data distribution, for which we have a dataset of i.i.d. samples, and $\pi$ is the prior distribution, which has a tractable form and is easy to sample from. The noise perturbation by the diffusion process is large enough to ensure $p_T$ does not depend on $p_0$.

Reversing the Diffusion Process Yields Score-Based Generative Models

By starting from a sample from the prior distribution $\pi$ and reversing the diffusion process, we will be able to obtain a sample from the data distribution $p_0$. Crucially, the reverse process is a diffusion process running backwards in time. It is given by the following reverse-time SDE:
$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t)\,\nabla_\mathbf{x} \log p_t(\mathbf{x})\right] dt + g(t)\, d\bar{\mathbf{w}},$$
where $\bar{\mathbf{w}}$ is a Brownian motion in the reverse time direction, and $dt$ represents an infinitesimal negative time step. This reverse SDE can be computed once we know the drift and diffusion coefficients of the forward SDE, as well as the score of $p_t(\mathbf{x})$ for each $t \in [0, T]$.
The overall intuition of score-based generative modeling with SDEs can be summarized in the illustration below
[Figure: the forward SDE perturbs data into noise; the reverse-time SDE, driven by the score, transforms noise back into data.]

Score Estimation

Based on the above intuition, we can use the time-dependent score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ to construct the reverse-time SDE, and then solve it numerically to obtain samples from $p_0$ using samples from a prior distribution $\pi$. We can train a time-dependent score-based model $s_\theta(\mathbf{x}, t)$ to approximate $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, using the following weighted sum of denoising score matching objectives:
$$\min_\theta \; \mathbb{E}_{t \sim \mathcal{U}(0, T)}\Big[\lambda(t)\, \mathbb{E}_{\mathbf{x}(0) \sim p_0}\, \mathbb{E}_{\mathbf{x}(t) \sim p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))}\big[\big\| s_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) \big\|_2^2\big]\Big],$$
where $\mathcal{U}(0, T)$ is a uniform distribution over $[0, T]$, $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$ denotes the transition probability from $\mathbf{x}(0)$ to $\mathbf{x}(t)$, and $\lambda(t) \in \mathbb{R}_{>0}$ denotes a positive weighting function.
In the objective, the expectation over $\mathbf{x}(0)$ can be estimated with empirical means over data samples from $p_0$. The expectation over $\mathbf{x}(t)$ can be estimated by sampling from $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$, which is efficient when the drift coefficient $\mathbf{f}(\mathbf{x}, t)$ is affine. The weighting function $\lambda(t)$ is typically chosen to be inversely proportional to $\mathbb{E}\big[\big\|\nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))\big\|_2^2\big]$.
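Anticipating the Gaussian perturbation kernels chosen in the next section, here is a minimal PyTorch sketch of this objective. `model` is a score network $s_\theta(\mathbf{x}, t)$, `marginal_prob_std` is assumed to return the standard deviation of $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$, and data is assumed to be flattened vectors; all three are assumptions for illustration:

```python
import torch

def loss_fn(model, x, marginal_prob_std, eps=1e-5):
    """Weighted denoising score matching loss, assuming Gaussian perturbation
    kernels p_0t(x(t)|x(0)) = N(x(t); x(0), std(t)^2 I) and lambda(t) = std(t)^2."""
    # Sample t ~ Uniform(eps, 1); eps avoids numerical issues near t = 0.
    t = torch.rand(x.shape[0], device=x.device) * (1.0 - eps) + eps
    z = torch.randn_like(x)                  # noise used to perturb the data
    std = marginal_prob_std(t)               # std of the perturbation kernel at time t
    perturbed_x = x + z * std[:, None]       # sample x(t) ~ p_0t(x(t)|x(0))
    score = model(perturbed_x, t)            # s_theta(x(t), t)
    # The kernel's score is -z/std; with lambda(t) = std^2 the 1/std^2 factor
    # cancels, leaving this simple residual form.
    return torch.mean(torch.sum((score * std[:, None] + z) ** 2, dim=1))
```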

Training with Weighted Sum of Denoising Score Matching Objectives

Now let's get our hands dirty with training. First of all, we need to specify an SDE that perturbs the data distribution $p_0$ into a prior distribution $p_T$. We choose the following SDE:
$$d\mathbf{x} = \sigma^t\, d\mathbf{w}, \quad t \in [0, 1],$$
💡
This SDE only diffuses the variance of the data; it does not change the mean.
In this case,
$$p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) = \mathcal{N}\left(\mathbf{x}(t);\; \mathbf{x}(0),\; \frac{1}{2\ln\sigma}\left(\sigma^{2t} - 1\right)\mathbf{I}\right),$$
💡
This follows from the perturbation-kernel formula above: with $f(t) = 0$ we have $s(t) = 1$, so $\sigma(t)^2 = \int_0^t (\sigma^\xi)^2\, d\xi = \frac{1}{2\ln\sigma}\left(\sigma^{2t} - 1\right)$.
and we can choose the weighting function $\lambda(t) = \frac{1}{2\ln\sigma}\left(\sigma^{2t} - 1\right)$.
When $\sigma$ is large, the prior distribution $p_{t=1}$ is
$$\int p_0(\mathbf{y})\,\mathcal{N}\left(\mathbf{x};\; \mathbf{y},\; \frac{1}{2\ln\sigma}\left(\sigma^{2} - 1\right)\mathbf{I}\right) d\mathbf{y} \approx \mathcal{N}\left(\mathbf{x};\; \mathbf{0},\; \frac{1}{2\ln\sigma}\left(\sigma^{2} - 1\right)\mathbf{I}\right),$$
which is approximately independent of the data distribution and is easy to sample from.
Intuitively, this SDE captures a continuum of Gaussian perturbations with variance function $\frac{1}{2\ln\sigma}\left(\sigma^{2t} - 1\right)$. This continuum of perturbations allows us to gradually transfer samples from the data distribution $p_0$ to the simple Gaussian distribution $p_1$.
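For this SDE, the perturbation-kernel standard deviation and the diffusion coefficient have the closed forms derived above. A short sketch (the value $\sigma = 25$ is a hypothetical choice), matching what the `loss_fn` sketch earlier assumed:

```python
import math
import torch

sigma = 25.0  # hypothetical choice; larger sigma makes p_1 more data-independent

def marginal_prob_std(t):
    """Standard deviation of p_0t(x(t)|x(0)) for dx = sigma^t dw:
    std(t) = sqrt((sigma^(2t) - 1) / (2 ln sigma))."""
    t = torch.as_tensor(t)
    return torch.sqrt((sigma ** (2.0 * t) - 1.0) / (2.0 * math.log(sigma)))

def diffusion_coeff(t):
    """Diffusion coefficient g(t) = sigma^t of the forward SDE."""
    return sigma ** torch.as_tensor(t)
```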

Sampling with Numerical SDE Solvers

Recall that for any SDE of the form
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w},$$
the reverse-time SDE is given by
$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t)\,\nabla_\mathbf{x}\log p_t(\mathbf{x})\right] dt + g(t)\, d\bar{\mathbf{w}}.$$
Since we have chosen the forward SDE to be
$$d\mathbf{x} = \sigma^t\, d\mathbf{w}, \quad t \in [0, 1],$$
the reverse-time SDE is given by
$$d\mathbf{x} = -\sigma^{2t}\,\nabla_\mathbf{x}\log p_t(\mathbf{x})\, dt + \sigma^t\, d\bar{\mathbf{w}}.$$
To sample from our time-dependent score-based model $s_\theta(\mathbf{x}, t)$, we first draw a sample from the prior distribution $p_1 \approx \mathcal{N}\left(\mathbf{x};\; \mathbf{0},\; \frac{1}{2\ln\sigma}\left(\sigma^{2} - 1\right)\mathbf{I}\right)$, and then solve the reverse-time SDE with numerical methods.
In particular, using our time-dependent score-based model, the reverse-time SDE can be approximated by
$$d\mathbf{x} = -\sigma^{2t}\, s_\theta(\mathbf{x}, t)\, dt + \sigma^t\, d\bar{\mathbf{w}}.$$
Next, one can use numerical methods to solve the reverse-time SDE, such as the Euler-Maruyama approach. It is based on a simple discretization of the SDE, replacing $dt$ with $\Delta t$ and $d\mathbf{w}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, g^2(t)\,\Delta t\, \mathbf{I})$. When applied to our reverse-time SDE, we obtain the following iteration rule:
$$\mathbf{x}_{t - \Delta t} = \mathbf{x}_t + \sigma^{2t}\, s_\theta(\mathbf{x}_t, t)\, \Delta t + \sigma^t \sqrt{\Delta t}\, \mathbf{z}_t,$$
where $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
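A minimal sketch of an Euler-Maruyama sampler under the assumptions above (`score_model`, `marginal_prob_std`, and `diffusion_coeff` are the hypothetical pieces sketched earlier; data is assumed to be flattened vectors):

```python
import torch

def euler_maruyama_sampler(score_model, marginal_prob_std, diffusion_coeff,
                           batch_size=64, dim=2, num_steps=500, device='cpu', eps=1e-3):
    """Integrate the reverse-time SDE with the Euler-Maruyama rule:
    x_{t-dt} = x_t + g(t)^2 s_theta(x_t, t) dt + g(t) sqrt(dt) z_t."""
    t = torch.ones(batch_size, device=device)
    # Initialize from the prior p_1 ~= N(0, std(1)^2 I).
    x = torch.randn(batch_size, dim, device=device) * marginal_prob_std(t)[:, None]
    time_steps = torch.linspace(1.0, eps, num_steps, device=device)
    step_size = time_steps[0] - time_steps[1]
    with torch.no_grad():
        for time_step in time_steps:
            batch_t = torch.ones(batch_size, device=device) * time_step
            g = diffusion_coeff(batch_t)
            mean_x = x + (g ** 2)[:, None] * score_model(x, batch_t) * step_size
            x = mean_x + torch.sqrt(step_size) * g[:, None] * torch.randn_like(x)
    return mean_x  # return the final mean (no noise added at the last step)
```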

Sampling with Predictor-Corrector Methods

Aside from generic numerical SDE solvers, we can leverage special properties of our reverse-time SDE for better solutions. Since we have an estimate of the score of $p_t(\mathbf{x})$ via the score-based model, i.e., $s_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x}\log p_t(\mathbf{x})$, we can leverage score-based MCMC approaches, such as Langevin MCMC, to correct the solution obtained by numerical SDE solvers.
Score-based MCMC approaches can produce samples from a distribution $p(\mathbf{x})$ once its score $\nabla_\mathbf{x}\log p(\mathbf{x})$ is known. For example, Langevin MCMC operates by running the following iteration rule for $i = 1, 2, \dots, N$:
$$\mathbf{x}_{i+1} = \mathbf{x}_i + \epsilon\,\nabla_\mathbf{x}\log p(\mathbf{x}_i) + \sqrt{2\epsilon}\,\mathbf{z}_i,$$
where $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\epsilon > 0$ is the step size, and $\mathbf{x}_1$ is initialized from any prior distribution $\pi(\mathbf{x})$. When $N \to \infty$ and $\epsilon \to 0$, the final value $\mathbf{x}_{N+1}$ becomes a sample from $p(\mathbf{x})$ under some regularity conditions. Therefore, given $s_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x}\log p_t(\mathbf{x})$, we can get an approximate sample from $p_t(\mathbf{x})$ by running several steps of Langevin MCMC, replacing $\nabla_\mathbf{x}\log p(\mathbf{x})$ with $s_\theta(\mathbf{x}, t)$ in the iteration rule.
Predictor-Corrector samplers combine a numerical solver for the reverse-time SDE with the Langevin MCMC approach. In particular, we first apply one step of the numerical SDE solver to obtain $\mathbf{x}_{t - \Delta t}$ from $\mathbf{x}_t$, which is called the "predictor" step. Next, we apply several steps of Langevin MCMC to refine $\mathbf{x}_{t - \Delta t}$, such that $\mathbf{x}_{t - \Delta t}$ becomes a more accurate sample from $p_{t - \Delta t}(\mathbf{x})$. This is the "corrector" step, as the MCMC helps reduce the error of the numerical SDE solver.
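A sketch of a Predictor-Corrector sampler under the same assumptions as before; here the Langevin step size is set from a signal-to-noise heuristic (`snr`), which is a common implementation choice rather than part of the derivation:

```python
import torch

def pc_sampler(score_model, marginal_prob_std, diffusion_coeff,
               batch_size=64, dim=2, num_steps=500, snr=0.16, device='cpu', eps=1e-3):
    """Predictor-Corrector sampling: one Langevin MCMC corrector step using the
    learned score, followed by one Euler-Maruyama predictor step, per time step."""
    t = torch.ones(batch_size, device=device)
    x = torch.randn(batch_size, dim, device=device) * marginal_prob_std(t)[:, None]
    time_steps = torch.linspace(1.0, eps, num_steps, device=device)
    step_size = time_steps[0] - time_steps[1]
    with torch.no_grad():
        for time_step in time_steps:
            batch_t = torch.ones(batch_size, device=device) * time_step
            # Corrector: Langevin MCMC, x <- x + eps s_theta(x, t) + sqrt(2 eps) z,
            # with eps chosen from a signal-to-noise heuristic.
            grad = score_model(x, batch_t)
            grad_norm = torch.norm(grad.reshape(grad.shape[0], -1), dim=-1).mean()
            noise_norm = torch.sqrt(torch.tensor(float(x.shape[1])))
            langevin_step = 2 * (snr * noise_norm / grad_norm) ** 2
            x = x + langevin_step * grad + torch.sqrt(2 * langevin_step) * torch.randn_like(x)
            # Predictor: one Euler-Maruyama step of the reverse-time SDE.
            g = diffusion_coeff(batch_t)
            mean_x = x + (g ** 2)[:, None] * score_model(x, batch_t) * step_size
            x = mean_x + torch.sqrt(g ** 2 * step_size)[:, None] * torch.randn_like(x)
    return mean_x
```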

Sampling with Numerical ODE Solvers

For any SDE of the form
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w},$$
there exists an associated ordinary differential equation (ODE)
$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g^2(t)\,\nabla_\mathbf{x}\log p_t(\mathbf{x})\right] dt,$$
such that their trajectories have the same marginal probability density $p_t(\mathbf{x})$. Therefore, by solving this ODE in the reverse time direction, we can sample from the same distribution as by solving the reverse-time SDE. We call this ODE the probability flow ODE.
Below is a schematic figure showing how trajectories from this probability flow ODE differ from SDE trajectories, while still sampling from the same distribution.
[Figure: trajectories of the probability flow ODE compared with reverse-time SDE trajectories; both share the same marginal distributions.]
Therefore, we can start from a sample from $p_1$, integrate the ODE in the reverse time direction, and then get a sample from $p_0$. In particular, for the SDE in our running example, we can integrate the following ODE from $t = 1$ to $t = 0$ for sample generation:
$$d\mathbf{x} = -\frac{1}{2}\sigma^{2t}\, s_\theta(\mathbf{x}, t)\, dt.$$
This can be done using many black-box ODE solvers provided by packages such as scipy.
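For example, here is a sketch using scipy's black-box `solve_ivp` under the same assumptions as the earlier samplers (the tolerances and the RK45 method are illustrative choices):

```python
import torch
from scipy import integrate

def ode_sampler(score_model, marginal_prob_std, diffusion_coeff,
                batch_size=64, dim=2, device='cpu', eps=1e-3):
    """Sample by solving the probability flow ODE dx = -(1/2) g(t)^2 s_theta(x, t) dt
    from t = 1 down to t = eps with a black-box solver."""
    t = torch.ones(batch_size, device=device)
    init_x = torch.randn(batch_size, dim, device=device) * marginal_prob_std(t)[:, None]

    def ode_func(t, x_flat):
        # scipy works on flat float64 arrays; convert to a torch batch and back.
        x = torch.tensor(x_flat, dtype=torch.float32, device=device).reshape(batch_size, dim)
        batch_t = torch.ones(batch_size, device=device) * float(t)
        g = diffusion_coeff(batch_t)
        with torch.no_grad():
            dx = -0.5 * (g ** 2)[:, None] * score_model(x, batch_t)
        return dx.cpu().numpy().reshape(-1)

    # Integrate backwards in time, from t = 1 down to t = eps.
    res = integrate.solve_ivp(ode_func, (1.0, eps), init_x.cpu().numpy().reshape(-1),
                              rtol=1e-5, atol=1e-5, method='RK45')
    return torch.tensor(res.y[:, -1], dtype=torch.float32).reshape(batch_size, dim)
```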
