q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})
$$

Using Bayes' rule, we have:

$$
\begin{aligned}
q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)
&= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\
&\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\
&= \exp \Big(-\frac{1}{2} \big( \big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\big) \mathbf{x}_{t-1}^2 - \big(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\big) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big)
\end{aligned}
$$

where $C(\mathbf{x}_t, \mathbf{x}_0)$ collects the terms that do not involve $\mathbf{x}_{t-1}$. Completing the square over $\mathbf{x}_{t-1}$ gives

$$
\tilde{\beta}_t = 1 \Big/ \big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\big) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \cdot \beta_t
\quad\text{and}\quad
\tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0
$$
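
As a quick numerical sanity check, the two equivalent forms of the posterior variance, $\tilde{\beta}_t = \big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\big)^{-1} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$, can be compared on a concrete schedule (a minimal sketch; the linear $\beta$ schedule below is only an illustrative assumption):

```python
import numpy as np

# Illustrative linear beta schedule (an assumption for this sketch).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary interior timestep (0-indexed)

# Closed form: tilde_beta_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
tilde_beta = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
# Precision form obtained by completing the square:
tilde_beta_precision = 1.0 / (alphas[t] / betas[t] + 1.0 / (1 - alpha_bars[t - 1]))

assert np.isclose(tilde_beta, tilde_beta_precision)
```

Note that $\tilde{\beta}_t < \beta_t$: conditioning on $\mathbf{x}_0$ shrinks the reverse-step variance.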

It is also straightforward to get the same result using Jensen's inequality. Say we want to minimize the cross entropy as the learning objective,

$$
\begin{aligned}
L_\text{CE}
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \\
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \Big) \\
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} d\mathbf{x}_{1:T} \Big) \\
&\leq - \mathbb{E}_{q(\mathbf{x}_{0:T})} \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log \frac{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{0:T})} \Big] = L_\text{VLB}
\end{aligned}
$$
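
The single inequality step in this argument is Jensen's inequality applied to the convex function $-\log$: $-\log \mathbb{E}[X] \leq \mathbb{E}[-\log X]$ for any positive random variable $X$. A tiny numeric illustration (the uniform distribution below is an arbitrary stand-in):

```python
import numpy as np

# Jensen's inequality for the convex function -log:
#   -log E[X] <= E[-log X]  for any positive random variable X.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=100_000)  # arbitrary positive samples

lhs = -np.log(x.mean())     # -log of the expectation
rhs = (-np.log(x)).mean()   # expectation of -log
assert lhs <= rhs
```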

Let's label each component in the variational lower bound loss separately:

$$
\begin{aligned}
L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\
\text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\
L_t &= D_\text{KL}(q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_t \vert \mathbf{x}_{t+1})) \text{ for } 1 \leq t \leq T-1 \\
L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)
\end{aligned}
$$

#### Connection with noise-conditioned score networks (NCSN)

[Song & Ermon (2019)](https://arxiv.org/abs/1907.05600) proposed a score-based generative modeling method where samples are produced via [Langevin dynamics](/posts/diffusion-models/#connection-with-stochastic-gradient-langevin-dynamics) using gradients of the data distribution estimated with score matching. The score of each sample $\mathbf{x}$'s probability density is defined as its gradient $\nabla_{\mathbf{x}} \log q(\mathbf{x})$. A score network $\mathbf{s}_\theta: \mathbb{R}^D \to \mathbb{R}^D$ is trained to estimate it, $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log q(\mathbf{x})$.

To make it scalable with high-dimensional data in the deep learning setting, they proposed to use either *denoising score matching* ([Vincent, 2011](http://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf)) or *sliced score matching* (use random projections; [Song et al., 2019](https://arxiv.org/abs/1905.07088)). Denoising score matching adds a pre-specified small noise to the data $q(\tilde{\mathbf{x}} \vert \mathbf{x})$ and estimates $q(\tilde{\mathbf{x}})$ with score matching.
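
As a concrete sketch of the quantity being matched: for Gaussian corruption $q_\sigma(\tilde{\mathbf{x}} \vert \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$, the conditional score has the closed form $-(\tilde{\mathbf{x}} - \mathbf{x})/\sigma^2$, which is the regression target in denoising score matching. A 1-D finite-difference check (all values below are arbitrary):

```python
import numpy as np

# For q(x_tilde | x) = N(x_tilde; x, sigma^2), the conditional score is
#   grad_{x_tilde} log q(x_tilde | x) = -(x_tilde - x) / sigma^2
sigma = 0.5
x, x_tilde = 1.0, 1.3  # clean point and its noisy version (arbitrary)

def log_q(xt: float) -> float:
    """Log-density of the Gaussian corruption kernel at xt."""
    return -0.5 * ((xt - x) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

analytic = -(x_tilde - x) / sigma**2
h = 1e-6
numeric = (log_q(x_tilde + h) - log_q(x_tilde - h)) / (2 * h)  # central difference
assert np.isclose(analytic, numeric, atol=1e-4)
```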

It is very slow to generate a sample from DDPM by following the Markov chain of the reverse diffusion process, as $T$ can be up to one or a few thousand steps.
One simple way is to run a strided sampling schedule ([Nichol & Dhariwal, 2021](https://arxiv.org/abs/2102.09672)) by taking the sampling update every $\lceil T/S \rceil$ steps to reduce the process from $T$ to $S$ steps. The new sampling schedule for generation is $\{\tau_1, \dots, \tau_S\}$ where $\tau_1 < \tau_2 < \dots < \tau_S \in [1, T]$ and $S < T$.
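
This schedule can be sketched in a few lines (a minimal sketch; rounding and offset conventions differ across implementations):

```python
import math

def strided_schedule(T: int, S: int) -> list[int]:
    """Keep every ceil(T/S)-th timestep to cut T sampling steps down to ~S."""
    stride = math.ceil(T / S)
    return list(range(1, T + 1, stride))

taus = strided_schedule(T=1000, S=10)  # [1, 101, 201, ..., 901]
```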

For another approach, let's rewrite $q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ to be parameterized by a desired standard deviation $\sigma_t$ according to the nice property:

$$
\begin{aligned}
\mathbf{x}_{t-1} &= \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \boldsymbol{\epsilon}_t + \sigma_t \boldsymbol{\epsilon} \\
&= \sqrt{\bar{\alpha}_{t-1}} \Big( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_t}{\sqrt{\bar{\alpha}_t}} \Big) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \boldsymbol{\epsilon}_t + \sigma_t \boldsymbol{\epsilon} \\
q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= \mathcal{N}\Big(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \Big( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_t}{\sqrt{\bar{\alpha}_t}} \Big) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \boldsymbol{\epsilon}_t, \sigma_t^2 \mathbf{I}\Big)
\end{aligned}
$$

Let $\sigma_t^2 = \eta \cdot \tilde{\beta}_t$ such that we can adjust $\eta \in \mathbb{R}^+$ as a hyperparameter to control the sampling stochasticity. The special case of $\eta = 0$ makes the sampling process _deterministic_. Such a model is named the _denoising diffusion implicit model_ (**DDIM**; [Song et al., 2020](https://arxiv.org/abs/2010.02502)). DDIM has the same marginal noise distribution but deterministically maps noise back to the original data samples.
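
The role of $\eta$ can be made concrete with a short sketch (the linear $\beta$ schedule below is an illustrative assumption, not from the post):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sigma_t(eta: float, t: int) -> float:
    """sigma_t^2 = eta * tilde_beta_t: eta=0 gives deterministic DDIM sampling,
    eta=1 recovers the DDPM posterior variance tilde_beta_t."""
    tilde_beta = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
    return float(np.sqrt(eta * tilde_beta))

assert sigma_t(0.0, t=500) == 0.0  # DDIM: no sampling noise injected
assert sigma_t(1.0, t=500) > 0.0   # DDPM-like stochastic sampling
```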

During generation, we don't have to follow the whole chain $t=1,\dots,T$, but rather a subset of steps. Let's denote $s < t$ as two steps in this accelerated trajectory. The DDIM update step is:

$$
q_{\sigma, s < t}(\mathbf{x}_s \vert \mathbf{x}_t, \mathbf{x}_0)
= \mathcal{N}\Big(\mathbf{x}_s; \sqrt{\bar{\alpha}_s} \Big( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_t}{\sqrt{\bar{\alpha}_t}} \Big) + \sqrt{1 - \bar{\alpha}_s - \sigma_t^2} \boldsymbol{\epsilon}_t, \sigma_t^2 \mathbf{I}\Big)
$$
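
Put together, one accelerated update from $\mathbf{x}_t$ to $\mathbf{x}_s$ might look like the sketch below. This is not a reference implementation: the $\beta$ schedule is an assumption, and the random `eps` merely stands in for the trained network's noise prediction $\boldsymbol{\epsilon}_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative schedule (assumption)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, t, s, sigma=0.0):
    """One accelerated update x_t -> x_s for s < t (sigma=0 is the eta=0 case).
    `eps_pred` plays the role of the trained network's noise prediction."""
    a_t, a_s = alpha_bars[t], alpha_bars[s]
    x0_pred = (x_t - np.sqrt(1 - a_t) * eps_pred) / np.sqrt(a_t)  # predicted x_0
    mean = np.sqrt(a_s) * x0_pred + np.sqrt(1 - a_s - sigma**2) * eps_pred
    return mean + sigma * rng.standard_normal(x_t.shape)

x_t = rng.standard_normal(4)
eps = rng.standard_normal(4)  # stand-in for the model output
x_s = ddim_step(x_t, eps, t=900, s=800)
```

With `sigma=0` the update is a deterministic map, matching the DDIM behavior described above.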

Cited as:

> Weng, Lilian. (Jul 2021). What are diffusion models? Lil'Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.

Or