Can we generate images in the same way as an autoregressive language model generates text?
Although this sounds simpler than diffusion models, we still need to deal with many computational cost problems. But don’t worry too much, there are several brilliant methods that try to make this idea more competitive.
Taming Transformer -> Patrick Esser, et al. CVPR 2021
The key challenge of autoregressive generation is how to handle the cost of self-attention, which grows quadratically with sequence length, on image sequences that are much longer than text sequences.
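To get a feel for the scale of the problem, here is a rough sketch of the arithmetic. All sizes are assumptions chosen for illustration (a 512-token text, a 256x256 image with one token per pixel, and a hypothetical 16x16 grid of tokens standing in for a compressed image); `attention_pairs` is a helper made up for this example.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in one full self-attention map."""
    return seq_len * seq_len


# Assumed, illustrative sequence lengths (not taken from any paper).
text_len = 512           # tokens in a fairly long text
pixel_len = 256 * 256    # one token per pixel of a 256x256 image
grid_len = 16 * 16       # one token per cell of a hypothetical 16x16 grid

for name, n in [("text", text_len), ("pixels", pixel_len), ("16x16 grid", grid_len)]:
    print(f"{name:>10}: length {n:>6}, attention pairs {attention_pairs(n):,}")

# How much cheaper one attention map is on the short grid than on raw pixels.
ratio = attention_pairs(pixel_len) // attention_pairs(grid_len)
print("pixels / grid cost ratio:", ratio)
```

Under these assumptions, shortening the sequence by a factor of 256 cuts the cost of one attention map by a factor of 256 squared, which is why reducing the sequence length matters so much here.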
Controllable Text-To-Image (T2I) generation has always been a major challenge in diffusion models. On the one hand, people hope that the generated images can follow some predefined physical attributes, such as the number, position, size, and texture of objects. On the other hand, they also require the T2I models to retain a certain level of creativity.
At present, there is quite a lot of research on controllable T2I generation. I prefer to divide it into two categories: one primarily focuses on correcting the generation path at inference time, called Explicit Control; the other strengthens the network through fine-tuning or adding new layers, called Implicit Control.
Guided Generation Hybrid-condition by Fine-Tuning
StabilityAI has recently open-sourced a series of foundational models for image generation, called Stable Diffusion. Although we know these models are based on latent diffusion, few reports describe their detailed designs. To facilitate better understanding and potential future improvement, this blog provides some information about the designs of the Unet and VAE, which are key components of the magic generation.
Unet Fig. 1: Overview of the Unet in Stable Diffusion 1.
DDRM -> Bahjat Kawar, et al. NeurIPS, 2022.
Illustration of DDRM (source: paper)
Transformation via SVD
Similar to SNIPS, DDRM considers the singular value decomposition (SVD) of the sampling matrix $H$ as follows:
$$
\begin{aligned}
y&=Hx+z\\
y&=U\Sigma V^\top x+z\\
\Sigma^{†} U^{\top}y&=V^\top x+\Sigma^{†} U^{\top}z\\
\bar{y}&=\bar{x}+\bar{z}
\end{aligned}
$$
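The chain of equalities above can be checked numerically. The snippet below is a minimal NumPy sketch, not the DDRM implementation: the matrix sizes, the noise level `sigma_y`, and the random `H` are all assumptions made for illustration. It uses the thin SVD, so $\bar{x}$ here contains only the components of $V^\top x$ that correspond to nonzero singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 6, 4        # assumed sizes: x in R^n, y in R^m
sigma_y = 0.1      # assumed standard deviation of the measurement noise z

H = rng.standard_normal((m, n))        # hypothetical sampling matrix
x = rng.standard_normal(n)
z = sigma_y * rng.standard_normal(m)
y = H @ x + z

# Thin SVD: H = U @ diag(s) @ Vt, with U (m, k), s (k,), Vt (k, n), k = min(m, n).
U, s, Vt = np.linalg.svd(H, full_matrices=False)

# Applying the pseudo-inverse of Sigma times U^T means: project with U^T,
# then divide each component by its singular value.
y_bar = (U.T @ y) / s
x_bar = Vt @ x
z_bar = (U.T @ z) / s

# In the spectral space the measurement is the signal plus rescaled noise.
assert np.allclose(y_bar, x_bar + z_bar)

# Each noise component is divided by s_i, so its standard deviation is sigma_y / s_i.
noise_std = sigma_y / s
print("singular values:", np.round(s, 3))
print("per-component noise std:", np.round(noise_std, 3))
```

Small singular values inflate the corresponding noise components, so the directions of $x$ that $H$ barely measures are also the ones observed least reliably.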
Since $U$ is an orthogonal matrix, we have $p(U^\top z) = p(z) = \mathcal{N}(0,\sigma^2_y I)$, which gives $\bar{z}^{(i)}=(\Sigma^{†} U^{\top}z)^{(i)} \sim \mathcal{N}(0, \frac{\sigma^2_y}{s_i^2})$, where $s_i$ is the $i$-th singular value of $H$. After this transformation, $\bar{x}=V^\top x$ and $\bar{y}=\Sigma^{†} U^{\top}y$ live in the same space (the spectral space) and differ only by the noise $\bar{z}$, which can be drawn as follows: