Controllable Text-To-Image (T2I) generation has always been a major challenge in diffusion models. On the one hand, people hope that the generated images can follow some predefined physical attributes, such as the number, position, size, and texture of objects. On the other hand, they also require the T2I models to retain a certain level of creativity.
At present, there are quite a lot of researches related to controllable T2I generation. I prefer to divide them into two categories: one primarily focuses on correcting the generation path in inference, called Explicit Control; the other one strengthens the network through fine-tuning or adding new layers, called Implicit Control.
Guided Generation Hybrid-condition by Fine-Tuning
StabilityAI has recently open sourced a series of foundational models for image generation, called Stable Diffusion. Although we know these models are based on latent diffusion, there are few reports mention their detailed designs. To facilitate better understanding and potential future improvement, this blog provide some information about the designs of Unet and VAE, which are key components of the magic generation.
Unet Fig. 1: Overall of the Unet in Stable Diffusion 1.
DDRM -> Bahjat Kawar, et al. NeurIPS, 2022. Illustration of DDRM (source from paper) Transformation via SVD Similar to SNIPS, DDRM consider the singular value decomposition (SVD) of the sampling matrix $H$ as follows:
$$ \begin{aligned} y&=Hx+z\ y&=U\Sigma V^\top x+z\ \Sigma^{†} U^{\top}y&=V^\top x+\Sigma^{†} U^{\top}z\ \bar{y}&=\bar{x}+\bar{z}\ \end{aligned} $$
Since $U$ is orthogonal matrix, we have $p(U^\top z) = p(z) = \mathcal{N}(0,\sigma^2_y I)$, resulting $\bar{z}^{(i)}=(\Sigma^{†} U^{\top}z)^{(i)} \sim \mathcal{N}(0, \frac{\sigma^2_y}{s_i^2}I)$. So after these, we transform $x$ and $y$ into the same field (spectral space), and these two only differ by the noise $\bar{z}$, which can be drawn as follows:
“An inverse problem seeks to recover an unknown signal from a set of observed measurements. Specifically, suppose $x\in R^n$ is an unknown signal, and $y\in R^m = Ax+z$ is a noisy observation given by m linear measurements, where the measurement acquisition process is represented by a linear operator $A\in R^{m\times n}$, and $z\in R^n$ represents a noise vector. Solving a linear inverse problem amounts to recovering the signal $x$ from its measurement $y$.