Yu's MemoCapsule

Text2Image

Generating Images Like Texts

Can we generate images the same way an autoregressive language model generates text? Although this sounds simpler than diffusion models, it comes with serious computational-cost problems. But don't worry too much: there are several brilliant methods that try to make this idea more competitive.

Taming Transformers -> Patrick Esser, et al., CVPR 2021
The key challenge of autoregressive generation is how to handle the quadratically increasing cost of image sequences, which are far longer than text sequences.
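The Taming Transformers answer, in one picture: compress the image with a VQ encoder into a short grid of discrete codebook indices, then model that grid with a causal transformer exactly like text. Below is a minimal PyTorch sketch of that second stage; the codebook size, grid size, and start token are illustrative assumptions, not the paper's settings, and the model is untrained here, so the sampled output is random. A real pipeline would feed the sampled indices to a VQGAN decoder to get pixels.

```python
# Sketch: autoregressive sampling over discrete image tokens. A VQ encoder
# is assumed to have compressed a 256x256 image into a 16x16 = 256-token
# sequence instead of 65,536 raw pixels; this is what tames the quadratic
# attention cost.
import torch
import torch.nn as nn

VOCAB = 1024   # hypothetical codebook size
SEQ_LEN = 256  # hypothetical 16x16 latent grid

class TinyTokenTransformer(nn.Module):
    def __init__(self, vocab=VOCAB, dim=128, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(SEQ_LEN, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.body = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Causal mask: each token may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        return self.head(self.body(x, mask=mask))  # (B, T, vocab)

@torch.no_grad()
def sample(model, steps=SEQ_LEN):
    idx = torch.zeros(1, 1, dtype=torch.long)  # start token (assumption)
    for _ in range(steps - 1):
        logits = model(idx)[:, -1]                      # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)  # sample one code
        idx = torch.cat([idx, nxt], dim=1)
    return idx  # codebook indices; a VQ decoder would map these to pixels

tokens = sample(TinyTokenTransformer().eval())
print(tokens.shape)  # torch.Size([1, 256])
```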

Controllable Text-To-Image Diffusion Models: Explicit Control

Controllable Text-To-Image (T2I) generation has always been a major challenge for diffusion models. On the one hand, people hope the generated images will follow predefined physical attributes, such as the number, position, size, and texture of objects. On the other hand, they also expect T2I models to retain a certain level of creativity. There is now a large body of research on controllable T2I generation. I prefer to divide it into two categories: one primarily focuses on correcting the generation path at inference time, which I call Explicit Control; the other strengthens the network through fine-tuning or added layers, which I call Implicit Control.
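As a concrete illustration of what Explicit Control means, here is a minimal PyTorch sketch of inference-time path correction: the sampler leaves the network weights untouched and instead adds the gradient of an external energy to the predicted noise at each step. Both `layout_energy` and the dummy `denoiser` are hypothetical stand-ins (a real method might derive the energy from cross-attention maps); this is not any specific paper's algorithm.

```python
import torch

def layout_energy(x_t):
    # Hypothetical energy: penalize signal in the right half of the latent,
    # standing in for a real attention- or layout-based loss.
    return (x_t[..., x_t.shape[-1] // 2:] ** 2).sum()

def guided_eps(denoiser, x_t, t, cond, scale=1.0):
    """Correct the predicted noise with the gradient of an external energy."""
    with torch.no_grad():
        eps = denoiser(x_t, t, cond)          # untouched network prediction
    x = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(layout_energy(x), x)[0]
    return eps + scale * grad                 # steer the sampling path

# Dummy denoiser so the sketch runs end to end.
denoiser = lambda x, t, c: torch.randn_like(x)
x_t = torch.randn(1, 4, 8, 8)                 # a latent at some timestep t
print(guided_eps(denoiser, x_t, t=10, cond=None, scale=0.5).shape)
```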

Network Design in Stable Diffusion

StabilityAI has recently open-sourced a series of foundational models for image generation, called Stable Diffusion. Although we know these models are based on latent diffusion, few reports describe their detailed designs. To facilitate better understanding and potential future improvement, this blog provides some information about the designs of the UNet and VAE, which are the key components of the magic generation.

Unet

Fig. 1: Overview of the UNet in Stable Diffusion 1.
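For readers who want to poke at these two components directly, the Hugging Face diffusers library exposes the UNet and VAE as standalone modules. A small sketch follows; the checkpoint id is one public SD 1.x repository, and any Stable Diffusion repo with `unet`/`vae` subfolders should work the same way.

```python
from diffusers import UNet2DConditionModel, AutoencoderKL

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae")

# The UNet denoises 4-channel latents conditioned on 768-dim text embeddings.
print(unet.config.in_channels, unet.config.cross_attention_dim)  # 4, 768
# The VAE maps RGB images to and from that 4-channel latent space,
# downsampling spatially by a factor of 8 across its encoder blocks.
print(vae.config.latent_channels, len(vae.config.down_block_types))  # 4, 4
```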