Beginner's Guide to How Stable Diffusion Works (Illustrative)
After experimenting with AI image generation, you may start to wonder how it works.
What is Diffusion Anyway?
Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.
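To make the step-by-step emergence concrete, here is a toy sketch (not the real sampler) of the core loop: start from pure noise, and at each step subtract a fraction of the predicted noise. The `predict_noise` function and the `target` tensor are hypothetical stand-ins for the trained noise predictor and the clean image it implicitly steers toward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": start from pure Gaussian noise.
x = rng.normal(size=(8, 8))

# Hypothetical clean target, standing in for the structure the
# trained noise predictor steers the sample toward.
target = np.ones((8, 8))

def predict_noise(x, target):
    # Toy stand-in for the trained noise predictor: here it simply
    # "sees" the gap between the current tensor and the clean image.
    return x - target

steps = 50
for _ in range(steps):
    noise_estimate = predict_noise(x, target)
    # Remove a fraction of the predicted noise each step.
    x = x - 0.1 * noise_estimate

# After enough steps, the tensor has moved from noise toward structure.
print(float(np.abs(x - target).mean()))
```

Early steps remove mostly noise; as the gap shrinks, the outline of the target dominates — which is the "outline emerges from the noise" effect described above.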
How diffusion works
Painting images by removing noise
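The training side of this can be sketched as well: take a clean image, pick a random noise level, add that much noise, and keep the noise as the label the predictor must learn to output. This is a minimal illustration of the recipe, with toy shapes and a hypothetical 50-level noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(image, num_levels=50):
    """Create one (noisy_image, noise, level) training triple.

    Sketch of the diffusion training recipe: sample a noise level,
    add that much noise to the image, and keep the noise itself as
    the label the noise predictor is trained to recover.
    """
    level = int(rng.integers(1, num_levels + 1))
    noise = rng.normal(size=image.shape)
    noisy = image + (level / num_levels) * noise
    return noisy, noise, level

image = rng.normal(size=(8, 8))
noisy, noise, level = make_training_example(image)
# A real training loss would compare predicted noise against `noise`.
print(noisy.shape, level)
```

Generation then runs this in reverse: repeatedly predict the noise in a sample and remove it, which is the "painting by removing noise" idea in the heading above.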
Speed Boost: Diffusion on Compressed (Latent) Data Instead of the Pixel Image
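The speed-up comes from the size difference: Stable Diffusion's autoencoder compresses a 512×512×3 pixel image into a 64×64×4 latent, and diffusion runs on that smaller tensor. The `toy_encode` below is not the real autoencoder — average pooling stands in for the learned encoder — but it shows the compression factor involved.

```python
import numpy as np

# Pixel-space image: 512x512 RGB.
pixels = np.zeros((512, 512, 3))

def toy_encode(image, factor=8, channels=4):
    # Stand-in for the autoencoder's encoder: downsample by 8x per
    # side and fake a 4-channel projection by averaging + repeating.
    h, w, _ = image.shape
    pooled = image.reshape(h // factor, factor, w // factor, factor, 3)
    pooled = pooled.mean(axis=(1, 3)).mean(axis=-1)
    return np.repeat(pooled[..., None], channels, axis=-1)

latent = toy_encode(pixels)
print(pixels.size, latent.size, pixels.size // latent.size)
```

The latent holds 48 times fewer values than the pixel image, so every denoising step touches far less data; a decoder maps the finished latent back to pixels at the end.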
The Text Encoder: A Transformer Language Model
|Larger/better language models have a significant effect on the quality of image generation models. Source: Google Imagen paper by Saharia et al., Figure A.5.|
The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It's possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (Nov 2022 update: true enough, Stable Diffusion V2 uses OpenCLIP). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.
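Whichever variant is used, the text encoder's job is the same: turn the prompt into a sequence of token embeddings (ClipText produces 768-dimensional vectors). A toy sketch of the very first stage — tokenize, pad, look up embeddings — with a hypothetical five-word vocabulary and toy dimensions; the real model then refines these vectors through its transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (ClipText uses a much larger
# vocabulary and 768-dimensional embeddings; sizes here are toy).
vocab = {"<pad>": 0, "a": 1, "paradise": 2, "cosmic": 3, "beach": 4}
embed_dim = 8
embedding_table = rng.normal(size=(len(vocab), embed_dim))

def encode(prompt, max_len=6):
    # Tokenize, pad to a fixed length, and look up embeddings --
    # the starting point the transformer text encoder refines.
    ids = [vocab[w] for w in prompt.split()]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return embedding_table[ids]

embeddings = encode("a cosmic beach")
print(embeddings.shape)  # one embedding vector per token slot
```

The output is a (tokens × dimensions) matrix, and it is this matrix — not a single vector — that the image generation process consumes.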
How CLIP is trained
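CLIP is trained on image–caption pairs: the image encoder and text encoder are pushed to give matching pairs high cosine similarity and mismatched pairs low similarity. A minimal NumPy sketch of that contrastive objective, with random vectors standing in for encoder outputs (pair *i* belongs together by construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy batch of 3 image embeddings and their 3 caption embeddings
# (stand-ins for encoder outputs; pair i belongs together).
image_emb = normalize(rng.normal(size=(3, 8)))
text_emb = normalize(image_emb + 0.1 * rng.normal(size=(3, 8)))

# Cosine-similarity matrix: entry [i, j] compares image i to text j.
sims = image_emb @ text_emb.T

def contrastive_loss(sims):
    # Each image should be most similar to its own caption, so the
    # diagonal entry must win its row under a softmax.
    exp = np.exp(sims)
    probs = np.diag(exp) / exp.sum(axis=1)
    return float(-np.log(probs).mean())

print(contrastive_loss(sims))
```

Minimizing this loss pulls matching image/text embeddings together and pushes the rest apart, which is what lets the text encoder's output later steer image generation.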
Feeding Text Information Into The Image Generation Process
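The mechanism that feeds the token embeddings into the noise predictor is attention: queries come from the image-side features, while keys and values come from the text embeddings, so each spatial position can pull in relevant prompt information. A sketch with random matrices standing in for the learned projection weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(image_features, text_embeddings):
    """Sketch of text conditioning via attention.

    Queries from image features; keys/values from text token
    embeddings. The projection matrices here are random stand-ins
    for learned weights.
    """
    d = image_features.shape[-1]
    Wq = rng.normal(size=(d, d))
    Wk = rng.normal(size=(text_embeddings.shape[-1], d))
    Wv = rng.normal(size=(text_embeddings.shape[-1], d))

    q = image_features @ Wq        # (positions, d)
    k = text_embeddings @ Wk       # (tokens, d)
    v = text_embeddings @ Wv       # (tokens, d)

    scores = q @ k.T / np.sqrt(d)  # each position scores each token
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v             # text-informed image features

image_features = rng.normal(size=(16, 8))  # 16 spatial positions
text_embeddings = rng.normal(size=(6, 8))  # 6 prompt tokens
out = cross_attention(image_features, text_embeddings)
print(out.shape)
```

The output has the same shape as the image-side features, so these attention layers can be dropped between the noise predictor's existing blocks.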
Layers of the Unet Noise predictor (without text)
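The overall shape of that noise predictor can be sketched too: a UNet runs a downsampling path, then an upsampling path, with skip connections carrying features from each down block to the matching up block. Here simple pooling and repetition stand in for the real ResNet blocks — this only illustrates the wiring, not the learned layers.

```python
import numpy as np

def unet_sketch(x):
    """Toy UNet skeleton (no text conditioning).

    Pooling/upsampling stand in for the real ResNet blocks; the
    point is the down path, the up path, and the skip connections
    that ship features from down blocks to matching up blocks.
    """
    skips = []
    # Down path: halve resolution at each level, remember the output.
    for _ in range(2):
        skips.append(x)
        h, w = x.shape
        x = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    # Up path: double resolution and merge with the matching skip.
    for skip in reversed(skips):
        x = x.repeat(2, axis=0).repeat(2, axis=1)
        x = x + skip  # skip connection
    return x

latent = np.ones((8, 8))
out = unet_sketch(latent)
print(out.shape)  # same spatial shape in as out
```

The text-conditioned variant described next keeps this same skeleton and inserts attention layers between the blocks so the prompt can influence every resolution level.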
Layers of the Unet Noise predictor WITH text