Exploring Stable Diffusion
I've begun my journey into AI land by studying how Stable Diffusion works. Read on to learn more.
This year I decided to begin my journey into the land of AI by exploring Stable Diffusion and the magic behind diffusion models.
Why start with diffusion models when the hype these days revolves around Large Language Models? Because I’m still fascinated with Computer Vision.
My first serious foray into the land of AI a few years ago began with Convolutional Neural Networks for Computer Vision, so it’s fitting to resume this journey via a similar route.
I will certainly study LLMs and natural language processing in the near future, but for now my focus is turned towards vision.
What is Stable Diffusion
Stable Diffusion is an innovative deep learning model that can generate stunning images from a textual description (text-to-image mode) or a reference image (image-to-image mode). It’s able to perform such feats by using a technique called latent diffusion.
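To make the text-to-image mode concrete, here is a minimal sketch of generating an image from a prompt. It assumes the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint (both are my assumptions, not something this post depends on):

```python
# Minimal text-to-image sketch using the Hugging Face diffusers library (assumed dependency).
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion v1.5 checkpoint (assumed model id) in half precision on a GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate a single 512 x 512 image from a text prompt.
prompt = "a watercolour painting of a lighthouse at sunset"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```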
In the context of images, latent diffusion is a very interesting technique that generates the final image progressively, by repeatedly removing noise from a starting point that is pure noise. The term latent refers to the fact that this denoising process occurs in what is called a latent space, which has far fewer dimensions than the pixel space of images.
You can think of the latent space as a compressed representation of the target image. Since the model mainly operates in the latent rather than the pixel space, memory and compute costs are significantly reduced, resulting in fast image generation. For instance, the original Stable Diffusion took as input images with dimensions 512 x 512 x 3 (width x height x RGB channels), but operated in a latent space with dimensions 64 x 64 x 4. This means the memory requirements in the latent space are (512 / 64) x (512 / 64) x (3 / 4) = 48 times lower than in pixel space.
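As a rough illustration of that compression, here is a sketch that encodes a 512 x 512 image into the 64 x 64 x 4 latent space using the model's VAE (again assuming the diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint; the 0.18215 scaling constant is specific to the Stable Diffusion v1 VAE):

```python
import torch
from diffusers import AutoencoderKL

# Load just the VAE component of Stable Diffusion v1.5 (assumed model id).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# A dummy 512 x 512 RGB image as a tensor in [-1, 1]: shape (1, 3, 512, 512).
image = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    # Encode into the latent space and apply SD v1's scaling factor.
    latents = vae.encode(image).latent_dist.sample() * 0.18215

print(latents.shape)  # torch.Size([1, 4, 64, 64])

# Pixel space holds 512*512*3 = 786,432 values; latent space holds 64*64*4 = 16,384.
print((512 * 512 * 3) / (64 * 64 * 4))  # 48.0 -- the factor mentioned above
```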
Stable Diffusion isn’t the only popular model to leverage diffusion techniques. Indeed, closed-source models like DALL-E 2/3 and Midjourney use related diffusion-based approaches to produce images that align with the input prompt.
The maths behind diffusion models is not trivial. In fact I’m still drowning in it 🫠, but I hope that over time I’ll develop a better intuition for what is going on in all these formulas and equations.
How I’m learning about Stable Diffusion
At the moment I’m taking part 2 of fast.ai’s 2022 course, which covers Stable Diffusion in depth. Given how intense the course is, I wish I had more time to dedicate to the study of such an important model architecture.
On the experimentation side, I’ve managed to replicate a couple of Google Colab notebooks. I understand the components that make up the text-to-image and image-to-image pipelines, but I’m still digging into the library code to improve my intuition on how everything fits together.
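For anyone curious, the sketch below shows roughly what those components look like when wired together by hand: a tokenizer and CLIP text encoder for the prompt, a UNet that predicts noise in latent space, a scheduler that drives the denoising loop, and the VAE that decodes the final latents. It assumes the diffusers and transformers libraries and omits classifier-free guidance for brevity, so treat it as an outline rather than a faithful reimplementation of the pipeline:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint

# The main components of the text-to-image pipeline.
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# 1. Turn the prompt into CLIP text embeddings.
prompt = ["a cosy cabin in a snowy forest"]
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

# 2. Start from pure noise in the 64 x 64 x 4 latent space.
scheduler.set_timesteps(30)
latents = torch.randn(1, unet.config.in_channels, 64, 64) * scheduler.init_noise_sigma

# 3. Denoising loop: the UNet predicts the noise, the scheduler removes a bit of it.
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Decode the final latents back to pixel space with the VAE.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample  # shape (1, 3, 512, 512)
```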
I also want to develop a better flair for writing good prompts that produce great images. The only way to build that skill is to practice creating lots of images and to learn from other people’s prompts and model hyper-parameters.
While all the work so far has been done on Google Colab notebooks running on remote virtual machines with modest GPUs, I’m planning to revive my desktop computer to run Stable Diffusion on my GTX 1080 Ti. It’s not the most powerful consumer GPU anymore but it will do for my experiments.
I’ve also installed Diffusion Bee to generate images on my MacBook Pro M1 Max. So far I’ve been disappointed with the images it generates, even after loading more powerful models into it (maybe there’s a bug, because it doesn’t seem to actually switch to the new model).
I’ve also already experimented with the more recent Stable Diffusion XL and even dabbled in Stable Video Diffusion! These models are fascinating and even more powerful. I will certainly do more experimentation with them, but for now I will focus on the original Stable Diffusion.
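If it’s useful as a reference, loading SDXL with the same library follows the same pattern as before, just with a different pipeline class and checkpoint (both assumed here), and it produces 1024 x 1024 images by default:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# SDXL base checkpoint (assumed model id); outputs 1024 x 1024 images by default.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cinematic photo of a lighthouse at dusk").images[0]
image.save("lighthouse_xl.png")
```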
Closing thoughts
Diffusion models are a very interesting area of research. I suspect they will also become important in language generation. Moreover, I reckon they will be great for generating synthetic data.
While I’m curious about Stable Diffusion and diffusion models for vision in general, I’m not sure yet what kind of project I will build with this technology. I may do something for fun rather than profit in this space, but maybe I will change my mind given the potentially high GPU costs I would incur (compute ain’t cheap!).
I’ll continue reporting back on my adventures in the provinces of diffusion models.
Stay tuned 🎬