diffusion_models - Mora Project

# Diffusion models

Most modern AI image generators, including [[stable_diffusion|Stable Diffusion]], [[midjourney|Midjourney]], and [[flux1|FLUX.1]], rely on **diffusion models**. At their core, these models solve a specific problem: *how to transform pure, random noise into a coherent, structured image*. Unlike earlier generative approaches like [[GANs]], which try to produce an image in one shot, diffusion is an iterative process. It builds the image step by step, removing a little more noise at each step until a clean image remains. ^overview

## Forward and Reverse Process ^02b04b

"Diffusion" is borrowed directly from thermodynamics. In physics, diffusion describes how particles spontaneously migrate from areas of high concentration toward equilibrium. Entropy always wins. Given enough time, organized drops of dye in a glass of water will disperse into a uniform, featureless blur. To the naked eye, this process appears irreversible, but the *maths* behind it is not. If you can formally write down how structure decays into randomness, ==you can, in principle, run that process in reverse==. ^diffusion-definition

That leap was formalized for machine learning by [[Sohl-Dickstein et al. (2015)|Sohl-Dickstein et al.]] in 2015, and refined in 2020 by [[Ho et al. (2020)|Ho et al.]] with **Denoising Diffusion Probabilistic Models (DDPMs)**. Their central question: *What if a neural network could learn to reverse diffusion, reconstructing order from chaos?* In practice, this means the model operates in two opposed directions: ^order-from-chaos

1. **Forward Diffusion (Training)**: A clear image is slowly corrupted by adding [[Gaussian noise]] over many steps, until it becomes unrecognizable static. The corruption itself follows a fixed schedule and is not learned; the network is trained on the noised images to predict exactly *what* noise was added at each step. ^forward-process
2. **Reverse Diffusion (Generation)**: The model starts with pure static and tries to predict, step by step, `what noise was added here?`. By subtracting its predicted noise, it recovers a slightly cleaner image. It repeats this until a crisp image emerges, typically 20 to 50 times (steps) in current image generators. ^reverse-process

This iterative nature makes diffusion robust. It doesn't have to get the whole image right instantly; it just needs to make the image *slightly less noisy* than it was a moment ago.

## Understanding Steps

Diffusion doesn't treat all steps equally. The model does different kinds of work at different stages of the process, and the boundaries below are approximate.

- **Early Steps (0–40%)**: Determine broad composition: shapes, layout, color balance. The model is "deciding" what goes where.
- **Mid Steps (40–70%)**: Refine objects and their relationships. Forms lock in; details emerge.
- **Late Steps (70–100%)**: Polish texture, lighting, and fine details. The semantic structure is frozen.

Because composition is settled early, adjusting a prompt mid-generation ([[prompt_scheduling|scheduling]]) or emphasizing certain terms early ([[prompt_weighting|weighting]]) can override other concepts entirely.

## Conditioning and Guidance

A **text encoder** (like [[CLIP]] or [[T5]]) translates your words into embeddings that the diffusion model understands. As the model denoises the static, it constantly "looks" at these text embeddings via mechanisms like cross-attention to decide *which* shapes to pull out of the noise. If you ask for a `cat`, it steers the denoising process toward cat-like features and away from everything else. ^overview-conditioning
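
To make the cross-attention idea above a little more concrete, here is a minimal pure-Python sketch of a single attention head in which image patches act as queries and prompt tokens act as keys and values. It is an illustration under simplifying assumptions, not how any particular model is implemented: real models apply learned projection matrices to both sides and use many heads, and the `cross_attention` function, the 4-dimensional embeddings, and the "cat"/"dog" stand-in vectors are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_embeds):
    """Toy single-head cross-attention without learned projections.

    image_feats: one feature vector per image patch (the queries).
    text_embeds: one embedding per prompt token (keys and values).
    Returns (updated_feats, weights), where weights[i][j] is how
    strongly patch i attends to token j.
    """
    dim = len(text_embeds[0])
    scale = 1.0 / math.sqrt(dim)
    updated, weights = [], []
    for query in image_feats:
        # Scaled dot-product score between this patch and every token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in text_embeds
        ]
        attn = softmax(scores)
        # Weighted average of the token embeddings (used as values here).
        mixed = [
            sum(a * tok[d] for a, tok in zip(attn, text_embeds))
            for d in range(dim)
        ]
        weights.append(attn)
        updated.append(mixed)
    return updated, weights


# Hypothetical two-token prompt with 4-dimensional embeddings.
text_embeds = [
    [1.0, 0.0, 0.0, 0.0],  # stand-in for "cat"
    [0.0, 1.0, 0.0, 0.0],  # stand-in for "dog"
]
# Three image patches: one aligned with each token, one aligned with neither.
image_feats = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]

updated, weights = cross_attention(image_feats, text_embeds)
for patch_index, row in enumerate(weights):
    print(patch_index, [round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every patch receives a blend of the prompt tokens. The patch aligned with the "cat" stand-in puts most of its weight on that token, while the patch aligned with neither token splits its attention evenly. In this toy setup, that is the sense in which the prompt "steers" which features get reinforced during denoising.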