How Diffusion Models Work: The Physics of AI Video Generation

The transformation of a chaotic, high-entropy noise field into a coherent, temporally consistent video is not magic; it is the iterative application of a learned vector field. While modern video models like WAN 2.1 appear to “create” from nothing, they are actually performing a sophisticated form of reverse-time navigation: a noise sample is carried, step by step, through a high-dimensional space toward the manifold of realistic data.

The Physics of Denoising: Brownian Motion in Reverse

At the mathematical level, sampling from a diffusion model amounts to numerically solving a stochastic differential equation (SDE). During training, we subject clean data to a forward diffusion process: a controlled, scheduled injection of Gaussian noise that mimics Brownian motion. After enough steps the data is indistinguishable from an isotropic Gaussian, the maximum-entropy distribution for a given variance.
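The forward process can be sketched in a few lines. This is a minimal illustration, assuming the linear beta schedule and closed-form noising popularized by DDPM; the function names, the schedule endpoints, and the toy array are illustrative choices, not something the article specifies.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an (assumed) linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t given x_0 in closed form; also return the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.ones((4, 8, 8))                      # toy stand-in for a clean sample
x_early, eps_early = forward_diffuse(x0, 10, alpha_bar, rng)
x_late, eps_late = forward_diffuse(x0, 999, alpha_bar, rng)
```

At an early step the sample is still dominated by `x0`; at the last step the signal coefficient `sqrt(alpha_bar[t])` is close to zero and the sample is almost pure Gaussian noise.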

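The model’s objective is to learn the “score function,” a vector field that, at every noise level, points toward regions where realistic data is more probable. When we run this process in reverse, we are not simply “removing noise.” We are using the network (a transformer in most recent video models) to estimate the gradient of the log-density of the noised data distribution, $\nabla_x \log p_t(x)$, and following that estimate back toward the data manifold.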
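To make the link between “predicting noise” and “predicting the score” concrete, here is a small numerical check. It is a sketch under a simplifying assumption: the data consists of a single known point `x0`, so the noised distribution is one Gaussian whose score is available in closed form. A trained model instead approximates the score of the full noised data distribution. The helper names are hypothetical.

```python
import numpy as np

def gaussian_score(x, mean, var):
    """Gradient of log N(x; mean, var * I) with respect to x."""
    return -(x - mean) / var

def score_from_noise(eps, alpha_bar_t):
    """Convert a noise estimate into a score estimate at noise level alpha_bar_t."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.5
x0 = np.array([2.0, -1.0])
eps = rng.standard_normal(2)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

exact = gaussian_score(x_t, np.sqrt(alpha_bar_t) * x0, 1.0 - alpha_bar_t)
via_noise = score_from_noise(eps, alpha_bar_t)
```

The two vectors agree, which is why a network trained to predict the injected noise is, up to a known scale factor, a score estimator.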

Watch on YouTube: https://youtube.com/watch?v=iv-5mZ_9CPY

In practice, the naive approach of reversing the chain one small stochastic step at a time is computationally expensive: the original DDPM sampler needs on the order of a thousand network evaluations per sample. Modern implementations instead integrate an Ordinary Differential Equation (ODE), as in DDIM (Denoising Diffusion Implicit Models), to traverse this space deterministically. By treating the generation process as a flow, we can skip the fresh noise injected at every DDPM step and take far fewer, larger steps, which significantly reduces inference latency, usually with little loss in output quality.
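A deterministic DDIM update (the eta = 0 case) can be sketched as follows. This is an illustration rather than any particular model's sampler: the linear schedule, the five-step timestep list, and the `oracle_eps` function are assumptions, and the oracle stands in for a trained network by cheating with knowledge of a single target `x0`.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) between two noise levels."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # Move that estimate to the next, lower noise level, reusing the same eps.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bar, timesteps):
    """Walk a decreasing list of timesteps; no fresh noise is drawn."""
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(x, eps_model(x, t), alpha_bar[t], alpha_bar[t_prev])
    return x

# alpha_bar[0] = 1 stands for clean data; higher indices are noisier.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

# Hypothetical oracle in place of a trained network: it knows the target x0.
x0 = np.array([1.5, -0.5, 2.0])

def oracle_eps(x, t):
    return (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
x_T = rng.standard_normal(3)
timesteps = [1000, 800, 600, 400, 200, 0]    # 5 steps instead of 1000
x_final = ddim_sample(oracle_eps, x_T, alpha_bar, timesteps)
```

Because no randomness enters the loop, the same starting noise `x_T` always produces the same output, and the step count is a free choice of the sampler rather than a property of the trained model.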

CLIP and the Geometry of Concepts

The bridge between the diffusion process and human intent is the CLIP (Contrastive Language-Image Pre-training) embedding space. CLIP maps text and images into a shared vector space (512-dimensional in the ViT-B/32 variant, larger in others) using a contrastive loss objective. By maximizing cosine similarity between matching pairs and minimizing it for non-matching ones, the model learns a geometry in which the semantic similarity between concepts, such as “hat” or “astronaut,” becomes a measurable distance.
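The contrastive objective can be sketched with NumPy. This is a simplified illustration, assuming a fixed temperature of 0.07 (CLIP learns this value during training) and random vectors in place of real encoder outputs; `clip_contrastive_loss` is an illustrative name, not a library function.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # pairwise cosine similarities, scaled
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
image_emb = rng.standard_normal((4, 512))
text_aligned = image_emb + 0.1 * rng.standard_normal((4, 512))   # near-matching pairs
text_shuffled = np.roll(text_aligned, 1, axis=0)                  # deliberately mismatched

loss_aligned = clip_contrastive_loss(image_emb, text_aligned)
loss_shuffled = clip_contrastive_loss(image_emb, text_shuffled)
```

The loss is low when each image sits closest to its own caption and high when the pairs are scrambled, which is the pressure that shapes the shared geometry.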

However, CLIP is an encoder, not a generator: it can embed and compare text and images, but it cannot synthesize pixels. To bridge this, we use conditioning. By injecting the text embeddings into the transformer’s cross-attention layers, we bias the denoising process. The model no longer just moves toward “any” realistic image; it moves toward the specific region of the manifold defined by the text embedding.
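The injection mechanism can be illustrated with a single-head cross-attention layer. This is a bare sketch under stated assumptions: random weights instead of trained ones, no multi-head split, no output projection or residual connection, and arbitrary token counts and widths chosen only for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Queries come from the noisy latents; keys and values come from the text."""
    q = latent_tokens @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_latents, num_text_tokens)
    return weights @ v, weights

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16, 64))    # 16 latent patches, width 64 (assumed)
text_tokens = rng.standard_normal((5, 512))      # 5 prompt tokens, width 512 (assumed)
w_q = rng.standard_normal((64, 32)) / np.sqrt(64)
w_k = rng.standard_normal((512, 32)) / np.sqrt(512)
w_v = rng.standard_normal((512, 64)) / np.sqrt(512)

update, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
```

Each latent patch receives a mixture of text-derived value vectors, weighted by how strongly its query matches each prompt token. That mixture is the channel through which the prompt steers the denoising step.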

Classifier-Free Guidance: Amplifying the Signal

Conditioning alone often results in “mean-seeking” behavior—the model produces a blurry, average version of the requested concept. To force the model to adhere strictly to the prompt, we employ Classifier-Free Guidance (CFG).

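This is a clever mathematical hack. At every sampling step during inference, we perform two forward passes: one conditioned on the text prompt and one unconditioned (or “null”). We calculate the difference between these two predicted vector fields and amplify it by a scaling factor ($\alpha$); setting $\alpha = 1$ recovers plain conditional sampling, while larger values push harder toward the prompt.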

$v_{guided} = v_{uncond} + \alpha(v_{cond} - v_{uncond})$

This effectively pushes the generation trajectory away from the “average” (the unconditioned path) and toward the specific semantic target. It is also why modern models can handle negative prompts: in common implementations, the embedding of the unwanted features (e.g., “extra fingers”) takes the place of the null prompt, so $v_{uncond}$ becomes the prediction for the negative prompt and the same formula steers the trajectory away from those features.
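The guidance formula translates directly into code. The sketch below uses toy two-dimensional vectors in place of real model outputs. The negative-prompt line assumes the common practice of feeding the negative prompt through the branch that would otherwise receive the null prompt; the variable names are illustrative.

```python
import numpy as np

def classifier_free_guidance(v_cond, v_uncond, alpha):
    """v_guided = v_uncond + alpha * (v_cond - v_uncond)."""
    return v_uncond + alpha * (v_cond - v_uncond)

# Toy predictions standing in for two forward passes of the same network.
v_uncond = np.array([0.0, 0.0])       # null-prompt prediction
v_cond = np.array([1.0, 0.0])         # prompt-conditioned prediction
v_negative = np.array([0.0, 1.0])     # prediction conditioned on the negative prompt

plain = classifier_free_guidance(v_cond, v_uncond, alpha=1.0)      # equals v_cond
guided = classifier_free_guidance(v_cond, v_uncond, alpha=7.5)     # amplified direction

# Negative prompt: the "away from" branch is the negative-prompt prediction.
steered = classifier_free_guidance(v_cond, v_negative, alpha=7.5)
```

With these toy values, `guided` is `[7.5, 0.0]` and `steered` is `[7.5, -6.5]`: the second component goes negative because the update moves away from the direction the negative prompt favors.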

The Future of Latent Trajectories

We are moving past the era where diffusion models were viewed as simple image-to-image filters. We are now treating them as dynamic, time-varying vector fields. The transition from stochastic, noise-heavy sampling to deterministic flow-matching represents a fundamental shift in how we perceive generative AI.

The implication is clear: the bottleneck is no longer the model’s ability to “imagine,” but our ability to navigate the latent space with precision. As we refine our control over these vector fields—using techniques like negative prompt subtraction and dynamic guidance scaling—we are effectively turning the “black box” of neural generation into a programmable engineering tool. The next frontier isn’t just bigger models; it’s more efficient, deterministic control over the flow of high-dimensional data.

Sources

https://www.youtube.com/watch?v=iv-5mZ_9CPY
