CS 180 Project 5: Fun With Diffusion Models!

By Ediz Ertekin Jr.

Part A: The Power of Diffusion Models!

Introduction

For project 5a, I had the opportunity to explore applying noise to images as well as various methods of denoising. I then worked with diffusion models to generate samples, both from a given image and from a blank noisy canvas. Next I looked into image-to-image translation for web images as well as hand-drawn images. After that I applied inpainting to images given a mask. Finally, I explored visual anagrams and hybrid images.

Part 0. Setup

In part 0, I downloaded the DeepFloyd IF model as well as precomputed text embeddings. DeepFloyd is a two-stage model trained by Stability AI. Below are a few results with differing numbers of denoising iterations, which alter the sample output.
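
For reference, a minimal sketch of how the two DeepFloyd IF stages can be loaded and run through Hugging Face diffusers; the model names follow the public DeepFloyd release, and prompt_embeds / negative_embeds stand in for the precomputed text embeddings used in this project.

    import torch
    from diffusers import DiffusionPipeline

    # Stage 1 generates a 64x64 image; stage 2 upsamples it to 256x256.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
    stage_2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16)

    image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                    num_inference_steps=20, output_type="pt").images
    image = stage_2(image=image, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images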

[Figures, 20 iterations: "An oil painting of a snowy village", "A man wearing a hat", "A rocket ship"]
[Figures, 40 iterations: "An oil painting of a snowy village", "A man wearing a hat", "A rocket ship"]
[Figures, 200 iterations: "An oil painting of a snowy village", "A man wearing a hat", "A rocket ship"]

Part 1. Sampling Loops

In this part I wrote my own "sampling loops" that use the pretrained DeepFloyd denoisers. First I implemented the forward process, which applies noise to an image. I then implemented various methods of denoising: classical denoising, one-step denoising, and iterative denoising.

1.1 Implementing the Forward Process

Below I generate noisy versions of the test image at timesteps t = 250, 500, and 750 by scaling the image and adding Gaussian noise, with more noise at larger t.
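
Concretely, the forward process is x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps with eps ~ N(0, I). A minimal sketch, assuming alphas_cumprod is the scheduler's precomputed cumulative alpha schedule (a 1-D tensor indexed by t):

    import torch

    def forward(im, t, alphas_cumprod):
        # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(im)
        return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps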

[Figures: Original Campanile; Noisy Campanile at t = 250, 500, 750]

1.2 Classical Denoising

Next I applied Gaussian blur filtering to try to remove the noise from the images. However, the results are not great.
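
This is just a blur, which trades noise for detail; a one-line sketch using torchvision (the kernel size and sigma here are illustrative, not necessarily the values I used):

    import torchvision.transforms.functional as TF

    # Low-pass the noisy image; noise is suppressed but so is real detail.
    blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)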

[Figures: Noisy Campanile at t = 250, 500, 750]
[Figures: Gaussian Blur Denoising at t = 250, 500, 750]

1.3 One-Step Denoising

For this part I use the pretrained diffusion model to denoise the noisy images computed in 1.1. The UNet estimates the noise in each noisy image, and removing that estimated noise yields an estimate of the original image.
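
A sketch of the one-step estimate, assuming unet and prompt_embeds come from the DeepFloyd stage-1 pipeline and that the first three output channels hold the noise estimate:

    import torch

    with torch.no_grad():
        noise_est = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert the forward process in one shot to estimate the clean image.
    a_bar = alphas_cumprod[t]
    x0_est = (x_t - (1 - a_bar).sqrt() * noise_est) / a_bar.sqrt()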

[Figures: Noisy Campanile at t = 250, 500, 750]
[Figures: One-Step Denoised Campanile at t = 250, 500, 750]

1.4 Iterative Denoising

One-step denoising does a decent job, but diffusion models are meant to denoise iteratively. We don't want iterative denoising to be slow, so we can skip steps: I created a new list of timesteps called strided_timesteps and denoised the image at each timestep in that list using the model. Below are the results at every fifth timestep in strided_timesteps, along with the final clean image and, for comparison, the original Campanile, the one-step denoised image, and the Gaussian-blur denoised image.
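
Each stride from a noisier timestep t to a cleaner timestep t_next uses the DDPM posterior mean, interpolating between the current image and the current clean-image estimate. A sketch of one update, where x0_est is the clean estimate derived from the UNet's noise prediction and v_sigma stands for the variance noise term (details omitted here):

    a_bar, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
    alpha = a_bar / a_bar_next        # effective alpha for this stride
    beta = 1 - alpha
    x_next = (a_bar_next.sqrt() * beta / (1 - a_bar)) * x0_est \
             + (alpha.sqrt() * (1 - a_bar_next) / (1 - a_bar)) * x_t \
             + v_sigma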

[Figures: Noisy Campanile at t = 90, 240, 390, 540, 690]
[Figures: Original; Iteratively Denoised Campanile; One-Step Denoised Campanile; Gaussian Blurred Campanile]

1.5 Diffusion Model Sampling

Next we use the diffusion model to create images from scratch by denoising a canvas of pure noise, using the prompt "a high quality photo" for the embeddings.

[Figures: Samples 1 through 5]

1.6 Classifier-Free Guidance (CFG)

From section 1.5, we can see that some of the images in the prior section are not very good, and some are completely nonsensical. To improve these results we can use classifier-free guidance (CFG). I computed a conditional and an unconditional noise estimate and combined them into a new estimate using a scale factor that controls the strength of CFG. For the conditional prompt I used "a high quality photo" and for the unconditional prompt I used the empty string "". Below are 5 images generated with the prompt "a high quality photo" and a CFG scale of 7.
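
The combined estimate pushes the conditional prediction further away from the unconditional one. A sketch, with gamma as the CFG scale (gamma = 7 here) and cfg_noise_est as a hypothetical helper name reused in later sections:

    def cfg_noise_est(x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
        # Two UNet passes: conditional and unconditional noise estimates.
        eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
        # gamma > 1 amplifies the direction the prompt pulls the estimate.
        return eps_uncond + gamma * (eps_cond - eps_uncond)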

[Figures: Samples 1 through 5 with CFG]

1.7 Image-to-image Translation

The more noise we add, the larger the edit will be; however, larger edits also force the model to hallucinate more. I took the original test image, applied the forward (noising) process to it, and then forced it back onto the natural image manifold by iteratively denoising, without any extra conditioning. Below I applied this translation to the image of the Campanile as well as to two images of my own.
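
A sketch of the edit procedure (SDEdit-style), assuming forward is the noising function from 1.1 and iterative_denoise is the loop from 1.4, assumed here to take the starting index:

    # Noise the image up to strided_timesteps[i_start], then denoise back.
    # Smaller i_start means more noise and therefore a larger edit.
    x_t = forward(im, strided_timesteps[i_start], alphas_cumprod)
    edited = iterative_denoise(x_t, i_start=i_start)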

[Figures: Campanile edits at i_start = 1, 3, 5, 7, 10, 20; original Campanile]
[Figures: Beanie Boy edits at i_start = 1, 3, 5, 7, 10, 20; original Beanie Boy]
[Figures: Bubble Skater edits at i_start = 1, 3, 5, 7, 10, 20; original Bubble Skater]

1.7.1 Editing Hand-Drawn and Web Images

Next I apply the above procedure to hand-drawn images as well as an image from the web. The first image, from the web, is of an avocado; I then drew a flower and a ball guy with glasses.

[Figures: Avocado edits at i_start = 1, 3, 5, 7, 10, 20; original Avocado]
[Figures: Drawn Flower edits at i_start = 1, 3, 5, 7, 10, 20; original Drawn Flower]
[Figures: Drawn Ball Guy edits at i_start = 1, 3, 5, 7, 10, 20; original Drawn Ball Guy]

1.7.2 Inpainting

Next I applied this procedure to inpaint an image. I took an image along with a mask indicating which part of the image may be replaced or altered. I then ran the same iterative denoising as above while, at every step, forcing the region outside the mask to match an appropriately noised copy of the original image. Below are a few examples: the first changes the top of the Campanile, the second changes the hat of a cartoon boy, and the third replaces the Porsche with a soccer ball. For the first two I used the prompt "a high quality image" and for the last I used "a soccer ball", just to try something different.
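
A sketch of the inpainting loop, where m is 1 inside the region to regenerate and 0 elsewhere, and denoise_step is a hypothetical helper performing one update from section 1.4:

    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
        x_t = denoise_step(x_t, t)                 # one iterative-denoising step
        # Outside the mask, force the image back to a noised copy of the
        # original, so only the masked region is actually generated.
        x_t = m * x_t + (1 - m) * forward(orig_im, t_next, alphas_cumprod)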

[Figures: Campanile: Original, Mask, Hole to Fill, Campanile Inpainted]
[Figures: Beanie Boy: Original, Mask, Hole to Fill, Beanie Boy Inpainted]
[Figures: Porsche: Original, Mask, Hole to Fill, Porsche Inpainted with Soccer Ball Prompt]

1.7.3 Text-Conditional Image-to-image Translation

Next I use the same procedure as above, but guide the projection with a text prompt. I used the prompt "a rocket ship" on various images. I input a picture of the Iron Giant in an attempt to create Iron Man, as well as a picture of a car to create a rocket car. Below are my results.

[Figures: Campanile as "a rocket ship" at noise levels 1, 3, 5, 7, 10, 20; original Campanile]
[Figures: Iron Giant as "a rocket ship" at noise levels 1, 3, 5, 7, 10, 20; original Iron Giant]
[Figures: Porsche as "a rocket ship" at noise levels 1, 3, 5, 7, 10, 20; original Porsche]

1.8 Visual Anagrams

Next I applied this functionality to create visual anagrams. To do this we pass in two prompts: we estimate the noise on the original image with prompt 1, then flip the image, estimate the noise with prompt 2, and flip that second estimate back. Averaging the two noise estimates gives the noise used for each denoising step, producing an image that matches prompt 1 right side up and prompt 2 upside down. Below are a few examples along with their prompts.
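
A sketch of one step, assuming batched (B, C, H, W) tensors and the cfg_noise_est helper sketched in section 1.6; flipping along dim -2 turns the image upside down:

    import torch

    # Estimate noise for prompt 1 on the upright image, and for prompt 2 on
    # the flipped image (flipping that estimate back before combining).
    eps1 = cfg_noise_est(x_t, t, p1_embeds, uncond_embeds)
    eps2 = torch.flip(cfg_noise_est(torch.flip(x_t, dims=[-2]), t,
                                    p2_embeds, uncond_embeds), dims=[-2])
    eps = (eps1 + eps2) / 2            # the average drives the denoising step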

[Figures: anagram of "an oil painting of an old man" / "an oil painting of people around a campfire"]
[Figures: anagram of "a photo of a man" / "a photo of the amalfi coast"]
[Figures: anagram of "an oil painting of a dog" / "an oil painting of a snowboarder"]

1.9 Hybrid Images

The last part of part A is to use the above function to generate hybrid images, similar to what we did in project 2. To create a hybrid image, I compute two noise estimates, one for each prompt, then apply a low-pass filter to the first estimate and a high-pass filter to the second and sum them. The result is a hybrid image that reads as the first prompt from far away and the second prompt up close. Below are a few examples along with their prompts.
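
A sketch of the composite noise estimate, again reusing the cfg_noise_est helper from 1.6; the Gaussian-blur kernel size and sigma are illustrative choices:

    import torchvision.transforms.functional as TF

    eps1 = cfg_noise_est(x_t, t, p1_embeds, uncond_embeds)   # low-frequency prompt
    eps2 = cfg_noise_est(x_t, t, p2_embeds, uncond_embeds)   # high-frequency prompt
    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
    eps = low + high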

[Figures: hybrid of a skull and a waterfall; hybrid of a soccer ball and a dog; hybrid of a dog and a snowboarder]

Part B: Diffusion Models from Scratch!

Introduction

For part B of this project I trained a UNet model that iteratively denoises images from the MNIST dataset. I used the algorithms defined in the research paper "Denoising Diffusion Probabilistic Models" by Jonathan Ho, Ajay Jain, and Pieter Abbeel.

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

For the first part I built a simple one-step denoiser trained with an L2 loss. The denoiser is implemented as a UNet, which contains downsampling and upsampling blocks with skip connections. Below is the architecture of the UNet.

[Figure: Unconditional UNet architecture]

1.2 Using the UNet to Train a Denoiser

Below is a grid of digits from the MNIST dataset with the noising function applied at different noise levels, sigma = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].
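
The noising function here is simply z = x + sigma * eps with eps ~ N(0, I):

    import torch

    def add_noise(x, sigma):
        # z = x + sigma * eps, eps ~ N(0, I)
        return x + sigma * torch.randn_like(x)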

[Figure: Varying levels of noise on MNIST digits]

1.3 Training

To train my model to denoise noisy images, I used sigma = 0.5 and trained for 5 epochs. I used the UNet architecture defined in section 1.1 with a hidden dimension of 128, and the Adam optimizer with a learning rate of 1e-4. Below is a plot of my training loss.
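
A minimal sketch of the training loop, assuming model is the UNet from 1.1 and train_loader yields MNIST batches; noise is generated on the fly, so each image is noised differently every epoch:

    import torch
    import torch.nn.functional as F

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(5):
        for x, _ in train_loader:               # labels are unused here
            z = x + 0.5 * torch.randn_like(x)   # sigma = 0.5
            loss = F.mse_loss(model(z), x)      # L2 loss against clean digits
            opt.zero_grad()
            loss.backward()
            opt.step()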

[Figure: Training loss curve]
[Figures: Results on test-set digits after 1 epoch and after 5 epochs of training]

1.4 Out-of-Distribution Testing

The denoiser was trained on sigma = 0.5, so we also want to test how this model performs on images with different noise levels. Below are the results on digits from the test set with varying noise levels.

[Figure: Results on test-set digits at varying noise levels, sigma = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

For the next part we transform the unconditional UNet into a UNet conditioned on the timestep t. I injected the scalar t into the UNet to condition it, and then trained the model to predict the noise added to a noisy image. Below is the architecture of the conditioned UNet.

[Figure: Time-conditioned UNet architecture]

2.2 Training the UNet

To train the model, I pick a random image from the training set and a random timestep t, and train the denoiser to predict the noise added at that timestep. I use the MNIST dataset for the training set with a batch size of 128 and train for 20 epochs. As mentioned in the instructions, I normalize t by T, the total number of timesteps (300). I used the Adam optimizer with a learning rate of 1e-3 and a learning-rate scheduler, calling scheduler.step() after every epoch. I used the time-conditioned UNet architecture defined in section 2.1 with a hidden dimension of 64. Below is a plot of my training loss.
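
A sketch of the training loop under my assumptions about indexing (t sampled in [0, T)) and the model signature (normalized t as the second argument); the exponential scheduler decay rate shown is illustrative:

    import torch
    import torch.nn.functional as F

    T = 300
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1 / 20))
    for epoch in range(20):
        for x, _ in train_loader:
            t = torch.randint(0, T, (x.shape[0],))
            a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
            eps = torch.randn_like(x)
            x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps
            loss = F.mse_loss(model(x_t, t / T), eps)   # predict the noise
            opt.zero_grad()
            loss.backward()
            opt.step()
        scheduler.step()                                # decay LR once per epoch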

[Figure: Training loss curve]

2.3 Sampling from the UNet

I used a sampling process similar to the one for the DeepFloyd model in part A, except that rather than estimating the variance I use the fixed beta schedule. Sampling starts from a blank (pure-noise) canvas. Below are samples generated after the 5th and 20th epochs of training. I use the same random seed in both cases so the same digits are produced; as you can see below, the digits from epoch 20 look slightly better than those from epoch 5.
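
A sketch of the sampling loop, assuming betas, alphas, and alphas_cumprod are precomputed with T + 1 entries so they can be indexed directly by t:

    import torch

    x = torch.randn(num_samples, 1, 28, 28)             # blank noisy canvas
    for t in range(T, 0, -1):
        eps = model(x, torch.full((num_samples,), t / T))
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # clean estimate
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        # Posterior mean plus fixed-variance noise from the beta schedule.
        x = (a_bar_prev.sqrt() * betas[t] / (1 - a_bar)) * x0 \
            + (alphas[t].sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x \
            + betas[t].sqrt() * z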

[Figures: Samples after 5 epochs and after 20 epochs of training]

2.4 Adding Class-Conditioning to UNet

For the last model, in order to improve the results and gain more control over the generated images, I added class conditioning to the UNet for digits 0 through 9. The class is encoded as a one-hot vector rather than a single scalar. I used the same architecture as in part 2.2, with the additional conditioning variable.
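
A sketch of how the class enters training under my assumptions: the one-hot vector is occasionally zeroed out so the model retains an unconditional mode (the 10% drop rate is an assumption, not something stated above):

    import torch
    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float()        # class as one-hot
    drop = torch.rand(c.shape[0], 1) < 0.1               # hypothetical 10% dropout
    c = torch.where(drop, torch.zeros_like(c), c)
    pred = model(x_t, t / T, c)                          # condition on t and c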

[Figure: Training loss curve]

2.5 Sampling from the Class-Conditioned UNet

Similar to part 2.3, I generated sampling results from the class-conditioned UNet after 5 and 20 epochs of training. However, rather than generating random digits, I generate 4 instances of each digit from 0 to 9. Below are the results; the samples generated at epoch 20 look better than those at epoch 5.

[Figures: Results after 5 epochs and after 20 epochs of training]