Project 5

Part A

0. Setup

I set the random seed to 11220099 and generated images from the DeepFloyd IF diffusion model. Below are examples generated with num_inference_steps=5 and num_inference_steps=15.

Step=5:

Images: "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship" (stage 1 and stage 2 outputs).

Step=15:

Images: "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship" (stage 1 and stage 2 outputs).

We can see that the images generated with more inference steps have significantly higher quality.

1.1 Implementing the Forward Process

To efficiently create noised images at arbitrary time steps, I implemented the forward process with the formula below.

x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon

where \epsilon \sim N(0, I), x_0 is the clean image, and x_t is the noised image at time step t.
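In code, the forward process is a single weighted sum. A minimal NumPy sketch (names are illustrative; the actual implementation uses the scheduler's `alphas_cumprod` tensor in PyTorch):

```python
import numpy as np

def forward(x0, t, alpha_bar):
    """One-shot forward process: noise the clean image x0 to time step t.

    alpha_bar[t] is the cumulative product of the alphas up to step t
    (in the real project this comes from the scheduler's alphas_cumprod).
    """
    eps = np.random.randn(*x0.shape)  # eps ~ N(0, I)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return xt, eps
```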

Below are examples of different noise levels.

Images: clean, t=250, t=500, t=750.

1.2 Classical Denoising

The classical denoising approach, Gaussian blur filtering, can hardly produce good results. Below are examples.

Images: t=250, t=500, t=750.

1.3 One-Step Denoising

The pretrained UNet, trained to estimate the noise added to an image, denoises much better. Given the noised image x_t at time step t with an unknown noise \epsilon, the estimated clean image \hat{x}_0 can be derived from the estimated noise \hat\epsilon:

x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}}

where \hat\epsilon is the noise estimated by the UNet; \hat{x}_0 is our one-step denoising result. Below are examples.

Top row: the input images at each noise level (clean, t=250, t=500, t=750). Bottom row: the corresponding one-step denoising results (N/A for the clean image).
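The one-step estimate is essentially one line of code. A NumPy sketch with illustrative names:

```python
import numpy as np

def estimate_x0(xt, eps_hat, t, alpha_bar):
    """Invert the forward equation: recover the clean-image estimate
    x0_hat from the UNet's noise estimate eps_hat."""
    return (xt - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
```

If the noise estimate were exact, this would recover the clean image perfectly; in practice the UNet's estimate is imperfect, which is why iterative denoising works better.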

1.4 Iterative Denoising

We can also use iterative denoising to get better results. I followed the provided formula to compute the estimated image at the previous time step.

x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\,x_t + v\sigma

where t' is the previous time step in the schedule (t' < t), \alpha_t = \bar\alpha_t / \bar\alpha_{t'}, \beta_t = 1 - \alpha_t, and \hat{x}_0 is our current estimate of the clean image. The image gradually became less noisy during the denoising loop.
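One update of this loop can be sketched in NumPy (names are illustrative; the variance term v\sigma is passed in as precomputed noise):

```python
import numpy as np

def denoise_step(xt, x0_hat, t, t_prev, alpha_bar, noise=None):
    """One step of the iterative update: blend the clean-image
    estimate with the current noisy image, using coefficients
    derived from the cumulative alpha products."""
    alpha_t = alpha_bar[t] / alpha_bar[t_prev]
    beta_t = 1 - alpha_t
    coef_x0 = np.sqrt(alpha_bar[t_prev]) * beta_t / (1 - alpha_bar[t])
    coef_xt = np.sqrt(alpha_t) * (1 - alpha_bar[t_prev]) / (1 - alpha_bar[t])
    x_prev = coef_x0 * x0_hat + coef_xt * xt
    if noise is not None:        # the v*sigma term, omitted at t' = 0
        x_prev = x_prev + noise
    return x_prev
```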

Images: six intermediate results of the denoising loop, from noisiest to cleanest.

Compared with other approaches:

Images: Original, Noised, Iterative denoised, One-step denoised, Gaussian filter.

1.5 Diffusion Model Sampling

Starting from pure noise, I sampled random images from the diffusion model with the prompt "a high quality photo."

1.5.11.5.21.5.31.5.41.5.5

1.6 Classifier Free Guidance

Classifier-Free Guidance estimates two noises at every iteration: an unconditional one \epsilon_u (estimated without a prompt) and a conditional one \epsilon_c (estimated with a prompt). From these we derive the guided noise \epsilon:

\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)

where \gamma = 7 in our setting.
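The guidance formula itself is trivial to implement; a NumPy sketch:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: push the estimate past the
    conditional one, in the direction the prompt pulls.
    gamma = 1 recovers the plain conditional estimate."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```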

I used "a high quality photo" as a conditional prompt and "" as an unconditional prompt. Below are the results.

1.6.11.6.21.6.31.6.41.6.5

This gives us much better images compared to the images from the last section.

1.7 Image-to-image Translation

I leveraged CFG to edit existing images: I first noised the original image at several levels, then iteratively denoised each with CFG. Below are examples.

For each test image (the Campanile, a professor photo, and a UC Berkeley photo), results are shown at noise levels 1, 3, 5, 7, 10, and 20, alongside the original image.

1.7.1 Editing Hand-Drawn and Web Images

I used this procedure on non-realistic images (web images and hand-drawn images).

For a web image of Frieren, results are shown at noise levels 1, 3, 5, 7, 10, and 20, alongside the original; results for two hand-drawn images follow.

1.7.2 Inpainting

By using a mask, we can keep some areas of the image unchanged during the denoising process. Formally, the image x_t at time step t is obtained by:

x_t \leftarrow m\,x_t + (1 - m)\,\text{forward}(x_0, t)

where m is a binary mask with values in \{0, 1\}.
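A NumPy sketch of this masking step (illustrative names; `forward` is inlined as the one-shot noising formula from section 1.1):

```python
import numpy as np

def inpaint_step(xt, x0, t, mask, alpha_bar):
    """After each denoising step, force pixels outside the mask back
    to a freshly noised copy of the original image x0. mask == 1
    marks the region being regenerated."""
    eps = np.random.randn(*x0.shape)
    noised = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return mask * xt + (1 - mask) * noised
```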

1.7.30

1.7.31

1.7.32

1.7.3 Text-Conditional Image-to-image Translation

I used "a rocket ship" as the conditional prompt to translate my images into rocket ships. The process used the iterative CFG from section 1.6; the only difference was the conditional prompt.

1.7.33

1.8 Visual Anagrams

To generate images that look like one thing right-side up and another when flipped upside down, we can denoise with the average of two noise estimates: the "normal noise" and the "flipped noise".

\epsilon_1 = \text{UNet}(x_t, t, p_1)
\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))
\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}

where p_1, p_2 are the two prompts.
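A NumPy sketch of this averaging, assuming `unet` is any callable `unet(x, t, prompt) -> noise` (the flip is upside-down, i.e. along the image's height axis):

```python
import numpy as np

def anagram_noise(xt, t, unet, p1, p2):
    """Average the upright noise estimate with one computed on the
    flipped image and flipped back, so both orientations are denoised
    toward their respective prompts."""
    eps1 = unet(xt, t, p1)
    eps2 = np.flip(unet(np.flip(xt, axis=-2), t, p2), axis=-2)
    return (eps1 + eps2) / 2
```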

Image pairs (upright / flipped): "an oil painting of people around a campfire" / "an oil painting of an old man"; "a pencil" / "a rocket ship"; "a lithograph of waterfalls" / "a lithograph of a skull".

1.9 Hybrid Images

Similarly, operating on the two estimated noises before combining them lets us play other tricks: applying low-pass and high-pass filters to the two estimates gives us hybrid images.

\epsilon_1 = \text{UNet}(x_t, t, p_1)
\epsilon_2 = \text{UNet}(x_t, t, p_2)
\epsilon = \frac{\text{lowpass}(\epsilon_1) + \text{highpass}(\epsilon_2)}{2}
Images: skull (far) + waterfall (close); a man wearing a hat + a rocket ship; a photo of a man + a photo of the Amalfi coast.

For the low-pass filtering, I used kernel size = 33 and sigma = 2.
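A NumPy sketch of the filtering step, using a separable Gaussian blur with the stated kernel size and sigma (the high-pass is the residual after the low-pass; names are illustrative):

```python
import numpy as np

def gaussian_kernel(size=33, sigma=2.0):
    """1-D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    k = np.exp(-ax**2 / (2 * sigma**2))
    return k / k.sum()

def lowpass(img, size=33, sigma=2.0):
    """Separable Gaussian blur along the last two (spatial) axes."""
    k = gaussian_kernel(size, sigma)
    rows = np.apply_along_axis(np.convolve, -1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, -2, rows, k, mode="same")

def hybrid_noise(eps1, eps2, size=33, sigma=2.0):
    """Low frequencies from eps1, high frequencies from eps2,
    averaged as in the formula above."""
    high = eps2 - lowpass(eps2, size, sigma)  # high-pass = residual
    return (lowpass(eps1, size, sigma) + high) / 2
```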

Part B

1.1 Implementing the UNet

I followed the figures and instructions in the project description to implement the UNet. The architecture is defined as follows:

2.1.1.1

2.1.1.2

1.2 Using the UNet to Train a Denoiser

To train a UNet as a denoiser, I added Gaussian noise to the input images. Below are examples of MNIST digits with Gaussian noise added.

2.1.2.1

1.2.1 Training

I trained a one-step denoiser to denoise noisy images, obtained by adding Gaussian noise with \sigma = 0.5 to clean images.

Hyper-parameters:

I logged the loss every 10 steps to plot the loss curve.

2.1.2.2
Loss curve
2.1.2.3
Results after training 1 epoch
2.1.2.4
Results after training 5 epochs

1.2.2 Out-of-Distribution Testing

Since my denoiser is trained on MNIST digits noised with \sigma = 0.5, it shouldn't perform as well when denoising digits noised with a different \sigma.

2.1.2.5

2.1 Adding Time Conditioning to UNet

To train a diffusion model that estimates the noise at every time step, I started by adding time conditioning to the UNet (as the project description suggests). This is done by adding two fully-connected layers to the UNet.
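The conditioning idea can be sketched in NumPy, assuming a two-layer fully-connected block that maps the normalized scalar t to a per-channel vector which modulates a feature map (all names and shapes here are illustrative, not the project's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_block(t, w1, b1, w2, b2):
    """Two fully-connected layers with a nonlinearity in between
    (ReLU as a stand-in for GELU), mapping scalar t to a vector."""
    h = np.maximum(0.0, w1 @ np.atleast_1d(t) + b1)
    return w2 @ h + b2

# Illustrative usage: condition a (C, H, W) feature map on t.
C = 64
w1, b1 = rng.standard_normal((C, 1)), np.zeros(C)
w2, b2 = rng.standard_normal((C, C)), np.zeros(C)
features = rng.standard_normal((C, 8, 8))
t = 0.3  # time step normalized to [0, 1]
conditioned = features * fc_block(t, w1, b1, w2, b2)[:, None, None]
```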

2.2.1.1

2.2.1.2

2.2 Training the UNet

I used the hyper-parameters below:

I logged the loss every 10 steps to plot the loss curve.

2.2.2.1

2.3 Sampling from the UNet

The time-conditional sampling is similar to section 1.4 in Part A.

x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\,x_t + v\sigma

I used this equation to iteratively denoise. Below are examples.

2.2.3.1
Results after training 5 epochs
2.2.3.2
Results after training 20 epochs

2.4 Adding Class-Conditioning to UNet

Class-conditioning is added to the UNet with two more fully-connected layers. During training, I passed class labels into the model; that is, \hat\epsilon = \text{UNet}(x_t, t, c), where t is the time step and c is the class label. The class label was also set to the zero vector with 10% probability so that the UNet still works in the unconditional case.
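The label dropout can be sketched as follows (a NumPy sketch with illustrative names; the real project does this on one-hot tensors in PyTorch):

```python
import numpy as np

def one_hot_with_dropout(labels, num_classes=10, p_uncond=0.1, rng=None):
    """One-hot encode class labels, then zero out each row with
    probability p_uncond; the zero vector stands for "no class",
    so the model also learns the unconditional case."""
    rng = rng or np.random.default_rng()
    c = np.eye(num_classes)[np.asarray(labels)]
    keep = rng.random(len(labels)) >= p_uncond
    return c * keep[:, None]
```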

Training hyper-parameters:

I logged the loss every 10 steps to plot the loss curve.

2.2.4.1

2.5 Sampling from the Class-Conditioned UNet

The sampling is very similar to section 1.6 in Part A: CFG guides the model toward class-conditioned estimates.

2.2.5.1
Results after training 1 epoch
2.2.5.2
Results after training 5 epochs
2.2.5.3
Results after training 20 epochs