That's really awesome, but I have some questions.

What is needed for this to work? We have the initial prompt, resolution, seed, scale, steps, sampler, and of course the resulting image. Then we somehow fix the general composition and change the prompt, but leave everything else intact? So the most important elements are the prompt and the resulting image?

Can we take a non-generated picture, write some "original" prompt and associate the two with each other, then change the prompt and expect it to work? But what about all the other parameters...

Or is this what img2img will achieve?

Or maybe I'm completely wrong and it works in an entirely different way?

First question: Yes. Right now the control mechanisms are really basic: you have an initial prompt (which you can generate to see what the image looks like), then a second prompt that is an edit of the first. The algorithm will generate your second prompt so that it looks as "close" as possible to the first (with the concept of closeness being encoded inside the network). You can also tweak the weight of each token to reduce or increase its contribution to the final image (e.g. you want fewer clouds, more trees). Note that tweaking the weights in attention space gives much better results than editing the prompt embeddings, as the prompt embeddings are highly nonlinear and editing them will often break the image.
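To make "tweaking the weights in attention space" concrete, here is a minimal pure-Python sketch of one cross-attention lookup for a single image location. It is a toy and not the code from this repo: the names `cross_attention` and `token_weights`, the three-token prompt, and the choice to scale the attention probabilities without renormalizing are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values, token_weights=None):
    """One image-location query attending over prompt tokens.

    token_weights (hypothetical parameter) rescales each token's attention
    probability after the softmax; 1.0 leaves a token untouched.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    probs = softmax(scores)
    if token_weights is not None:
        probs = [p * w for p, w in zip(probs, token_weights)]
    out = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(len(values[0]))]
    return probs, out

tokens = ["sky", "clouds", "trees"]
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # every token matches equally
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one channel per token

base_probs, base_out = cross_attention(query, keys, values)
# "less clouds, more trees": halve the clouds token, double the trees token.
edit_probs, edit_out = cross_attention(query, keys, values, token_weights=[1.0, 0.5, 2.0])

for name, before, after in zip(tokens, base_out, edit_out):
    print(name, round(before, 3), "->", round(after, 3))
```

Because the weights act on the attention map rather than on the prompt embedding, the token's text representation stays exactly the same and only its share of the output changes.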

Second question: Yes, but not right now. What everyone is using as "img2img" is actually a crude approximation of the correct "inverse" process for the network (not to be confused with textual inversion). What we actually want for prompt editing is not to add random noise to an image, but to find the noise that will reconstruct our intended image and use that to modify our prompt or generate variations. I was hoping someone would have already implemented it, but I guess I can give it a try when I have more time.
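The difference between "add random noise" and "find the noise that reconstructs the image" can be seen on a one-dimensional toy problem. The sketch below assumes the data are scalars drawn from N(MU, S^2), for which the ideal denoiser is known in closed form, and integrates the deterministic sampling ODE with plain Euler steps. Every name in it (`derivative`, `euler`, `invert_then_sample`, `img2img`) is hypothetical, and nothing here touches Stable Diffusion itself.

```python
import random

MU, S = 0.0, 1.0  # toy "dataset": scalars distributed as N(MU, S^2)

def derivative(x, sigma):
    """dx/dsigma of the deterministic sampling ODE, using the exact denoiser for N(MU, S^2)."""
    denoised = MU + (S * S) / (S * S + sigma * sigma) * (x - MU)
    return (x - denoised) / sigma if sigma > 0 else 0.0

def euler(x, sigmas):
    """Explicit Euler along a sigma path (increasing = invert, decreasing = sample)."""
    for cur, nxt in zip(sigmas, sigmas[1:]):
        x = x + derivative(x, cur) * (nxt - cur)
    return x

def sigma_path(sigma_max, steps):
    """Evenly spaced noise levels from 0 up to sigma_max."""
    return [sigma_max * i / steps for i in range(steps + 1)]

def invert_then_sample(x0, sigma_max=2.0, steps=1000):
    up = sigma_path(sigma_max, steps)
    x_noisy = euler(x0, up)          # deterministic "noising" (inversion)
    return euler(x_noisy, up[::-1])  # ordinary sampling back down to sigma = 0

def img2img(x0, noise, sigma_max=2.0, steps=1000):
    x_noisy = x0 + sigma_max * noise  # random noising, as img2img does
    return euler(x_noisy, sigma_path(sigma_max, steps)[::-1])

x0 = 2.0
print("inversion round trip:", invert_then_sample(x0))
for seed in range(3):
    noise = random.Random(seed).gauss(0.0, 1.0)
    print("img2img, seed", seed, "->", img2img(x0, noise))
```

In this toy the inversion round trip lands close to the starting value, while the img2img result depends on whichever noise sample was drawn. That is the sense in which random noising only approximates the inverse process.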

Also, because Stable Diffusion is slightly different from what I guess was Imagen used in the paper, we have a second self-cross-attention layer, which can be controlled with an additional mask (not yet implemented). That means that if image inversion is implemented correctly, we could actually "inpaint" using the cross-attention layers themselves and modify the prompt. This should give much better results than simply masking out the image and adding random noise...

Regarding point 2 here, is this as simple as running a sampler "backwards"? I made a hacky attempt at modifying the k_euler sampler to run backwards, like so:

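    import torch
    from torch import autocast
    import k_diffusion as K

    # Assumed to exist already: `model` (the latent diffusion model), `denoiser`
    # (presumably a K.external.CompVisDenoiser wrapping it), `uncond` (the
    # empty-prompt conditioning) and `x` (the latent of the image to invert).
    s_in = x.new_ones([x.shape[0]])
    sigmas = denoiser.get_sigmas(50).flip(0)  # ascending: low noise -> high noise
    for i in range(1, len(sigmas)):
        x_in = torch.cat([x] * 1)  # a single copy: unconditional only, no CFG
        sigma_in = torch.cat([sigmas[i] * s_in] * 1)
        cond_in = torch.cat([uncond])
        c_out, c_in = [K.utils.append_dims(k, x_in.ndim) for k in denoiser.get_scalings(sigma_in)]
        t = denoiser.sigma_to_t(sigma_in)
        with autocast('cuda'):
            eps = model.apply_model(x_in * c_in, t, cond=cond_in)
        denoised = x_in + eps * c_out
        d = (x_in - denoised) / sigma_in  # Euler derivative dx/dsigma
        dt = sigmas[i] - sigmas[i - 1]  # positive, so each step adds noise
        x = x + d * dt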

...and indeed, if I run a txt2img with the output of this as the initial code (i.e. initial latent) I get something that looks a lot like (a somewhat blurry version of) the image I started with (i.e. input into the code above). Not sure if I did this right or if it just happens to "look right" because I added an insufficient amount of noise to the initial image (so that there's still a lot of it "left" in the output of the above code).

This might be how they did inversion in the DDIM paper, but I couldn't find the exact method except a vague description of the inverse process "by running the sampler backwards" just like you described.

Edit to quote the paper: "...can encode from x0 to xT (reverse of Eq. (14)) and reconstruct x0 from the resulting xT (forward of Eq. (14))", page 9 section 5.4

Played around with this a bit more – if I do the noising with 1000 steps (i.e. the number of training steps, instead of 50 above), I get an output which actually "looks like" random noise (and has a standard deviation of 14.6 ~= sigma[0]) but which if used as starting noise for an image generation (without any prompt conditioning and with around 50 sampling steps) actually recreates the original image pretty well (and it's not blurry as when I used 50 steps in the noising)!
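For context on the 14.6 figure: it matches the largest noise level of the training schedule. The sketch below recomputes the sigmas in pure Python, assuming the "scaled linear" beta schedule commonly used in Stable Diffusion v1 configs (beta_start 0.00085, beta_end 0.012, 1000 training steps). Those constants and the helper name `sd_v1_sigmas` are assumptions, not something stated in this thread.

```python
import math

def sd_v1_sigmas(beta_start=0.00085, beta_end=0.012, train_steps=1000):
    """Per-timestep noise levels sigma_t = sqrt((1 - abar_t) / abar_t), in ascending order.

    Assumes betas are linearly spaced in sqrt(beta) and then squared.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    sigmas, alpha_cumprod = [], 1.0
    for i in range(train_steps):
        beta = (lo + (hi - lo) * i / (train_steps - 1)) ** 2
        alpha_cumprod *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_cumprod) / alpha_cumprod))
    return sigmas

sigmas = sd_v1_sigmas()
sigma_min, sigma_max = sigmas[0], sigmas[-1]
print("sigma_min:", round(sigma_min, 4))
print("sigma_max:", round(sigma_max, 2))
```

Under these assumptions the largest sigma comes out at roughly 14.6, so a fully noised latent with a standard deviation near that value is what one would expect at the end of the schedule.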

Not sure why it's so blurry when I use only 50 steps instead of 1000 to noise it, I'd expect the sampler to be able to approximate the noise using just a few dozen steps roughly as well as it's able to approximate the image when run in the "normal direction". The standard deviation of the noise is only around 12.5 or so when I use 50 steps instead of 1000, so maybe I have an off-by-one error or something somewhere that results in too little noise being added.

Great, that's exactly what the authors observed in the DDIM paper! If you don't mind, feel free to set up a quick demo with maybe one or two examples and push it to GitHub; that would be super cool for everyone to use!

Edit: As for why 50 steps doesn't work as well, my guess is that the forward process uses many tricks for acceleration, while the inverse process has been pretty much neglected and not optimized (remember that the first paper on diffusion models also needed 1000 sampling steps for good results). So for now you need to perform the diffusion correctly (e.g. 1000 steps).
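One generic ingredient here is plain discretization error: a first-order Euler integrator drifts more per step when the steps are larger. The toy below measures that on a scalar Gaussian "dataset" N(MU, S^2) with a closed-form ideal denoiser. It is only an illustration of step-size error under those assumptions, not a diagnosis of the blur seen with Stable Diffusion, and `round_trip_error` is a hypothetical helper.

```python
MU, S = 0.0, 1.0  # toy "dataset": scalars distributed as N(MU, S^2)

def derivative(x, sigma):
    """Closed-form dx/dsigma of the deterministic sampling ODE for N(MU, S^2) data."""
    return sigma * (x - MU) / (S * S + sigma * sigma)

def round_trip_error(x0, steps, sigma_max=14.6):
    """Invert x0 up to sigma_max with Euler, sample back down, return |x0 - result|."""
    sigmas = [sigma_max * i / steps for i in range(steps + 1)]
    x = x0
    for cur, nxt in zip(sigmas, sigmas[1:]):  # inversion: noise level goes up
        x = x + derivative(x, cur) * (nxt - cur)
    rev = sigmas[::-1]
    for cur, nxt in zip(rev, rev[1:]):        # sampling: noise level goes down
        x = x + derivative(x, cur) * (nxt - cur)
    return abs(x0 - x)

for steps in (50, 200, 1000):
    print(steps, "steps -> round-trip error", round_trip_error(2.0, steps))
```

In this toy the round-trip error shrinks as the step count grows. Whether the same effect fully explains the 50-step versus 1000-step gap in the real model is not something this sketch can settle.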

Yeah, I'm generating a few examples now, and I'll post something in this subreddit and some code on GitHub later tonight. I haven't actually tried your cross attention control code yet; I'll have to do that as well and see how all this fits together. :)

I wonder if the inversion code could be used for style transfer like in https://github.com/justinpinkney/stable-diffusion . Take the clip1 embedding from image1, reconstruct noise1, take image 2, find clip2, and recreate from noise1 to get a style2 result. I've only just read about it, so I haven't thought it through, but the reconstruction idea seems very useful. I will think about it, but I'm not sure I'm up to the task of coding it up or trying it out myself.


## u/Zertofy Sep 10 '22

That's really awesome, but I have some questions.

What is needed for this to work? We have the initial prompt, resolution, seed, scale, steps, sampler, and of course the resulting image. Then we somehow fix the general composition and change the prompt, but leave everything else intact? So the most important elements are the prompt and the resulting image?

Can we take a non-generated picture, write some "original" prompt and associate the two with each other, then change the prompt and expect it to work? But what about all the other parameters...

Or is this what img2img will achieve?

Or maybe I'm completely wrong and it works in an entirely different way?