r/StableDiffusion Sep 10 '22

Prompt-to-Prompt Image Editing with Cross Attention Control in Stable Diffusion


u/Zertofy Sep 10 '22

That's really awesome, but I want to ask some questions.

What is needed for this to work? We have the initial prompt, resolution, seed, scale, steps, sampler, and of course the resulting image. Then we somehow fix the general composition and change the prompt, but leave everything else intact? So the most important elements are the prompt and the resulting image?

Can we take a non-generated picture, write some "original" prompt and associate the two, then change the prompt and expect it to work? But what about all the other parameters...

Or is this what will be achieved in img2img?

Or maybe I'm completely wrong and it works in an absolutely different way?


u/bloc97 Sep 10 '22

First question: Yes, right now the control mechanisms are really basic: you have an initial prompt (that you can generate to see what the image looks like), then a second prompt that is an edit of the first. The algorithm will generate an image for your second prompt so that it looks as "close" as possible to the first (with the concept of closeness being encoded inside the network). You can also tweak the weight of each token to reduce or increase its contribution to the final image (e.g., you want fewer clouds, more trees). Note that tweaking the weights in attention space gives much better results than editing the prompt embeddings, as the prompt embeddings are highly nonlinear and editing them will often break the image.
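
To make the token-weight idea a bit more concrete, here is a minimal sketch in plain Python, not the actual implementation. It assumes a single image-side query attending over three prompt tokens, and it assumes the per-token weights are simply multiplied onto the attention probabilities after the softmax. The names `token_attention` and `token_weights` and the toy key vectors are made up for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_attention(query, keys, token_weights=None):
    """Attention probabilities of one image query over the prompt tokens.

    token_weights (hypothetical knob): one multiplier per token, applied
    in attention space, i.e. after the softmax, not to the embeddings.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    probs = softmax(scores)
    if token_weights is not None:
        probs = [p * w for p, w in zip(probs, token_weights)]
    return probs

# Toy prompt with one made-up key vector per token.
tokens = ["sky", "clouds", "trees"]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query = [0.2, 0.8]

base = token_attention(query, keys)
# Fewer clouds, more trees: scale those two tokens, leave "sky" alone.
edited = token_attention(query, keys, token_weights=[1.0, 0.5, 1.5])

for name, before, after in zip(tokens, base, edited):
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

In a real model the same scaling would be applied to every query position in every cross-attention layer. Whether to renormalize the weighted probabilities afterwards is a design choice this sketch leaves open.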

Second question: Yes, but not right now. What everyone is using as "img2img" is actually a crude approximation of the correct "inverse" process for the network (not to be confused with textual inversion). What we actually want for prompt editing is not to add random noise to an image but to find the noise that will reconstruct our intended image, and then use that noise to modify our prompt or generate variations. I was hoping someone would have already implemented it, but I guess I can give it a try when I have more time.
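
As a rough illustration of the difference, the sketch below contrasts the img2img-style starting point (image plus random noise) with a deterministic inversion that runs a DDIM-style update backwards. The DDIM-style update is my own choice of example here, and everything else is a toy assumption: the "image" is a single number, `toy_eps` stands in for the real noise-prediction network, and `make_alphas` is a made-up schedule.

```python
import math
import random

def make_alphas(num_steps, alpha_min=0.05):
    # Hypothetical schedule: signal level goes from 1.0 (clean) to alpha_min (noisy).
    theta_max = math.acos(math.sqrt(alpha_min))
    return [math.cos(theta_max * i / num_steps) ** 2 for i in range(num_steps + 1)]

def toy_eps(x):
    # Stand-in for the network's noise prediction (assumption: a simple linear map).
    return 0.1 * x

def ddim_step(x, a_from, a_to, eps):
    # Deterministic DDIM-style update between two signal levels.
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps

def invert(image, alphas):
    # Walk clean -> noisy, reusing the prediction made at the current point.
    x = image
    for i in range(len(alphas) - 1):
        x = ddim_step(x, alphas[i], alphas[i + 1], toy_eps(x))
    return x

def generate(latent, alphas):
    # Walk noisy -> clean with the same deterministic update.
    x = latent
    for i in reversed(range(len(alphas) - 1)):
        x = ddim_step(x, alphas[i + 1], alphas[i], toy_eps(x))
    return x

def img2img_start(image, alpha_last, rng):
    # What img2img does instead: mix the image with fresh random noise.
    return math.sqrt(alpha_last) * image + math.sqrt(1.0 - alpha_last) * rng.gauss(0.0, 1.0)

image = 1.0
alphas = make_alphas(50)

latent = invert(image, alphas)
roundtrip = generate(latent, alphas)
noisy_start = img2img_start(image, alphas[-1], random.Random(0))
from_noise = generate(noisy_start, alphas)

print("inversion round trip error:", abs(roundtrip - image))
print("img2img round trip error:  ", abs(from_noise - image))
```

The inversion is only approximate, because each backward step reuses the noise prediction from the less noisy point, but the error shrinks as the number of steps grows. The img2img result depends on whichever random noise was drawn, so it has no reason to land back on the original image.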

Also, because Stable Diffusion is slightly different from the model used in the paper (Imagen, I guess), we have a second self-cross-attention layer, which can be controlled by using an additional mask (not implemented yet). That means that if image inversion is implemented correctly, we could actually "inpaint" using the cross-attention layers themselves and modify the prompt. This should give us much better results than simply masking out the image and adding random noise...

Exciting times ahead!