r/StableDiffusion Sep 11 '22

A better (?) way of doing img2img by finding the noise which reconstructs the original image Img2Img

889 Upvotes


157

u/Aqwis Sep 11 '22 edited Sep 11 '22

I’ve made quite a few attempts at editing existing pictures with img2img. However, at low strengths the pictures tend to be modified too little, while at high strengths the picture is modified in undesired ways. /u/bloc97 posted here about a better way of doing img2img that would allow for more precise editing of existing pictures – by finding the noise that will cause SD to reconstruct the original image.

I made a quick attempt at reversing the k_euler sampler, and ended up with the code I posted in a reply to the post by bloc97 linked above. I’ve refined the code a bit and posted it on GitHub here:

link to code

If image is a PIL image and model is a LatentDiffusion object, then find_noise_for_image can be called like this:

noise_out = find_noise_for_image(model, image, 'Some prompt that accurately describes the image', steps=50, cond_scale=1.0)

The output noise tensor can then be used for image generation by using it as a “fixed code” (to use a term from the original SD scripts) – in other words, instead of generating a random noise tensor (and possibly adding that noise tensor to an image for img2img), you use the noise tensor generated by find_noise_for_image.
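For intuition, here is a toy sketch of what "reversing the Euler sampler" and then reusing the result as a fixed code means. It is not the linked code: the "image" is four numbers, the model is replaced by a closed-form denoiser for Gaussian data, and every name (`denoise`, `euler_step`, `find_noise`, `sample`) and constant is an assumption made for illustration.

```python
# Toy illustration only: a 4-number "image" and a closed-form stand-in for the model.
MU, S = 0.5, 1.0                     # assumed toy data distribution N(MU, S^2)
SIGMA_MIN, SIGMA_MAX = 0.03, 14.6    # roughly the noise range SD v1 uses (assumption)

def denoise(x, sigma):
    """Ideal denoiser for Gaussian data; plays the role of the diffusion model."""
    w = S * S / (S * S + sigma * sigma)
    return [MU + w * (v - MU) for v in x]

def euler_step(x, sigma_from, sigma_to):
    """One Euler step along the noise schedule; works in either direction."""
    denoised = denoise(x, sigma_from)
    d = [(v - dn) / sigma_from for v, dn in zip(x, denoised)]
    return [v + dv * (sigma_to - sigma_from) for v, dv in zip(x, d)]

def make_sigmas(steps):
    """Linearly spaced noise levels from SIGMA_MIN up to SIGMA_MAX."""
    return [SIGMA_MIN + (SIGMA_MAX - SIGMA_MIN) * i / steps for i in range(steps + 1)]

def find_noise(image, steps):
    """Run the sampler backwards: from the (nearly clean) image up to noise."""
    sigmas = make_sigmas(steps)
    x = list(image)
    for lo, hi in zip(sigmas[:-1], sigmas[1:]):
        x = euler_step(x, lo, hi)
    return x

def sample(noise, steps):
    """Ordinary sampling: start from the given noise and step down to the image."""
    sigmas = make_sigmas(steps)[::-1]
    x = list(noise)
    for hi, lo in zip(sigmas[:-1], sigmas[1:]):
        x = euler_step(x, hi, lo)
    return x

def roundtrip_error(image, steps):
    """Largest per-value difference between the image and its reconstruction."""
    recon = sample(find_noise(image, steps), steps)
    return max(abs(a - b) for a, b in zip(image, recon))

image = [0.1, 0.9, -0.3, 1.7]
print(roundtrip_error(image, 50), roundtrip_error(image, 1000))
```

Passing the output of `find_noise` to `sample` is the toy equivalent of using the noise as a fixed code. The reconstruction is only approximate, and in this toy the error shrinks as the step count goes up, since each Euler step is itself an approximation.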

This method isn’t perfect – deviate too much from the prompt used when generating the noise tensor, and the generated images are going to start differing from the original image in unexpected ways. Some experimentation with the different parameters and making the prompt precise enough will probably be necessary to get this working. Still, for altering existing images in particular ways I’ve had way more success with this method than with standard img2img. I have yet to combine this with bloc97’s Prompt-to-Prompt Image Editing, but I’m guessing the combination will give even more control.

All suggestions for improvements/fixes are highly appreciated. I still have no idea what the best setting of cond_scale is, for example, and in general this is just a hack that I made without reading any of the theory on this topic.

Edit: By the way, the original image used in the example is from here and is the output of one of those old "this person does not exist" networks, I believe. I've tried it on other photos (including of myself :), so this works for "real" pictures as well. The prompt that I used when generating the noise tensor for this was "Photo of a smiling woman with brown hair".

75

u/GuavaDull8974 Sep 11 '22

This is spectacular! I already made a feature request for it on webui. Do you think you can produce an actual working commit for it?

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/291

13

u/hefeglass Sep 12 '22

It's been implemented by AUTOMATIC1111, but I can't seem to figure out how to use it. Anyone able to explain? I am trying to use the alternate img2img script.

19

u/jonesaid Sep 12 '22

You go to the img2img tab, select "img2img alternative test" in the scripts dropdown, put in an "original prompt" that describes the input image, put whatever you want to change in the regular prompt, set CFG 2, Decode CFG 2, Decode steps 50 and the Euler sampler, then upload an image and click generate.

2

u/Plopdopdoop Sep 13 '22 edited Sep 14 '22

So when I try those settings the output isn't anything close, like not even recognizable objects in the resulting image (original being 'picture of man wearing a red shirt').

8

u/jonesaid Sep 13 '22

It seems to be very sensitive to decode cfg and decode steps. I use decode cfg at about 2, and decode steps from 35-50. Make sure regular cfg is about 2 too.

2

u/BeanieBytes Sep 14 '22

I'm also getting this issue. does my denoising strength need to be altered?

2

u/Breadisgood4eat Sep 14 '22

I had an older install and just copied this new repo over the top and was getting the same issue. I reinstalled from scratch and now it's working.

1

u/2legsakimbo Sep 13 '22

hhmm that alternative test isn't showing up, even though I just force-updated by deleting the venv and repositories folders.

I must have missed a step

3

u/tobboss1337 Sep 13 '22

You just deleted the additional python repos and environment. So you returned to the state of initial downloading but not the newest version. Did you pull the changes from Automatic's repo?

1

u/2legsakimbo Sep 13 '22 edited Sep 13 '22

no, thank you for letting me know that i have to do that.

1

u/feelosofee Sep 16 '22

What about denoising strength? Also, Euler or Euler_a ?

2

u/jonesaid Sep 16 '22

Euler, denoising strength around 0.2
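To collect the settings reported across the replies above in one place, here they are as a plain Python dictionary. The key names are made up for readability and are not webui parameter names; the values are only what was reported as working in this thread, not verified defaults.

```python
# Settings as reported in this thread. Key names are hypothetical, chosen for readability.
alt_img2img_settings = {
    "script": "img2img alternative test",
    "sampler": "Euler",
    "cfg_scale": 2,                # regular CFG, "about 2"
    "decode_cfg_scale": 2,         # Decode CFG, "about 2"
    "decode_steps_range": (35, 50),
    "denoising_strength": 0.2,     # "around 0.2"
}

for name, value in alt_img2img_settings.items():
    print(f"{name}: {value}")
```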

5

u/redboundary Sep 11 '22

Isn't it the same as setting "masked content" to original in the img2img settings?

51

u/animemosquito Sep 11 '22

no, this is finding which "seed" basically would lead to SD generating the original image, so that you are able to modify it in less destructive ways.

23

u/MattRix Sep 11 '22

yep exactly! Though to be somewhat pedantic it’s not the seed, it’s the noise itself.

7

u/animemosquito Sep 11 '22

Yeah that's a good distinction to make, I'm trying to make it accessible and less complicated, but it's important to make the distinction that the seed is what is used to produce the initial noise, which is used to diffuse / iterate on to get to a final product
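A small sketch of that distinction, using Python's standard `random` module as a stand-in for the real noise generator (the function name `noise_from_seed` is made up for this example): the seed only selects which noise gets generated, and the noise is what the sampler actually works on.

```python
import random

def noise_from_seed(seed, n=8):
    """Generate n Gaussian noise values from a seed (stand-in for the real latent noise)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = noise_from_seed(42)
same_b = noise_from_seed(42)   # same seed, so the same noise
other = noise_from_seed(43)    # different seed, so different noise

print(same_a == same_b, same_a == other)
```

A noise tensor found by reversing the sampler is just a list of numbers like these, and nothing guarantees that any seed would generate exactly that list.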

5

u/Trainraider Sep 12 '22

It's a really important distinction because there's a lot more potential entropy in the noise than in the seed. There may be a noise pattern that results in the image, but there probably isn't a seed that makes that specific noise pattern.

9

u/wildgurularry Sep 12 '22

It's true... there are only 2^32 possible seeds, but almost 2^6291456 possible noise patterns for a 512x512 image.
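The arithmetic behind those numbers can be checked directly. The exponent 6291456 matches a 512x512 RGB image at 8 bits per channel. As an aside, and as an assumption about SD v1 rather than something stated in this thread, the noise actually lives in a 4x64x64 latent, which gives a smaller but still enormous count.

```python
seed_bits = 32                                  # 2^32 possible seeds

# 512x512 pixels, 3 colour channels, 8 bits per channel.
pixel_bits = 512 * 512 * 3 * 8

# Assumed SD v1 latent for a 512x512 image: 4 channels at 1/8 resolution, 32-bit floats.
latent_values = 4 * (512 // 8) * (512 // 8)
latent_bits = latent_values * 32

print(pixel_bits, latent_values, latent_bits)
```

Either way the exponent is in the hundreds of thousands or more, against 32 for the seed.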

11

u/ldb477 Sep 14 '22

That’s at least double

1

u/Lirezh Sep 15 '22


There might be countless noise patterns mathematically, but not in practice. The vast majority of those patterns will certainly result in identical images, which is also true for the 2^32 seed variations: a lot of them are probably going to show the same result.

6

u/almark Sep 12 '22

this means we can keep the subject we like and alter it, move the model, poses, different things in the photo.

1

u/[deleted] Sep 12 '22

... make perfecto hands, I'd hazard a guess

3

u/almark Sep 12 '22

hands are floppy things - laughs

I still have nightmares from first glance in SD.

-1

u/ImeniSottoITreni Sep 12 '22


Isn't this the repo with outpainting? Why merge it here and not in the original webui repo?

11

u/AUTOMATIC1111 Sep 12 '22

That is the original web ui repo.

51

u/bloc97 Sep 11 '22

Awesome, I can't wait to combine this with cross attention control, this will actually allow people to edit an image however they want at any diffusion strength! No more img2img ignoring the initial image at high strengths. I will take a look at the code tomorrow...

Also I believe (and hope) that inpainting with this method with cross attention control would yield far superior results than simply masking out parts of an image and adding random noise. What a time to be alive!

7

u/enspiralart Sep 12 '22

2 minute papers bump!

3

u/gxcells Sep 11 '22

Then you will probably update your jupyter notebook with k diffusers?

7

u/bloc97 Sep 11 '22

The current version uses k-lms by default.

2

u/gxcells Sep 12 '22

Ok, thanks a lot

9

u/no_witty_username Sep 11 '22

God speed my man. This feature is gonna be massive.

12

u/ethereal_intellect Sep 11 '22 edited Sep 11 '22

The prompt that I used when generating the noise tensor for this was "Photo of a smiling woman with brown hair".

Wait, so it takes the assumed prompt as input? What if you put in a wrong prompt, like a photo of a dog with brown hair? Does the learned noise overwrite the prompt and still draw a human face? I see u/JakeWilling asked basically the same too. It would/could be interesting if "close enough" descriptions from the blip+clip system work

Edit: There's also https://github.com/justinpinkney/stable-diffusion which uses image embeddings instead of text. Wonder if it would make the reconstructions more accurate? Tho at that point you got no variables left to control lol

Edit2: Style transfer with the above might be interesting, get clip image1, get noise seed, get clip image2 and run it on the same seed

2

u/2legsakimbo Sep 13 '22

Edit: There's also https://github.com/justinpinkney/stable-diffusion which uses image embeddings instead of text. Wonder if it would make the reconstructions more accurate? Tho at that point you got no variables left to control lol

this looks amazing

9

u/AUTOMATIC1111 Sep 12 '22

That last line in the gist where you multiply by sigmas[-1] was completely destroying the picture. Don't know if you added it in jest or something but it took a lot to discover and fix it.

10

u/[deleted] Sep 11 '22

[deleted]

3

u/ByteArrayInputStream Sep 11 '22

Haven't tried it, but my guess would be that it wouldn't be able to find a seed that accurately resembles the original image

3

u/Doggettx Sep 11 '22 edited Sep 11 '22

Very cool, definitely gonna have to play with this :)

Your example is missing a few things though, like pil_img_to_torch(), the tqdm import and the collect_and_empty() function

I assume it's something like:

import gc
import torch

def collect_and_empty():
    gc.collect()              # free unreferenced Python objects
    torch.cuda.empty_cache()  # release cached CUDA memory

6

u/Aqwis Sep 12 '22

Sorry, I went and added pil_img_to_torch to the gist now! I removed collect_and_empty a couple of hours ago as it was slowing things down and the VRAM issue mysteriously vanished.

2

u/rservello Sep 12 '22

Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

thoughts on this error now?

1

u/Etiennera Sep 12 '22

Did you halve your model to save vram?

1

u/rservello Sep 12 '22

I did. But I tried at full and get the same error.

3

u/Inevitable_Impact_46 Sep 11 '22

I'm guessing:

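import numpy as np
import torch

def pil_img_to_torch(img, half=True):
    # Convert a PIL image to a (3, H, W) tensor; values stay in the 0-255 range.
    img = img.convert('RGB')
    img = torch.tensor(np.array(img)).permute(2, 0, 1).float()
    if half:
        img = img.half()  # only useful if the model weights are half precision too
    return img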

1

u/rservello Sep 11 '22

I'm getting an error, pil_img_to_torch not defined. Do you know how to fix this?

3

u/backafterdeleting Sep 12 '22

I wonder how the prompt you use for reversing the noise affects how you can alter the image by changing the prompt, before getting an unrecognisable image.

E.g: You used "photo of a smiling woman with brown hair"

but if you just used "photo of a smiling woman" and got the noise for that prompt, and then added "with blue hair", would it be a worse result?

Or if you added "in the park on a sunny day" could you then more easily change it to, "on a rainy day"?

3

u/Aqwis Sep 12 '22

Yes, you're exactly right – when I made the examples I first used the noising prompt "photo of a smiling woman" and got inconsistent results when generating images with "...with X hair" added to the prompt. After adding "...with brown hair" to the noising prompt the results improved significantly.

On the other hand, for other pictures I've had the most success noising them with a CFG scale (cond_scale) setting of 0.0, which means that the prompt used when noising should have no impact whatsoever. In those cases I've often been able to use prompts like "photo of a woman with brown hair" in image generation despite that!
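Why a cond_scale of 0.0 removes the prompt's influence can be seen from the usual classifier-free guidance formula. This is a sketch assuming the noising code mixes the unconditional and prompt-conditioned predictions in that standard way; the function name `guided` and the toy numbers are made up.

```python
def guided(uncond, cond, cond_scale):
    """Classifier-free guidance: start at the unconditional prediction and move
    towards the prompt-conditioned one by a factor of cond_scale."""
    return [u + (c - u) * cond_scale for u, c in zip(uncond, cond)]

uncond = [0.25, -1.0, 0.5]      # prediction with an empty prompt
cond_a = [1.0, 0.0, 0.5]        # prediction with prompt A
cond_b = [-3.0, 2.0, 4.0]       # prediction with a very different prompt B

# At scale 0 both prompts give the same result: the unconditional prediction.
print(guided(uncond, cond_a, 0.0) == guided(uncond, cond_b, 0.0))
# At scale 1 the result is exactly the prompt-conditioned prediction.
print(guided(uncond, cond_a, 1.0) == cond_a)
```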

It's hard to conclude anything besides this method being quite inconsistent both in terms of how well it works and which settings lead to the best results. As mentioned I hope that combining this with prompt-to-prompt image editing can lead to more consistent results.

2

u/rservello Sep 11 '22 edited Sep 11 '22

What does this return? A seed value? If it produces a latent image or noise sample that needs to be inserted, where is that done? Can you provide more info on how to actually use this?

2

u/dagerdev Sep 11 '22

The output noise tensor can then be used for image generation

This could be an ignorant question, I hope not. But can this output noise tensor be translated back to an image? That would help a lot to visualize it.

2

u/starstruckmon Sep 11 '22

Yes. Just run it through the decoder. I'm pretty curious what it looks like too.

1

u/jfoisdfbjc218 Sep 11 '22

I'm trying to run this script after copying it to my script folder, but it keeps telling me there's "No module named 'k_diffusion'". How do I install this module? I'm kind of a noob.

2

u/ParanoidConfidence Sep 11 '22

I don't know the answer, but this has been discussed before, maybe in this link lies the answer for you?

https://www.reddit.com/r/StableDiffusion/comments/ww31wr/modulenotfounderror_no_module_named_k_diffusion/

1

u/WASasquatch Sep 13 '22

Maybe your k_diffusion is under the folder "k-diffusion" like mine. I had to change to k-diffusion.k_diffusion
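One way to handle that layout, sketched under the assumption that the package lives at k-diffusion/k_diffusion/ relative to where the script runs: a hyphen is not allowed in a Python import statement, so instead of changing the import, the outer folder can be added to the import path. The helper name `add_repo_to_path` is made up for this example.

```python
import sys
from pathlib import Path

def add_repo_to_path(folder):
    """Put `folder` at the front of sys.path so packages inside it can be imported."""
    path = str(Path(folder).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

# Folder name taken from the comment above; adjust it to your own checkout.
repo_path = add_repo_to_path("k-diffusion")

# After this, `import k_diffusion` should resolve if k-diffusion/k_diffusion/ exists.
print(repo_path)
```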

1

u/EmbarrassedHelp Sep 12 '22

I love how your example image appears to be a StyleGAN 2 rendering, instead of a real stock photo.

1

u/summerstay Sep 12 '22

This is cool! Once I find the noise vector for a starting image, how do I then generate a new version of the starting image with a revised prompt? I don't see the code for that. Or, if it is a simple modification to txt2img.py or img2img.py, maybe you can just explain what I need to do.

1

u/the_dev_man Sep 12 '22

Where do I get the model variable from? Can someone make a working Colab example, just with this feature?

1

u/mflux Sep 14 '22

Is there any way to use this with the command line version of img2img?

Do you have a colab example of it working with the standard stable diffusion colab notebook?