r/StableDiffusion 17d ago

A better (?) way of doing img2img by finding the noise which reconstructs the original image

825 Upvotes

151

u/Aqwis 17d ago edited 17d ago

I’ve made quite a few attempts at editing existing pictures with img2img. However, at low strengths the picture tends to be modified too little, while at high strengths it is modified in undesired ways. /u/bloc97 posted here about a better way of doing img2img that allows for more precise editing of existing pictures – by finding the noise that will cause SD to reconstruct the original image.

I made a quick attempt at reversing the k_euler sampler, and ended up with the code I posted in a reply to the post by bloc97 linked above. I’ve refined the code a bit and posted it on GitHub here:

link to code

If image is a PIL image and model is a LatentDiffusion object, then find_noise_for_image can be called like this:

noise_out = find_noise_for_image(model, image, 'Some prompt that accurately describes the image', steps=50, cond_scale=1.0)

The output noise tensor can then be used for image generation by using it as a “fixed code” (to use a term from the original SD scripts) – in other words, instead of generating a random noise tensor (and possibly adding that noise tensor to an image for img2img), you use the noise tensor generated by find_noise_for_image.
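To see why a deterministic sampler can be run backwards at all, here is a toy sketch of the idea (my own illustration, NOT the gist's code): with a linear stand-in for the model's denoiser, each Euler update is an explicit step, so the steps can be undone in reverse order to recover the exact starting noise.

```python
# Toy illustration (my own sketch, not the gist's code): a deterministic
# Euler sampler is a sequence of explicit update steps, so the steps can
# be undone in reverse order to recover the starting noise that
# reconstructs a given "image", then replayed forward.

SIGMAS = [10.0, 5.0, 2.0, 1.0, 0.5]  # toy noise schedule, high to low
C = 0.9  # toy "model": the denoised prediction is just C * x (linear)

def euler_step(x, sigma, sigma_next):
    d = (x - C * x) / sigma              # k_euler-style derivative
    return x + d * (sigma_next - sigma)  # one explicit Euler update

def sample(noise):
    x = noise
    for s, s_next in zip(SIGMAS, SIGMAS[1:]):
        x = euler_step(x, s, s_next)
    return x

def find_noise(image):
    # Undo each Euler step, last step first. For the linear toy model every
    # step is multiplication by a known factor, so inversion is division.
    x = image
    for s, s_next in reversed(list(zip(SIGMAS, SIGMAS[1:]))):
        x = x / (1 + (1 - C) * (s_next - s) / s)
    return x

image = 3.1415
noise = find_noise(image)
print(abs(sample(noise) - image) < 1e-9)  # True: the noise reconstructs it
```

With a real diffusion model the denoiser isn't linear, so the inversion is only approximate – which is presumably why the prompt and cond_scale used during noising matter.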

This method isn’t perfect – deviate too much from the prompt used when generating the noise tensor, and the generated images are going to start differing from the original image in unexpected ways. Some experimentation with the different parameters and making the prompt precise enough will probably be necessary to get this working. Still, for altering existing images in particular ways I’ve had way more success with this method than with standard img2img. I have yet to combine this with bloc97’s Prompt-to-Prompt Image Editing, but I’m guessing the combination will give even more control.

All suggestions for improvements/fixes are highly appreciated. I still have no idea what the best setting of cond_scale is, for example, and in general this is just a hack that I made without reading any of the theory on this topic.

Edit: By the way, the original image used in the example is from here and is the output of one of those old "this person does not exist" networks, I believe. I've tried it on other photos (including of myself :), so this works for "real" pictures as well. The prompt that I used when generating the noise tensor for this was "Photo of a smiling woman with brown hair".

74

u/GuavaDull8974 17d ago

This is spectacular! I made a feature request for it already on webui, do you think you can produce an actually working commit for it?

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/291

7

u/hefeglass 16d ago

It's been implemented by AUTOMATIC1111, but I can't seem to figure out how to use it. Anyone able to explain? I am trying to use the alternate img2img script.

14

u/jonesaid 16d ago

You go to the img2img tab, select the img2img alternative test in the scripts dropdown, put in an "original prompt" that describes the input image, and whatever you want to change in the regular prompt, CFG 2, Decode CFG 2, Decode steps 50, Euler sampler, upload an image, and click generate.

2

u/Plopdopdoop 15d ago edited 14d ago

So when I try those settings the output isn't anything close – not even recognizable objects in the resulting image (the original being 'picture of a man wearing a red shirt').

6

u/jonesaid 15d ago

It seems to be very sensitive to decode cfg and decode steps. I use decode cfg at about 2, and decode steps from 35-50. Make sure regular cfg is about 2 too.

2

u/BeanieBytes 14d ago

I'm also getting this issue. Does my denoising strength need to be altered?


1

u/2legsakimbo 16d ago

Hmm, that alternative test isn't showing up, even though I just force-updated by deleting the venv and repository folders.

I must have missed a step

2

u/tobboss1337 16d ago

You just deleted the additional Python repos and environment, so you returned to the state of the initial download, not the newest version. Did you pull the changes from Automatic's repo?

1

u/2legsakimbo 16d ago edited 16d ago

No, thank you for letting me know that I have to do that.


4

u/redboundary 17d ago

Isn't it the same as setting "masked content" to original in the img2img settings?

50

u/animemosquito 17d ago

no, this is finding which "seed" basically would lead to SD generating the original image, so that you are able to modify it in less destructive ways.

22

u/MattRix 17d ago

yep exactly! Though to be somewhat pedantic it’s not the seed, it’s the noise itself.

7

u/animemosquito 17d ago

Yeah, that's a good distinction to make. I'm trying to keep it accessible and less complicated, but it's important to note that the seed is what's used to produce the initial noise, which is then iteratively diffused into the final product.

5

u/Trainraider 16d ago

It's a really important distinction because there's a lot more potential entropy in the noise than in the seed. There may be a noise pattern that results in the image, but there probably isn't a seed that makes that specific noise pattern.

9

u/wildgurularry 16d ago

It's true... there are only 2^32 possible seeds, but almost 2^6291456 possible noise patterns for a 512x512 image.
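The arithmetic behind that exponent checks out (assuming 8 bits per channel for a 512x512 RGB image); note that SD's noise actually lives in the smaller 4x64x64 latent space, which still dwarfs the 32-bit seed space:

```python
# Checking the arithmetic (assumptions: a 32-bit seed and a 512x512 RGB
# image at 8 bits per channel; SD's noise actually lives in a 4x64x64
# float32 latent, which is smaller but still dwarfs the seed space).
image_bits = 512 * 512 * 3 * 8              # log2 of distinct 8-bit images
print(image_bits)                           # 6291456, the exponent above

latent_elems = 4 * (512 // 8) * (512 // 8)  # 16384 latent values
latent_bits = latent_elems * 32             # log2 of distinct float32 latents
print(latent_bits)                          # 524288, still vastly more than 32
```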

9

u/ldb477 14d ago

That’s at least double


6

u/almark 17d ago

this means we can keep the subject we like and alter it, move the model, poses, different things in the photo.

1

u/SkipperScupper 17d ago

... make perfecto hands, I'd hazard a guess

3

u/almark 16d ago

hands are floppy things - laughs

I still have nightmares from first glance in SD.


50

u/bloc97 17d ago

Awesome, I can't wait to combine this with cross attention control – this will let people edit an image however they want at any diffusion strength! No more img2img ignoring the initial image at high strengths. I will take a look at the code tomorrow...

Also I believe (and hope) that inpainting with this method with cross attention control would yield far superior results than simply masking out parts of an image and adding random noise. What a time to be alive!

6

u/enspiralart 16d ago

2 minute papers bump!

4

u/gxcells 17d ago

Then you will probably update your jupyter notebook with k diffusers?

4

u/bloc97 17d ago

The current version uses k-lms by default.

2

u/gxcells 17d ago

Ok, thanks a lot

9

u/no_witty_username 17d ago

God speed my man. This feature is gonna be massive.

13

u/ethereal_intellect 17d ago edited 17d ago

The prompt that I used when generating the noise tensor for this was "Photo of a smiling woman with brown hair".

Wait, so it takes the assumed prompt as input? What if you put in a wrong prompt, like a photo of a dog with brown hair – does the learned noise override the prompt and still draw a human face? I see u/JakeWilling asked basically the same thing. It could be interesting if "close enough" descriptions from the blip+clip system work

Edit: There's also https://github.com/justinpinkney/stable-diffusion this which uses image embeddings instead of text. Wonder if it would make the reconstructions more accurate? Tho at that point you got no variables left to control lol

Edit2: Style transfer with the above might be interesting, get clip image1, get noise seed, get clip image2 and run it on the same seed

2

u/2legsakimbo 16d ago

Edit: There's also https://github.com/justinpinkney/stable-diffusion this which uses image embeddings instead of text. Wonder if it would make the reconstructions more accurate? Tho at that point you got no variables left to control lol

this looks amazing

10

u/AUTOMATIC1111 17d ago

That last line in gist where you multiply by sigmas[-1] was completely destroying the picture. Don't know if you added it in jest or something but it took a lot to discover and fix it.

8

u/[deleted] 17d ago

[deleted]

3

u/ByteArrayInputStream 17d ago

Haven't tried it, but my guess would be that it wouldn't be able to find a seed that accurately resembles the original image

3

u/Doggettx 17d ago edited 17d ago

Very cool, definitely gonna have to play with this :)

Your example is missing a few things though, like pil_img_to_torch(), the tqdm import, and the collect_and_empty() function

I assume it's something like:

import gc
import torch

def collect_and_empty():
    gc.collect()
    torch.cuda.empty_cache()

5

u/Aqwis 17d ago

Sorry, I went and added pil_img_to_torch to the gist now! I removed collect_and_empty a couple of hours ago as it was slowing things down and the VRAM issue mysteriously vanished.

2

u/rservello 17d ago

Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

thoughts on this error now?

1

u/Etiennera 17d ago

Did you halve your model to save vram?

1

u/rservello 17d ago

I did. But I tried at full and get the same error.

3

u/Inevitable_Impact_46 17d ago

I'm guessing:

def pil_img_to_torch(img, half=True):
    img = img.convert('RGB')
    img = torch.tensor(np.array(img)).permute(2, 0, 1).float()
    if half:
        img = img.half()
    return img

1

u/rservello 17d ago

I'm getting an error, pil_img_to_torch not defined. Do you know how to fix this?

3

u/backafterdeleting 16d ago

I wonder how the prompt you use for reversing the noise affects how you can alter the image by changing the prompt, before getting an unrecognisable image.

E.g: You used "photo of a smiling woman with brown hair"

but if you just used "photo of a smiling woman" and got the noise for that prompt, and then added "with blue hair", would it be a worse result?

Or if you added "in the park on a sunny day" could you then more easily change it to, "on a rainy day"?

3

u/Aqwis 16d ago

Yes, you're exactly right – when I made the examples I first used the noising prompt "photo of a smiling woman" and got inconsistent results when generating images with "...with X hair" added to the prompt. After adding "...with brown hair" to the noising prompt the results improved significantly.

On the other hand, for other pictures I've had the most success noising them with a CFG scale (cond_scale) setting of 0.0, which means that the prompt used when noising should have no impact whatsoever. In those cases I've often been able to use prompts like "photo of a woman with brown hair" in image generation despite that!

It's hard to conclude anything besides this method being quite inconsistent both in terms of how well it works and which settings lead to the best results. As mentioned I hope that combining this with prompt-to-prompt image editing can lead to more consistent results.

3

u/rservello 17d ago edited 17d ago

What does this return? A seed value? If it produces a latent image or noise sample that needs to be inserted, where is that done? Can you provide more info on how to actually use this?

2

u/dagerdev 17d ago

The output noise tensor can then be used for image generation

This could be an ignorant question, I hope not. But can this output noise tensor be translated back to an image? That would help a lot to visualize it.

2

u/starstruckmon 17d ago

Yes. Just run it through the decoder. I'm pretty curious what it looks like too.

1

u/jfoisdfbjc218 17d ago

I'm trying to run this script after copying it to my script folder, but it keeps telling me there's "No module named 'k_diffusion'". How do I install this module? I'm kind of a noob.

2

u/ParanoidConfidence 17d ago

I don't know the answer, but this has been discussed before, maybe in this link lies the answer for you?

https://www.reddit.com/r/StableDiffusion/comments/ww31wr/modulenotfounderror_no_module_named_k_diffusion/

1

u/WASasquatch 16d ago

Maybe your k_diffusion is under the folder "k-diffusion" like mine. I had to change it to k-diffusion.k_diffusion

1

u/EmbarrassedHelp 17d ago

I love how your example image appears to be a StyleGAN 2 rendering, instead of a real stock photo.

1

u/summerstay 16d ago

This is cool! Once I find the noise vector for a starting image, how do I then generate a new version of the starting image with a revised prompt? I don't see the code for that. Or, if it is a simple modification to txt2img.py or img2img.py, maybe you can just explain what I need to do.

1

u/the_dev_man 16d ago

Where do I get the model variable from? Can someone make a working Colab example, just with this feature?


57

u/sassydodo 17d ago

You should summon hlky and automatic in this thread, or do a pull request of this into their webUI repos – that would be much better from the user experience side

I think I've seen some work in either hlky or auto's repo that mentioned cross attention control

45

u/MarvelsMidnightMoms 17d ago

Automatic1111 has been so on the ball with updates to his fork these past 2 weeks+. Just today he added "Interrogate" in his img2img tab, which is img2prompt.

Yesterday, or the day before, he added prompt "presets" to save time on retyping your most commonly used terms.

Hlky's activity has died down quite a bit which is a bit unfortunate. His was the first webui fork that I tried.

26

u/Itsalwayssummerbitch 17d ago

Hlky's is essentially going through a whole remake in streamlit UI, it should be much better than before and be easier to add things to it in the future, but it's going to take a week or two to get it out of dev stage.

The gradio version is only getting bugfixes btw, no new features as far as I'm aware.

Either way feel free to add it in the discussion section of the repo 😅

10

u/hsoj95 17d ago

^ This!

We are still looking for features to add, and I'm gonna send a link to this to the discord for Hlky's fork.

2

u/ImeniSottoITreni 17d ago

Automatic1111 has been so on the ball with updates to his fork these past 2 weeks+. Just today he added "Interrogate" in his img2img tab, which is img2prompt.

Can you please give me some more info and compare about hlky and automatic?
I thought they were 2 dead repos. I mean, they put out their thing – hlky with the webui and AUTOMATIC1111 with the outpainting stuff – and that was it.

I went so far as to make a pull request to the neonsecret repo to add the webui, and he accepted to merge hlky's webui, which is basically a fork that allows you to make high-res images with low VRAM

But I'm losing a bit of grip on all the news. Can you please tell me what we have now? And what's new with hlky and the others?

2

u/matahitam 16d ago

You might want to use the dev branch for bleeding edge in hlky (rebased to sd-webui). There's also a Discord, link is in the readme if I'm not mistaken.

2

u/matahitam 16d ago

Adding discord link here for reference. https://discord.gg/frwNB7XV

1

u/ImeniSottoITreni 16d ago

Thanks I will!

6

u/sassydodo 17d ago

yeah, I'd go with auto's version, but hlky has got all the fancy upscalers like GoBIG and also it doesn't crash as much as auto's. Tho I'm still on auto's Friday version, so it might have been fixed already.

4

u/halr9000 17d ago

Hlky is switching to streamlit but it seems features are still going into both branches. GoBig is sweet! I think auto added something similar called sd-upscale but I haven't tried it yet.

12

u/AUTOMATIC1111 17d ago

I added sd upscale, and afterwards hlky specifically copied my sd upscale code and added it as gobig

1

u/th3Raziel 16d ago

I just want to say hlky himself didn't do it – I did. I saw your implementation and used it (and txt2imghd) as the base for GoBig in the hlky fork. I'm not sure why this is so forbidden, as large parts of the hlky fork are already copied code from your repo, so I didn't even think twice about utilizing it.

I also added LDSR to the hlky fork which I modified from the original repo and created the image lab tab etc.

To be clear, I'd rather add stuff to your repo but I approached you on the SD discord and you said you'll likely not merge PRs that aren't made by you and that originally hlky PR'd a feature to your repo which you rejected which in turn prompted him to make his own fork.

It's too bad there's all this useless drama around the different UIs, it just creates a lot of confusion.

3

u/AUTOMATIC1111 16d ago

If I'm remembering correctly, I said that I won't accept big reworks unless we decide on them beforehand. I'm accepting a fair amount of code from different people.

The 'feature' I rejected was a change that would save all pictures in jpeg format for everyone.

1

u/StickiStickman 16d ago

But why change the name then? Huh.

2

u/th3Raziel 16d ago

I changed the name to GoBig as it's the original name for this approach.

2

u/AUTOMATIC1111 16d ago

To make it less obvious to user that he copied it.

1

u/halr9000 16d ago

Well, if true that's not cool. Should be relatively easy to prove by looking at commits. But the UIs are definitely diverging, so there's original work being done to some extent. Sorry if there's some bad behavior going on though.

3

u/Itsalwayssummerbitch 16d ago

It's funny you mention the commits, they DO seem to tell a different story. The funniest part is that the code Automatic1111 used for the sd upscale was originally called "txt2imghd", and was a port of someone else's work, which was called GoBig :)

https://github.com/jquesnelle/txt2imghd The link was literally in the Auto's code's comments.

Seriously though, ffs, this is open source, can we not just be decent humans and work together? I don't get all this drama, it's not Middle school 🙃

5

u/AUTOMATIC1111 16d ago

I credit the person who made txt2imghd, both in the comments and in the main readme credits section, for the idea.

I also did not take a single line of his code.

The decision to not work with me was on hlky, he was the one who forked my repo.

You're free to link the different story in commits because I do not see it.


1

u/TiagoTiagoT 17d ago

Are the two projects different enough they can't be merged?

16

u/jansteffen 17d ago

The hlky one actually started as a fork of the automatic1111 UI, but that was basically on day 1 of SD release and they've both changed a ton since then, with major reworks and refactors, sometimes even implementing the same functionality in different ways. I don't think merging them would be possible at this point, it'd be a lot easier to just cherry pick features that one has that the other one doesn't and weave that code into the existing code base.

1

u/ts4m8r 17d ago

How do you install new versions of webui if you already have an old one installed?

3

u/sassydodo 16d ago

I mean, "installed" just means put in a folder with the models placed in it; everything else is in a virtual environment. You can just download the new version, or use git – in that case you git clone once, and git pull every time you think there's a worthy update

1

u/matahitam 16d ago

Often it's as simple as performing git pull. Let me know in sd-webui discord if you need more details. https://discord.gg/frwNB7XV

2

u/manueslapera 15d ago

that's a shame, I'd rather manage a Python environment (hlky ui) than have to install .NET just to use automatic's

3

u/Dogmaster 17d ago

And he still hasn't fixed the masking bug causing deepfrying, the commit is waiting :(

53

u/gxcells 17d ago

That's just incredible, you unlocked the next generation of Photoshop. I can't describe how crazy this last month has been since the SD release. I wish I had studied coding so I could participate in all of this.

17

u/Caldoe 17d ago

haha just wait for a few weeks, people are already coming out with GUI for normal people

It won't take long

7

u/ExponentialCookie 17d ago

It's never too late. There are more resources now than ever.

1

u/Still_Jicama1319 15d ago

Is Python enough to understand all these terminologies?

2

u/ExponentialCookie 15d ago

At a high level, it's a must to understand how the applications are built. Beyond that, linear algebra is pretty much a prerequisite for building out the neural networks. Understanding the jargon isn't too hard, but the implementation is the hard part.

27

u/entrep 17d ago

4

u/kaliber91 17d ago

Is there a simple way to update from the previous version to the newest on PC, or do we need to go through the installation process from the start?

10

u/-takeyourmeds 17d ago

literally download the repo zip and extract it into the main folder, saying yes to overwrite all

1

u/Limitlez 16d ago

Were you able to figure out how to use the script in webui? I was able to run it, but could never find the seed.

6

u/Dogmaster 17d ago

You can use a tool like Beyond Compare: check both folders and just merge the files changed from the old revision

I use that for "updating" my working repos

2

u/kaliber91 17d ago

thanks, worked

7

u/ExponentialCookie 17d ago

On Linux, a simple git pull in the install directory works for me. I can't speak on Windows install.

7

u/justhitmidlife 17d ago

Should work on windows as well.

5

u/an0maly33 17d ago

Yep, just git pull on windows as well, assuming one cloned the repo initially.

2

u/jonesaid 17d ago

I can't wait to try this! Now, just gotta get Automatic's repo working without CUDA OOM errors...

2

u/Scriptman777 16d ago

You can try adding the --medvram parameter or even the low one. It will be a lot slower, but it works with MUCH less VRAM. Also try to keep the images small.

1

u/jonesaid 15d ago

Yeah, I tried that, and it was about 2x slower. I think I had a package issue (maybe with PyTorch) that was causing the oom problems. Once I fixed that, automatic's repo worked without the optimizations.

16

u/Adreitz7 17d ago

This is great! I like to see these innovations that dive into the inner workings of SD. This looks like a powerful feature. In your example mosaic, is the second image meant to be the base reconstruction, and the following images modifications of it? I’m asking because the second image looks most like the first, but I noticed that it is more vivid — the saturation has increased. It’s a minor thing here, but could cause problems if it is a general effect of your technique. Any idea why this happened?

21

u/Aqwis 17d ago

Yeah, the second image is basically the base reconstruction. In general, converting an image to its latent representation and then back again to an image is going to lose a little bit of information, so that the two images won't be identical, but in most cases they will be very close. However, in this case I think the difference in contrast is caused by what happens at the very end of find_noise_for_image, namely:

return (x / x.std()) * sigmas[-1]

This basically has the effect of increasing the contrast. It shouldn't be necessary, but if I don't do this then in many cases the resulting noise tensor will have a significantly lower standard deviation than a normal noise tensor, and if used to generate an image the generated image will be a blurry mess. It's quite possible the need to do this is caused by some sort of bug that I haven't discovered.
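The effect of that final line can be sketched on plain numbers (my own illustration; the 14.6 is just a stand-in for a typical largest sigma, not a value from the gist):

```python
# Numeric sketch of the final rescale: dividing by the current std and
# multiplying by a target sigma forces the noise to the std the sampler
# expects. The 14.6 is an illustrative stand-in, not a value from the gist.
import math
import random

random.seed(0)
x = [random.gauss(0, 0.3) for _ in range(100_000)]  # "noise" with too-low std

def std(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((e - m) ** 2 for e in v) / len(v))

target = 14.6                           # plays the role of sigmas[-1]
s = std(x)
rescaled = [e / s * target for e in x]  # the gist's x / x.std() * sigmas[-1]

print(abs(std(rescaled) - target) < 1e-6)  # True: std now matches the target
```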

13

u/Adreitz7 17d ago

It’s also fascinating how perfect the reconstruction is. The biggest changes I can see are the shape of the right eyebrow and the profile of the left cheek. Your technique reproduced individual strands of hair!

15

u/Aqwis 17d ago

It's very likely that the reconstruction isn't actually as good as it could be – I used 50 sampler steps to create the noise tensor for this example and 50 to generate each of the images from the noise tensor, but I'd previously noticed that the reconstructions seemed to be even better if I used a few hundred sampler steps to create the noise tensor.

13

u/jonesaid 17d ago

Hmm, I wonder if this would have made my work on the Van Gogh photo recreation much easier.

Starting from his 1887 self-portrait as input image, I struggled with getting a very painted look like the original at low denoising strength, or a completely different person at higher strengths. I wanted to keep the composition of person basically the same, while changing just the style of the image. I wanted to tell SD to make this painting input in the style of a studio photograph. Using weights in the prompt helped somewhat (e.g. "studio photograph :0.8").

Would your technique help with that kind of restyling?

13

u/HarisTarkos 17d ago

Wow, with my very little comprehension of the mechanics of diffusion, I didn't think it was possible to do such a "renoising" (I thought it was a bit like finding the original content from a hash). This feels like an absolute killer feature...

6

u/starstruckmon 17d ago

Your thought wasn't completely wrong. What you're getting here is more like an init image than noise. Even if the image was a generated one, you'd need the exact same prompt (and some of the other variables) used during generation to get actual gaussian noise, or even close.

Since those are not available, and the prompt is guessed, what's happening here can be conceptualized more as (essence of that picture) - (essence of that guessed prompt). So the init image (actually latents) you're left with after this process has all the concepts of the photo that are not in the prompt "photo of a smiling woman with brown hair", i.e. composition, background etc.

Now what that init image (if converted from latents to an image) looks like, and whether it's even comprehensible as such by the human brain, I'm not sure. It would be fascinating to see.

2

u/Bitflip01 14d ago

Am I understanding correctly that in this case the init image replaces the seed?

11

u/Aqwis 17d ago edited 17d ago

Made a few incremental updates to the Gist over the past few hours. Happy to see that a few SD forks/UIs are implementing something like this – they're better situated than me to make something that's useable by non-coders. :)

It seems that the results are quite often best when cond_scale is set to 0.0 – exactly why this is, I don't know. If anyone has an idea, I would love an explanation. With cond_scale at zero, the given prompt has no effect.

In the meantime, I've got to see my share of extremely creepy pictures while experimenting with other cond_scales. Run this on a portrait with cond_scale set to 5.0 and use the resulting noise to generate a picture (also with scale > 2.0) ... or don't. I wouldn't advise doing so personally, especially if you have a superstitious bent. (Or maybe you're going to get completely different results than I got, who knows?)

5

u/protestor 17d ago

Happy to see that a few SD forks/UIs are implementing something like this – they're better situated than me to make something that's useable by non-coders. :)

There's this https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/9c48383608850a1e151985e814a593291a69196b but shouldn't you be listed as the author? (in that commit, https://github.com/AUTOMATIC1111 is the author)

2

u/NotModusPonens 17d ago

In what way are the pictures creepy?

5

u/Aqwis 17d ago

To be a bit vague: a combination of "photos" of very seriously messed up human-like figures and "drawings" of symbols that if they meant anything would have been the equivalent of these messages for the human psyche.

2

u/NotModusPonens 17d ago

Ooof.

... we'll soon have to disable images in social media and email by default in order to avoid being "trolled" by someone with one of these, won't we?

3

u/Lirezh 13d ago

Anyone with Photoshop has been able to troll you for more than a decade already; it does not seem to be a big concern.


2

u/gxcells 17d ago

I am using the automatic1111 implementation of your code. It is really difficult to get the prompt to have an effect when generating a new image (a hair color change, or adding a helmet, for example). Often it changes the whole face etc.

1

u/Limitlez 16d ago

Are you using it through webui? If so, how do you use it? I can't seem to figure it out

2

u/gxcells 16d ago

You use this colab https://colab.research.google.com/drive/1Iy-xW9t1-OQWhb0hNxueGij8phCyluOh Then in img2img tab, at the bottom you can find a dropdown menu for scripts, just use the script "img2imgalternate"

1

u/thedarkzeno 16d ago

https://colab.research.google.com/drive/1Iy-xW9t1-OQWhb0hNxueGij8phCyluOh

got an error:

Loading model [e3b0c442] from /content/stable-diffusion-webui/model.ckpt
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-3-75bc94f91c1d> in <module>
2 sys.argv = ['webui.py', "--share", "--opt-split-attention"]
3
----> 4 import webui
5 webui.webui()
3 frames
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
918 "functionality.")
919
--> 920 magic_number = pickle_module.load(f, **pickle_load_args)
921 if magic_number != MAGIC_NUMBER:
922 raise RuntimeError("Invalid magic number; corrupt file?")
EOFError: Ran out of input

10

u/ExponentialCookie 17d ago

This seems to be a very similar method to RePaint.

5

u/LetterRip 17d ago edited 17d ago

You are right it does do that for the unmasked part.

31

u/no_witty_username 17d ago

This is huge. The ability to find a latent space representation of the original image in the SD model opens up soooo many opportunities. This needs to be implemented in every repo – I see it becoming a standard feature.

2

u/Fazaman 16d ago

The ability to find a latent space representation of the original image in SD model

So... uh... what does this mean for us that aren't as deep into the weeds as half the people on this sub seem to be?

16

u/HorrorExpress 17d ago

I've been following bloc97's posts, while trying (slowly) to learn how this all works.

I just wanted to tip my hat to you both for the work you're doing.

I'm finding Stable Diffusion, as is, isn't remotely able to do what you've both started to do with it. I've had much frustration with how changing the color prompt for one part of the image changes it for other elements. Your example – like bloc's – looks awesome.

Keep up the great work.

6

u/tinman_inacan 16d ago

Can you provide a bit of a technical explanation of how to apply this technique?

Automatic1111 has implemented your code on the webui project, and I've been trying it out. It works perfectly for recreating the image, but I can't seem to figure out how to actually do anything with it. It just comes out looking exactly the same - overbaked - no matter how I mess with the settings or prompt.

Still, absolutely incredible that you threw this together, especially without reading the theory behind it first!

3

u/Daralima 16d ago

That's odd, especially that the settings have no effect. Are you changing the original prompt window, perhaps? I've found that that has no effect whatsoever, even when left empty – you need to change the regular prompt, if you aren't doing so already. Using your original prompt, or a prompt that makes sense given the image (or alternatively the clip interrogator output), as a base in the normal prompt window seems to work well: I used the exact same prompts as in the image of this post along with the original image and got nearly identical results to the author.

This is my experience with the overbaking issue, but since you say that changing the settings does nothing, I'm not sure it'll help in your case:

there seems to be a strong interplay between the decode settings and the regular sampling step count: setting the decode CFG scale all the way down to 0.1 and the decode steps up to 150 seems to fully fix the overbaking when also combined with a somewhat unusually low sampling step count; 10-20 seemed to work in the case I first tried (and seems to work as a general rule for other attempts I've made). But these settings do not seem to work universally:

sometimes setting the CFG scale too low seems to remove certain details, so experimenting with values between 0.1 and 1 is worthwhile if certain things are missing or look off (assuming those things are of consequence). And while decode steps seem to always decrease the level of overbake, it does not always seem to result in something closer to the original, and in a couple cases it made some weird changes instead.
I'd recommend starting with 0.1 decode CFG and 150 decode steps, a low sampling step count, and an empty prompt, to make sure the image recreation goes as hoped and you're really close to the original without much (or any) overbaking. If that doesn't yield good results, decrease or increase one of them by a fairly large amount. Once you've got the image you want, you can either add the whole prompt like in this post and edit that, or add keywords, which seems to give a similar effect.
Hope this is coherent enough to be somewhat helpful if you haven't figured it out by now!

If the author sees this comment, please correct anything that doesn't add up as I've figured all this out through experimentation and know nothing about the underlying code.

2

u/tinman_inacan 16d ago

Thank you so much for your detailed response! With the help of your advice and a lot of trial and error, I think I have it working now. I'm still having trouble with overbaking, but at least I have some idea of what's going on. I think I was just confused about which prompts do what, which settings are disabled, how the sliders affect each other, etc.

At least I got some neat trippy pyramids out of it lol.

4

u/WASasquatch 17d ago

This is pretty awesome, man. I'm wondering if this is possible with regular diffusers? Or is this something special with k-diffusion?

3

u/LetterRip 17d ago

It should likely work for most samplers that are deterministic.

1

u/WASasquatch 17d ago

I guess my real question is "I don't understand the implementation, how do I implement it?" like a newb. Is the noise_out overriding some variable for diffusion?

4

u/AnOnlineHandle 17d ago

This is another incredible development.

3

u/[deleted] 17d ago edited 17d ago

[deleted]

9

u/borntopz8 17d ago

I guess the development of this feature is still in an early state, but I managed to get the first results:

  • Upload an image in img2img.

  • Interrogate to obtain the prompt (this gives me a low VRAM error but still generates the prompt, which you'll find on top).

  • Under scripts, use "img2img alternative test" with the prompt you obtained (check https://github.com/AUTOMATIC1111/stable-diffusion-webui in the img2imgalt section for the parameters; they are very strict for now).

  • Generate, and you should get an output very similar to your original image.

  • If you change your main prompt now (still running the script with the previously obtained prompt), you should be able to modify the image while keeping most of the details.

3

u/Z3ROCOOL22 17d ago

I don't understand this part: "interrogate to obtain the prompt". Where do you do that?

4

u/borntopz8 17d ago edited 16d ago

Speaking about automatic1111 and his webui: in the img2img tab you should see a button to generate and a button to interrogate. If not, update to the latest version, because they are making changes by the minute.

1

u/Z3ROCOOL22 16d ago

Yeah, I figured it out now, thx.

1

u/gxcells 17d ago

It works well to regenerate the original, but I could not make a change in the prompt without completely changing the picture (a portrait).

5

u/borntopz8 17d ago

If you regenerate the original and change the main prompt (keeping the img2imgalt script on the original prompt the interrogation gave you), you should be able to get less "destructive" results.
Applying a style works well, but a targeted edit, let's say changing shirt color or hair color, still comes out either too similar to or too far from the image.

The implementation is in a very early state; the most I can do is keep my fingers crossed, since I don't know much about coding and I rely heavily on repos and webuis.

1

u/gxcells 16d ago

Thanks, I'll try this and play around also with different source images tonight

4

u/Dark_Alchemist 12d ago

Try as I might, I never could get this to work. I tried a dog wearing a collar with a bell, and it changed the colour of the dog and made its big floppy ears into flowers. If you can't get it to work before adjusting, it will never be right, and at 3 minutes per attempt I can't waste attempts.

3

u/GuavaDull8974 17d ago

Can you upscale with it somehow? By synthesizing neighbouring pixels?

3

u/crischu 17d ago

Would it be possible to get a seed from the noise?

8

u/Aqwis 17d ago

Probably not: the set of possible seeds can only generate a small fraction of the possible noise matrices. If you want to share a noise matrix with someone else, though, the matrix itself can be saved and shared as a file.
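If someone does want to pass a noise matrix around, it really is just an array of floats. A minimal sketch, shown with NumPy for illustration; for an actual torch tensor, torch.save and torch.load do the same job, and the (4, 64, 64) shape here is an assumption matching SD's latent size for a 512x512 image:

```python
import numpy as np

# Stand-in for the tensor returned by find_noise_for_image.
noise_out = np.random.randn(4, 64, 64).astype(np.float32)

np.save("noise_out.npy", noise_out)   # write to disk for sharing
restored = np.load("noise_out.npy")   # reload on the receiving end

assert np.array_equal(noise_out, restored)  # lossless round trip
```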

3

u/Adreitz7 17d ago

How large is the noise matrix in comparison with the generated image? If you have to transmit a 512x512x8x8x8 (RGB) matrix to generate a 512x512 image, it would be better just to transmit the final image, especially considering that, for most normal images, lossless compression can reduce the size by a factor of two or more, while the noise matrix will likely be incompressible.

2

u/muchcharles 17d ago

Isn't the noise in latent space? 64x64x3 (bytes? floats?)

1

u/Adreitz7 16d ago

But isn’t the latent space on the order of 800,000,000 parameters? That is even larger than a 512x512 image.

1

u/muchcharles 16d ago

Since latent diffusion operates on a low dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes (3, 64, 64) in latent space, which requires 8 × 8 = 64 times less memory.

https://huggingface.co/blog/stable_diffusion
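The quoted reduction is easy to sanity-check. A back-of-the-envelope sketch, counting elements only and using the (3, 64, 64) latent shape from the quote (note the released SD VAE actually uses 4 latent channels, which changes the ratio somewhat):

```python
# Element counts for a 512x512 RGB image vs. its 8x-downsampled latent.
pixel_elems = 3 * 512 * 512   # pixel space: 786,432 values
latent_elems = 3 * 64 * 64    # latent space: 12,288 values (factor 8 per side)

ratio = pixel_elems // latent_elems
print(ratio)  # → 64
```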

3

u/AnOnlineHandle 17d ago

Any idea if this would work with embeddings from textual inversion as part of the prompt?

3

u/use_excalidraw 15d ago

I made a tutorial on how to actually use this locally (with the AUTOMATIC repo) https://youtu.be/_CtguxhezlE

8

u/i_have_chosen_a_name 17d ago

Wait, if it can find the latent space representation of the original image, does that not mean every single combination of 512x512 pixels is present in the data set? How is that possible? Surely the latent space only contains an approximation, no?

Also, I’m blown away at the development speed of this after being open sourced. Google's Imagen and OpenAI's DALL-E 2 will never be able to compete with the open source fine tuning you can get from a couple million dev monkeys all fucking around with the code and model.

3

u/StickiStickman 16d ago

Surely the latent space only contains an approximation, no?

Obviously, that's literally what he said though?

You also seem to have a bit of a fundamental misunderstanding how it works:

Wait if it can find the latent space representation of the original image does that not mean every single combination of 512x512 pixel is present in the data set?

It wouldn't mean that at all. It's not just copy pasting images from its dataset.

2

u/NerdyRodent 17d ago

Very nice!

2

u/[deleted] 17d ago

[deleted]

6

u/External_Quarter 17d ago

Automatic just got it working in his web UI. I would expect to see it there pretty soon!

2

u/hyperedge 17d ago

Looks great!

2

u/rservello 17d ago

pil_image_to_torch is not defined. Can you please update with fix?

3

u/Aqwis 17d ago

Added it now.

2

u/rservello 17d ago

Thank you :)

2

u/PTKen 17d ago

Looks like a fantastic tool! I wish I could try it. I still can't run this locally. Is anyone interested in putting this into a Colab Notebook?

6

u/ExponentialCookie 17d ago

It's just been implemented in AUTOMATIC1111's webui. Link here, instructions at this anchor.

3

u/PTKen 17d ago

Thanks for the link, but please correct me if I'm wrong. This is a web UI but you still need to have it installed locally. I cannot install it locally, so I am running it in Colab Notebooks for now.

3

u/cpc2 17d ago

Colab notebooks are local installs, just in a remote machine that you access through colab. https://colab.research.google.com/drive/1Iy-xW9t1-OQWhb0hNxueGij8phCyluOh this is the colab linked in automatic1111's github.

2

u/ExponentialCookie 17d ago

Sorry for misunderstanding. That is correct, but you can get it to work in a colab notebook if you're willing to set it up.

2

u/PTKen 17d ago

No problem I appreciate the reply.

Well, it's a bit beyond me to figure out how to set up a Colab Notebook right now. That's why I was asking if anyone else was up to the task! :)

1

u/MysteryInc152 17d ago edited 17d ago

Hey !

So it's actually pretty easy to set up a colab notebook. Way easier than installing it locally.

A colab is basically text and media + code. Once you realize that, it all comes together. To run a snippet of code, you simply press the play button next to it.

Basically, because it's text + code, colab notebooks are made to be run in order.

The only input coming from you is pressing the play buttons in the correct order. And remember, the order has already been laid out for you. So essentially: press the first one, scroll a bit, press the second one, etc.

This site walks you through it

https://gigazine.net/gsc_news/en/20220907-automatic1111-stable-diffusion-webui#2

Honestly, the only aspect that doesn't go like that is setting up a Hugging Face account, but the site walks you through that as well. And it's something you only do once.

2

u/no_witty_username 17d ago

I messed around with it in automatic and couldn't get it to work.

2

u/TheSkyWaver 17d ago

An idea I've had for a long while, but never really thought that much about, is an image "compression" algorithm that uses some sort of image generation model: take a specific seed (previously derived from a preexisting image) and recreate that image from the seed alone, thereby effectively compressing the image far smaller than would ever be possible through conventional image compression.

This is basically that, except without any real compression benefit, due to the size and energy cost of actually running it, but with the added ability to seamlessly edit any aspect of the image.

2

u/Adreitz7 17d ago

You have to keep in mind that you need to add the size of the generating software to get a good comparison, especially when that software is not widespread compared to, e.g., Zip or JPEG. Since SD is multiple gigabytes, well… But considering that it could conceivably generate most (all?) images this way and that Emad said on Twitter that he thinks the weights could be reduced to about 100MB, this might become more practical, though very compute-intensive.

On that note, I would be interested to see someone throw a whole corpus of images at this technique to see if there is anything that it cannot generate well.

2

u/starstruckmon 17d ago

The encoder and decoder (from pixel space to latent space) used in SD can already be used for this. You're not getting any more compression through this method.

The "noise" generated in this process is not gaussian noise that you can turn into a seed. It's a whole init image (in the form of latents) that needs to be transmitted.

So unlike the first method, where you only send the latents, in this method you send the latents + the prompt and also have to do a bunch of computation at the receiving end to create the image through diffusion instead of just running it through the decoder.

1

u/PerryDahlia 17d ago

that’s true, but the trade off works the wrong way given the current resource landscape. storage and bandwidth are cheap compared to gpu time and energy.

1

u/2022_06_15 17d ago

I think useful variations of that idea are upscaling and in/outpainting.

You could make an image physically smaller in pixels and then seamlessly blow it up at the endpoint in a plausible and reliable way.

You could make an image with gaps and then get an algorithm to fill them in, effectively sending a scaffold for a particular image to be built upon/around. img2img could probably work even better than that: you could just send a low res source image (or, if you want to be particularly crafty, a vector that can be rasterised) and then fill in all the detail at the client end.

Of course, the part I'm really hanging out for is when this tech is ported to 3D. The requirement for complex and generative geometry is going to explode over the next couple of years, and if we use today's authoring technology the amount of data that will have to be pushed to the endpoints will make your eyes water. We can easily increase processing speed and storage footprint at rates we cannot comparably do for data transmission. That's going to be the next major bottleneck.

2

u/thomasblomquist 17d ago

If I’m to understand this correctly, you found a method to identify the correct "noise" that, when used with an "appropriate" prompt, will recreate the image somewhat faithfully. Then, by tweaking the prompt while keeping the identified noise, it will modify the corresponding attribute in the image?!????!!!!!!

That’s some insanity, and is amazing for what it is able to do. We’re in the future

2

u/Aumanidol 16d ago

Did anyone manage to get good results with AUTOMATIC implementation? My workflow is as follows:

  • I upload a picture

  • select "img2img alternative test"

  • select Euler (not Euler a)

  • hit interrogate

  • paste the found prompt into the "original prompt" box

  • change something in the prompt (the one on top of the page) and hit generate.

Results so far have been terrible, especially with faces.

I've read that better results were attained lowering "CFG scale" to 0.0 (this UI doesn't allow for that and I have no access to the terminal for a couple of days), but lowering it to 1 doesn't seem to be doing anything good.

Did anyone manage to get good results with AUTOMATIC implementation?

I've messed around with the decode parameters but nothing good came out of it either.

1

u/Aumanidol 16d ago

worth mentioning: the prompt produced with the interrogate button on the very same picture used above is the following "a woman smiling and holding a cell phone in her hand and a cell phone in her other hand with a picture of a woman on it, by Adélaïde Labille-Guiard"

am I using the wrong implementation?

1

u/wildgurularry 16d ago

Did you wind up getting anything working? Just playing around with it now, and the results are not quite as great as I expected. Of course, if I use the image of the woman posted above I get amazing results... but any of my own pictures that I have tried are failing miserably, unless they are just a fully cropped face.

2

u/enspiralart 16d ago

This is exactly what was missing, thanks so much! I am going to include it in my video2video implementation.

2

u/jaywv1981 16d ago edited 16d ago

Are you able to use this in the Automatic1111 colab or only locally? I ran the colab but don't see an option for it.

EDIT: Nevermind, I see it now at the bottom under scripts.

1

u/the_dev_man 16d ago

can i know where u found it?

2

u/RogueStargun 16d ago

What parameters did you use to prevent the network from altering the original appearance of the woman in the base prompt?

2

u/PervasiveUncertainty 15d ago

I spent the last few hours trying to reproduce this but couldn't get the requested changes incorporated into the picture. I used Michelangelo's sculpture of David; he's looking to his left in the original, and I couldn't get him to look straight into the camera.

Can you share the exact full settings you've used for the picture you've posted? Thanks in advance

2

u/Many-Ad-6225 13d ago

I have an error when I try to use "img2img alternative" Please help :( the error : "TypeError: expected Tensor as element 0 in argument 0, but got ScheduledPromptBatch"


2

u/kmullinax77 12d ago

I can't get this to work even a little bit.

I am using Automatic1111's webUI and have followed the explicit settings on his GitHub site as well as u/use_excalidraw's great YouTube video. I get nothing except the original photo, but a little overbaked.

Does anyone have any ideas why this may be happening?


1

u/flamingheads 17d ago

Mad props for figuring this out. It's so incredible to see all the development gushing so rapidly out of the community around this tech.

1

u/Sillainface 17d ago

Really interesting!

1

u/Hoppss 17d ago edited 16d ago

I've been working on how to do this as well, thank you for your insights!

1

u/IrreverentHippie 17d ago

Being able to use something from my previous generation in my next Generation would be awesome

1

u/BrandonSimpsons 17d ago

So this might be a dumb idea, but let's say you have two images (image A and image B).

You use this technique to back out noise tensors (noise A and noise B) which will generate close approximations of image A and image B when given the same prompt (prompt P).

Can we interpolate between noise A and noise B, and feed these intermediate noises into stable diffusion with prompt P, and morph between image A and image B?
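For Gaussian-like noise, spherical interpolation (slerp) is the usual choice for this kind of walk, since plain lerp pulls intermediate points toward a smaller norm. A hedged sketch, assuming SD's (4, 64, 64) latent noise shape and using NumPy for illustration:

```python
import numpy as np

def slerp(t, a, b):
    """Spherical interpolation between two noise tensors of the same shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two tensors, treated as high-dimensional vectors.
    omega = np.arccos(np.clip(
        np.dot(a_flat / np.linalg.norm(a_flat),
               b_flat / np.linalg.norm(b_flat)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * a + t * b  # nearly parallel: fall back to lerp
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

noise_a = np.random.randn(4, 64, 64)
noise_b = np.random.randn(4, 64, 64)

# Ten intermediate noises; each would be fed to the sampler with prompt P.
frames = [slerp(t, noise_a, noise_b) for t in np.linspace(0.0, 1.0, 10)]
```

Whether the resulting images morph smoothly is exactly the open question here; the math only guarantees the noises stay at a plausible norm, not that the decoded images interpolate nicely.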

1

u/ExponentialCookie 17d ago

I don't see why not. Given a latent representation of an image, you should be able to latent walk through as many of them as you wish.

1

u/BrandonSimpsons 17d ago

I guess my question is more 'is the space organized enough for this to work feasibly', which probably can only be found experimentally.

1

u/[deleted] 16d ago

[deleted]

1

u/BrandonSimpsons 16d ago

oh yeah artbreeder is great, and being able to have similar tools with SD would be fantastic

1

u/fransis790 17d ago

Good, congratulations

1

u/RogueStargun 17d ago

This is incredible. I've been struggling with getting img2img to work to my satisfaction. I've been aiming to reverse a self portrait I painted many years ago into a photograph. I'll look into this!

1

u/tanreb 16d ago

How to execute “image variations” with this?

1

u/GuavaDull8974 16d ago

This already works in AUTOMATIC1111 webui! Under scripts img2img

1

u/ChocolateFit9026 16d ago

I'm eager to try this with video2video. So far, I've done some good ones just with regular img2img and a for loop going through every frame of a video. I wish there was an editable colab for this so I could try it. Do you know of any img2img colab that has a k_euler sampler so I could try this code?
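The frame-by-frame loop described here is easy to sketch. Everything below is illustrative: `run_img2img` is a hypothetical placeholder for whatever img2img pipeline call you actually use; the point is only the loop structure (the same prompt, strength, and seed for every frame tends to help consistency between frames):

```python
def run_img2img(frame, prompt, strength, seed):
    # Placeholder: the real img2img pipeline call would go here.
    return f"stylized::{frame}"

# Stand-in for frame filenames extracted from a video (e.g. with ffmpeg).
frames = [f"frame_{i:04d}.png" for i in range(5)]

# Same settings for every frame, fixed seed for frame-to-frame consistency.
outputs = [run_img2img(f, "watercolor style", strength=0.4, seed=1234)
           for f in frames]

print(outputs[0])  # → stylized::frame_0000.png
```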