r/StableDiffusion Sep 10 '22

Prompt-to-Prompt Image Editing with Cross Attention Control in Stable Diffusion




u/Zertofy Sep 10 '22

Cool! Also, does it take the same time to generate as a usual image? Probably yes, but just to be sure. Some time ago I saw a post here about video editing, and one of the problems was the lack of consistency between frames. I proposed using the same seed, but that gave only partial results. Could this technology be the missing element for that?

Anyway, it's really exciting to see how people explore and upgrade SD in real time. Wish you success, I guess.


u/bloc97 Sep 10 '22

It is slightly slower: instead of 2 U-Net calls per step, we need 3 to handle the edited prompt. For video, I'm not sure this can achieve temporal consistency, as the latent space is far too nonlinear; even with cross-attention control you don't always get exactly the same results (e.g. backgrounds, trees, and rocks might change shape when you are editing the sky). I think hybrid methods (that are not purely end-to-end) will be the way forward for video generation (e.g. augmenting Stable Diffusion with depth prediction and motion vector generation).
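To make the "2 vs. 3 U-Net calls" point concrete, here is a minimal toy sketch (not the real model or the paper's actual code; `unet` is a counting stand-in): a normal classifier-free-guidance step evaluates the U-Net for the unconditional and conditional embeddings, while a prompt-to-prompt editing step adds a third pass for the edited prompt, whose cross-attention maps are, in the real method, injected from the original prompt's pass.

```python
# Toy sketch: count U-Net evaluations per denoising step.
# "unet" is a stand-in, not a real network; we only count calls.

call_count = 0

def unet(latent, prompt_embedding):
    """Stand-in for the denoising U-Net; returns a fake noise prediction."""
    global call_count
    call_count += 1
    return latent  # placeholder value

def cfg_step(latent, cond, guidance_scale=7.5):
    """Standard classifier-free guidance: unconditional + conditional pass."""
    eps_uncond = unet(latent, None)
    eps_cond = unet(latent, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def p2p_step(latent, cond, edited_cond, guidance_scale=7.5):
    """Prompt-to-prompt edit: one extra pass for the edited prompt.
    (In the real method, this pass reuses attention maps recorded
    during the source-prompt pass.)"""
    eps_uncond = unet(latent, None)
    eps_src = unet(latent, cond)         # source prompt, attention recorded
    eps_edit = unet(latent, edited_cond) # edited prompt, attention injected
    return eps_uncond + guidance_scale * (eps_edit - eps_uncond)

latent = 0.0
cfg_step(latent, "a photo of a cat")
print(call_count)  # 2 calls for a normal guided step
p2p_step(latent, "a photo of a cat", "a photo of a dog")
print(call_count)  # 5 total: the editing step needed 3
```

So per step the editing path costs roughly 1.5x the U-Net compute of plain guided sampling, which matches the "slightly slower" observation above.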


u/enspiralart Sep 12 '22

How do you think that augmentation should be approached? For instance, a secondary network that feeds depth and motion prediction vectors into the U-Net, which could be used to change the initial latents so that successive frames are generated from roughly the same image latent, with the motion vectors warping that image? Or how else?


u/bloc97 Sep 12 '22

I mean, some specific use cases, such as animating faces, image fly-throughs, and depth map generation for novel view synthesis, already exist. To generate video we probably need some new kind of diffusion architecture that can generate temporally coherent images, with training data taken from YouTube, Wikimedia Commons, etc. But I don't think our consumer GPUs are powerful enough to run such a model.


u/enspiralart Sep 12 '22

There's an amazing conversation going on about it in the LAION Discord group, video-clip.

This is from that group: https://twitter.com/_akhaliq/status/1557154530290290688

> Maciek — 08/10/2022: ok so they basically do what we've already done more thoroughly. Architecture is practically the same as well: "we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder". This is just this: https://github.com/LAION-AI/temporal-embedding-aggregation/blob/master/src/aggregation/cross_attention_pool.py They also just do action recognition, but they do it on K400, which is easier. I guess all the more evidence that this approach works.
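The quoted idea (a learned query token cross-attending over per-frame CLIP features to pool a single video-level embedding) can be sketched in a few lines of numpy. This is an illustrative single-head version with made-up shapes and weight names, not the linked repo's actual code:

```python
import numpy as np

# Sketch: pool per-frame CLIP image features into one video embedding
# via cross-attention from a single learned query token.
rng = np.random.default_rng(0)
num_frames, dim = 8, 64

frame_feats = rng.normal(size=(num_frames, dim))  # CLIP feature per frame
query = rng.normal(size=(1, dim))                 # learned query token
W_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # projection weights
W_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
W_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(query, feats):
    """One cross-attention step: the query attends over frame features."""
    q = query @ W_q                          # (1, dim)
    k = feats @ W_k                          # (num_frames, dim)
    v = feats @ W_v                          # (num_frames, dim)
    attn = softmax(q @ k.T / np.sqrt(dim))   # (1, num_frames) frame weights
    return attn @ v                          # (1, dim) pooled video embedding

video_embedding = cross_attention_pool(query, frame_feats)
print(video_embedding.shape)  # (1, 64)
```

The attention weights act as a learned, content-dependent average over frames, which is what distinguishes this from simple mean pooling of frame features.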

LAION Discord video-clip group: https://discord.com/channels/823813159592001537/966432607183175730