Is there any paid or free tool we can use to generate AI images from product images that we upload? For example, let's say we have a cap and I want to generate an image of a human wearing that same cap.
The basic idea is that you find a rare, unused token (say, f$#sdafad) and then fine-tune your image generation model on a specific set of images (say, 20 images of your red cap at various angles) while telling it that f$#sdafad is the same thing as your red cap.
Then you can start prompting "f$#sdafad resting on the head of a monkey" and your cap will appear on a monkey's head.
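Roughly, the generation side could look like this with the diffusers library; a minimal sketch, assuming the fine-tuned weights were saved to a local directory (the path is a placeholder, and note that real runs usually pick a short tokenizer-friendly string like "sks" rather than punctuation, which gets split into several tokens):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the checkpoint produced by the fine-tuning run (hypothetical local path)
pipe = StableDiffusionPipeline.from_pretrained(
    "./finetuned-red-cap-model", torch_dtype=torch.float16
).to("cuda")

# Prompt with the rare token the model was taught during fine-tuning
image = pipe("f$#sdafad resting on the head of a monkey").images[0]
image.save("cap_on_monkey.png")
```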
The problem with this technique is the fine-tuning part. Fine-tuning can take minutes to hours depending on how many GPUs you have, and it needs to be done separately for every new "token" you want to map to a specific person or object you're adding to your pre-trained model.
Another strategy is some kind of autocropping plus generative infill. You can take a semantic segmentation model like Meta's "Segment Anything", use it to segment out the item of interest manually (perhaps a UI could be built to make this a one-step process), then take the mask and do a generative infill with an image generation model like Stable Diffusion.
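For illustration, a rough sketch of that pipeline with the segment_anything package and a diffusers inpainting pipeline; the checkpoint file, image paths and click coordinates are all placeholders:

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from diffusers import StableDiffusionInpaintPipeline

# 1. Segment the product with SAM from a single foreground click
product = Image.open("red_cap.png").convert("RGB").resize((512, 512))
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(product))
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # a click somewhere on the cap
    point_labels=np.array([1]),           # 1 = foreground point
)
cap_mask = masks[np.argmax(scores)]       # keep the highest-scoring mask

# 2. Invert the mask: white = regenerate, black = product pixels to keep
infill_mask = Image.fromarray((~cap_mask).astype(np.uint8) * 255)

# 3. Generate everything around the untouched product
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="a person wearing a red baseball cap, studio photo",
    image=product,
    mask_image=infill_mask,
).images[0]
result.save("cap_on_person.png")
```

One caveat: inpainting models can still subtly alter pixels near the mask boundary, so compositing the original product back over the result is a common final step.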
Thanks, but that doesn't work. It takes the product image as a reference and creates similar images. What I'm looking for is to use the exact same product image.
The specification of exactly what you want probably needs refining, and for that it's probably helpful on your end to know what's available and what's actually needed.
There are different methodologies for taking Stable Diffusion and adding some new concept (like a product, from some images) to it. Hypernetworks, textual inversion and Dreambooth are three such methods.
I was trying to put the concept of specific people, of whom I had a number of good photos, into Stable Diffusion. They say that for images of people Dreambooth is better than hypernetworks and textual inversion, and it was. Hypernetworks were nowhere near as good for pictures of people, though I got a good picture once in a while. I never got anything good out of textual inversion for pictures of people.
A more recent technique for putting new concepts into Stable Diffusion is LoRA. I'm not that familiar with it, but it's being used and is another alternative method.
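For what it's worth, loading an already-trained LoRA into a pipeline is only a couple of lines with diffusers; the LoRA path and trigger word below are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach LoRA weights trained on the new concept (hypothetical local path)
pipe.load_lora_weights("./my-product-lora")

# "sks" stands in for whatever trigger word the LoRA was trained with
image = pipe("photo of a sks cap on a wooden table").images[0]
image.save("lora_sample.png")
```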
I have an Nvidia RTX 3060 card and spent hours (sometimes a day) making each model, and usually I had dozens of decent pictures (sometimes over 100) of each subject. It came out well. I needed a number of high-quality photos of each subject from different angles; not just different angles, but some full-body shots, mid-body shots and some shots of just the face. Also, if they were wearing glasses in 95% of the photos, most of the output would have them wearing glasses. And if most of my photos were group shots where I could barely crop their face away from the faces around them, it was harder to do.
If you're willing to put in a little work it's possible, but you're going to have to get your product rendered or photographed at the angle you want it to appear in the photo, then mask the product while having the AI generate the rest of the image.
I've done this with some success using t-shirt mockup templates to get the color, shadows, folds and creases in the clothing right, then regenerating everything around the shirt.
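In rough code, that workflow could look like this; a sketch assuming a 512x512 mockup and a hand-placed rectangle over the shirt (the coordinates and file names are made up):

```python
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

mockup = Image.open("tshirt_mockup.png").convert("RGB").resize((512, 512))

# Mask where white = regenerate and black = keep, so the shirt area is
# painted black to preserve the mockup's colors, folds and creases
mask = Image.new("L", mockup.size, 255)
draw = ImageDraw.Draw(mask)
draw.rectangle([120, 140, 390, 480], fill=0)  # hypothetical shirt bounding box

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="a model standing in a sunlit studio wearing a t-shirt",
    image=mockup,
    mask_image=mask,
).images[0]
result.save("tshirt_scene.png")
```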
> Or... just take a picture of someone wearing the hat.
That's the hard part, right? Because you don't want just any person. You want a very specific person, at a very specific location, with very specific lighting and angles, etc.
> lets say we have a cap and I want to generate a human image wearing that same cap
Do you have an image of the cap in the right orientation, i.e. as it would appear sitting on someone's head?
If not, any algorithm is necessarily going to have to invent what the cap looks like from another angle, making up details on any previously hidden side and guessing at the depth of different parts of the still image in order to rotate it into the right orientation.
If yes, crop it out and paste it onto the target head.
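That last case is plain compositing, no generative model needed. A sketch with PIL, where the file names and paste coordinates are hypothetical:

```python
from PIL import Image

scene = Image.open("person.png").convert("RGBA")
cap = Image.open("cap_cutout.png").convert("RGBA")  # cap cropped out, transparent background

# Paste at the head position; passing the cutout again uses its alpha channel as the mask
scene.paste(cap, (180, 40), cap)
scene.convert("RGB").save("composited.jpg")
```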
Stable Diffusion with a suitably trained LoRA? The trick is finding a tool that gives rapid LoRA fine-tuning. I'm sure they exist, but I can't come up with one right now.
Once you get that part right, using something like Krita with AI Diffusion should give you a nice, fast process flow.
In that image, a little bit of the back side of the cap is visible: the dark border at the bottom right.
So as I understand it, the AI would have to figure that out on its own and remove it before it adds the boy to the image?
Also, that image has watermarks all over it. Does that mean the AI has to detect and remove those?
The perspective of the cap is rather unusual for a photo on a human, as it would cover the eyes. Does that mean the eyes of the boy would be covered in the result? Or do you expect the AI to change the perspective of the product photo?
> Also, that image has watermarks all over it. Does that mean the AI has to detect and remove those?
I don't think that's what they meant. I think they mean they will use photos of products that they themselves (or the factories they buy from) provide, without watermarks, and they want to add generated elements like people into the photo while keeping the product itself exactly as it was.