Add support for custom conditioning image#812

Merged
Nerogar merged 7 commits into Nerogar:master from wenyifancc:custom_condimg
May 27, 2025
Conversation

@wenyifancc
Contributor

When training the Flux Fill model, custom conditioning images can help the model learn specific behavioral concepts, such as object removal. Experiments have shown that with masking alone the model cannot learn specific behaviors; it only learns the features of the masked subject. However, by supplying custom conditioning images, so that the model sees the difference between the before and after images, it can learn specific behavioral features with satisfactory results.

Example:
Scenario: object removal
Base model: Flux-Fill-dev
Dataset: https://huggingface.co/datasets/lrzjason/ObjectRemovalAlpha
Dataset folder structure:
1-condlabel.png  // before object removal (conditioning image)
1.png            // after object removal (target image)
1.txt            // prompt, e.g. "rmo"
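The file-pairing convention above can be sketched with a small helper. This is an illustrative sketch only, not code from the PR; the function name is hypothetical, but the `-condlabel` suffix follows the example dataset:

```python
from pathlib import Path

def collect_pairs(dataset_dir: str):
    """Pair each target image with its conditioning image and prompt.

    Convention (from the example dataset):
      1-condlabel.png  -> conditioning image (before object removal)
      1.png            -> target image (after object removal)
      1.txt            -> prompt, e.g. "rmo"
    """
    pairs = []
    for target in sorted(Path(dataset_dir).glob("*.png")):
        if target.stem.endswith("-condlabel"):
            continue  # skip the conditioning images themselves
        cond = target.with_name(f"{target.stem}-condlabel.png")
        prompt_file = target.with_suffix(".txt")
        if cond.exists() and prompt_file.exists():
            pairs.append((cond, target, prompt_file.read_text().strip()))
    return pairs
```

Samples with a `-condlabel` sibling get the custom conditioning image; everything else would fall through to the original masking behavior.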

Results after 1700 steps of training:
Original image: [image]
Masked image: [image]
Using Flux Fill with the LoRA trained above:
Prompt: rmo
Result image: [image]

@Nerogar
Owner

Nerogar commented Apr 27, 2025

Just FYI: I think this is a great idea that can be really useful. But I see a small issue with the data loader and I haven't thought of a good solution yet.

I'd like to have a fallback to the original implementation for all images that don't have a custom conditioning image. MGDS has a fallback option that can be used for this, but it requires the modules to be added in the correct order. The module that outputs the loaded image needs to be placed after the cond image generation module. And I think the intermediate modules need to be aware of possible None values.
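The fallback ordering Nerogar describes can be sketched as follows. This is a conceptual illustration only, not the real MGDS API; the function names are hypothetical. The point is that the custom-conditioning-image loader runs first, and the mask-based generation module only fills in samples where the loader produced `None`:

```python
# Conceptual sketch of the fallback ordering, not actual MGDS modules.

def load_custom_cond(sample: dict):
    # Returns None when no custom conditioning image exists for this sample.
    return sample.get("custom_cond_image")  # may be None

def generate_cond_from_mask(sample: dict):
    # Original behavior: derive the cond image by masking the target image.
    image, mask = sample["image"], sample["mask"]
    return [px * (1 - m) for px, m in zip(image, mask)]

def resolve_cond_image(sample: dict):
    cond = load_custom_cond(sample)
    if cond is None:
        # Fallback path: any intermediate module between the loader and
        # this point has to tolerate the None value.
        cond = generate_cond_from_mask(sample)
    return cond
```

Placing the generation step after the loader, as sketched here, is what lets datasets mix images with and without custom conditioning images.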

wenyifan added 3 commits April 28, 2025 16:48
@yang-0201

@wenyifancc Thank you for your work; I'm very interested in this direction as well! Have you conducted any follow-up research or testing on fine-tuning Flux Fill with custom conditioning images in OneTrainer to achieve better results on specific datasets?

@Nerogar Nerogar merged commit 16e938d into Nerogar:master May 27, 2025
1 check passed
@hnsywangxin

@wenyifancc I noticed that your data format is different from the one in this link. That dataset requires three images: the original image, the ground truth after object removal, and the mask, while you only have two images. How would you train for the object removal task in this case?

@wenyifancc
Contributor Author

> @wenyifancc I noticed that your data format is different from the one in this link. That dataset requires three images: the original image, the ground truth after object removal, and the mask, while you only have two images. How would you train for the object removal task in this case?

During actual training and testing, I found that providing only the pre-removal image (the conditioning image) and the post-removal image (the target the model should generate) lets the model learn the transformation between them: a general pattern or behavioral concept, rather than the features of a specific object. This approach generalized better, so only two images per pair are needed to train for object removal scenarios.

@hnsywangxin

hnsywangxin commented Aug 1, 2025

> During actual training and testing, it was found that providing only the pre-removal image (condition image) and the target image to be generated by the model (post-removal image) enables the model to learn the transformation pattern between them (the model learns a general pattern or behavioral concept, which differs from training for specific object targets). This approach resulted in better generalization performance, thus only requiring two images per pair in the dataset to train for object removal scenarios.

@wenyifancc Thank you for your reply. My goal is to remove objects using a specified mask; in other words, my prompt is fixed, such as "remove this object." I have two questions:

1. If only two images are used, it seems this cannot be achieved.
2. If it is possible, then image 1 would be the conditioning image (file name: 1-masklabel.png) (actually, I don't quite understand this part: does the conditioning image refer to the original image or the image with the mask?), and image 2 the post-removal image (file name: 1.png). The .txt file's content is "remove this object." Is my understanding correct?

[image]
[image]

@wenyifancc
Contributor Author

> During actual training and testing, it was found that providing only the pre-removal image (condition image) and the target image to be generated by the model (post-removal image) enables the model to learn the transformation pattern between them (the model learns a general pattern or behavioral concept, which differs from training for specific object targets). This approach resulted in better generalization performance, thus only requiring two images per pair in the dataset to train for object removal scenarios.

> @wenyifancc Thank you for your reply. My goal is to remove objects using a specified mask, or in other words, my prompt is fixed, such as "remove this object." I have two questions:
> 1. If only two images are used, it seems that this cannot be achieved.
> 2. If it is possible, then image 1 can be used as the condition image (file name: 1-masklabel.png) (actually, I don't quite understand this part: does the condition image refer to the original image or the image with the mask?), and image 2 as the post-removal image (file name: 1.png). The txt file's content is "remove this object." Is my understanding correct? [image] [image]

Based on your description, your approach is actually equivalent to OneTrainer's mask training mode: using the image after object removal (Image 2) together with the mask, OneTrainer generates the conditioning image (Image 1) from Image 2 and the mask, then predicts Image 2. This method can achieve a certain level of object removal (it is how I originally did it), but it performs poorly in both effectiveness and generalization, and it requires a larger dataset for training.

After some practical experiments, I changed my approach: I directly use the image before object removal as the conditioning image and the image after removal as the prediction target. This lets the model learn a specific behavioral pattern rather than the features of the target object, using only a small dataset, and results in better performance and generalization. That's why I submitted this PR. This training method works well for object removal and clothes removal (NSFW warning hahaha~). Especially in the clothes-removal scenario, the generalization is surprisingly good.
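The two approaches contrasted above can be sketched side by side. This is a conceptual illustration only, not OneTrainer's actual pipeline; lists stand in for images and the function names are hypothetical:

```python
# Conceptual sketch of the two ways of building a training pair.
# Lists stand in for images; 1 in the mask marks the region to remove.

def mask_mode_pair(after_img, mask):
    # OneTrainer's mask training mode: the conditioning image is derived
    # from the target itself by blanking the masked region, so the model
    # mostly learns the features of whatever sits under the mask.
    cond = [px * (1 - m) for px, m in zip(after_img, mask)]
    return cond, after_img  # (conditioning image, prediction target)

def custom_cond_pair(before_img, after_img):
    # This PR: the untouched "before" image is the conditioning image, so
    # the model learns the before -> after transformation itself.
    return before_img, after_img
```

In the first mode the conditioning image carries no information about the removed object's surroundings beyond the blanked target; in the second, the full before/after difference is visible to the model, which is what lets it pick up the behavioral pattern from a small dataset.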

@hnsywangxin

@wenyifancc I see, thank you

@dxqb
Collaborator

dxqb commented Feb 9, 2026

It might not have been known by that term at the time, but what you've implemented in this PR is edit training, as implemented by Flux2 (and similar to Qwen-Image-Edit).
Here is a PR to use it for Flux2: #1301
