Add support for custom conditioning image #812
Conversation
Just FYI: I think this is a great idea that could be really useful. But I see a small issue with the data loader, and I haven't thought of a good solution yet. I'd like a fallback to the original implementation for all images that don't have a custom conditioning image. MGDS has a fallback option that can be used for this, but it requires the modules to be added in the correct order: the module that outputs the loaded image needs to be placed after the cond-image generation module. And I think the intermediate modules need to be aware of possible None values.
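A generic sketch of the fallback ordering described above. This is plain Python, not the MGDS API; the module and key names are hypothetical and only illustrate why the loaded-image module must run after the cond-image module:

```python
# Hypothetical pipeline modules illustrating the ordering constraint:
# the custom-cond loader runs first and may yield None; the module that
# outputs the loaded source image runs after it and provides the fallback.

def load_custom_cond(sample):
    # Returns None in "cond_image" when no custom conditioning image
    # exists for this sample (stand-in for real image loading).
    sample["cond_image"] = sample.get("custom_cond_path")
    return sample

def load_source_image(sample):
    # Runs AFTER the cond module, so it can fall back to the original
    # behavior when no custom conditioning image was found.
    sample["image"] = sample["image_path"]  # stand-in for real loading
    if sample["cond_image"] is None:
        sample["cond_image"] = sample["image"]  # fallback to original impl
    return sample

def run_pipeline(sample):
    # Intermediate modules must tolerate cond_image being None until the
    # fallback module has run.
    for module in (load_custom_cond, load_source_image):
        sample = module(sample)
    return sample
```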
# Conflicts:
#	modules/ui/TrainingTab.py
#	modules/util/config/TrainConfig.py
@wenyifancc Thank you for your work—I'm very interested in this direction as well! Have you conducted any follow-up research or testing on fine-tuning Flux Fill with custom conditioning images in OneTrainer to achieve better results on specific datasets?
@wenyifancc I noticed that your data format is different from the one in this link. That dataset requires three images: the original image, the ground truth after object removal, and the mask, while you only have two images. How would you train for the object removal task in this case?
During actual training and testing, I found that providing only the pre-removal image (the condition image) and the target image the model should generate (the post-removal image) is enough for the model to learn the transformation pattern between them: the model learns a general behavioral concept, which differs from training on specific object targets. This approach resulted in better generalization, so only two images per pair are required in the dataset to train for object-removal scenarios.
@wenyifancc Thank you for your reply. My goal is to remove objects using a specified mask; in other words, my prompt is fixed, such as "remove this object." I have two questions: 1. If only two images are used, it seems that this cannot be achieved.
Based on your description, your approach is actually equivalent to OneTrainer's masked training mode (i.e., using the image after object removal (image 2) together with the mask; OneTrainer generates the condition image (image 1) from image 2 and the mask, then predicts image 2). This method can achieve a certain level of object removal (which is how I originally did it), but it performs poorly in both effectiveness and generalization, and requires a larger dataset for training. After some practical experiments, I changed my approach: I directly use the image before object removal as the condition image and the image after removal as the prediction target. This lets the model learn a specific behavioral pattern rather than the features of the target object, using only a small dataset, with better performance and generalization. That's why I submitted this PR. This training method works well for object removal and clothes removal (NSFW warning, hahaha~). Especially in the clothes-removal scenario, the generalization it exhibits is surprisingly good.
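The difference between the two setups can be sketched with arrays. This is an illustration only; the `mode` names are hypothetical labels for the two approaches discussed above, not OneTrainer options:

```python
import numpy as np

def build_condition(mode, target, mask=None, custom_cond=None):
    """Build the conditioning image for one training pair.

    'masked' — masked-training style: the condition is the target image
               with the masked region blanked out, so the model tends to
               learn the masked object's features.
    'edit'   — this PR's style: the condition is a separate before-edit
               image, so the model learns the before->after transformation.
    """
    if mode == "masked":
        # mask == 1 marks the region the model must regenerate
        return target * (1 - mask)
    if mode == "edit":
        # the user supplies the pre-edit image directly
        return custom_cond
    raise ValueError(f"unknown mode: {mode}")
```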
@wenyifancc I see, thank you |
It might not have been known by that term at the time, but what you've implemented in this PR is edit training, as implemented by Flux2 (and similar to Qwen-Image-Edit). |




When training the Flux Fill model, custom conditioning images can help the model learn specific behavioral concepts, such as object removal. Experiments showed that with masking alone the model cannot learn a specific behavior; it only learns the features of the masked subject. By providing custom conditioning images, i.e., showing the difference between the before and after images, the model can learn the specific behavioral pattern, with satisfactory results.
Example:
Scenario: object removal
Base model: Flux-Fill-dev
Dataset:
https://huggingface.co/datasets/lrzjason/ObjectRemovalAlpha
Dataset folder structure:

```
1-condlabel.png  // before object removal (condition image)
1.png            // after object removal (target image)
1.txt            // prompt text, such as "rmo"
```
Results after 1700 steps of training:



Original image:
Masked image:
Using Flux Fill with the LoRA trained above:
Prompt: rmo
Result image: