train_dreambooth_lora.py -- ValueError: Attempting to unscale FP16 gradients caused by "--validation_prompt" param. #13124

@Xjmengnieer

Description

Describe the bug

Hello, when I run the training script provided in `examples/dreambooth` with the following command, I encounter the "Attempting to unscale FP16 gradients" error:

```shell
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="tensorboard" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50
```

I then tried the solution proposed in PR #6554 and found that the following command does not trigger this issue:

```shell
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --gradient_checkpointing \
  --seed="0" \
  --report_to="tensorboard"
```

I then aligned the two commands argument by argument and found that the error is triggered specifically when the `--validation_prompt` argument is provided.

Does this indicate that there is still a bug related to validation in this script?

Reproduction

```shell
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="tensorboard" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50
```

Logs

System Info

  • 🤗 Diffusers version: 0.37.0.dev0
  • Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.9.1+cu128 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 1.4.1
  • Transformers version: 5.1.0
  • Accelerate version: 1.12.0
  • PEFT version: 0.18.1
  • Bitsandbytes version: not installed
  • Safetensors version: 0.7.0
  • xFormers version: not installed
  • Accelerator: NVIDIA GeForce RTX 3060, 12288 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Labels: bug (Something isn't working)