
[Bug]: NaN error when training vpred model with generalized offset noise enabled #1389

@yamatazen

Description


What happened?

This error occurs only when the base model is a v-prediction (vpred) model.
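For context, the standard v-prediction target and plain offset noise look roughly like the sketch below. This is a generic illustration under stated assumptions, not OneTrainer's actual implementation; `offset_strength` and `per_channel_offset` are hypothetical names, and OneTrainer's "generalized" offset noise may scale or normalize differently.

```python
import math
import random

def v_prediction_target(x0: float, eps: float, alpha_bar: float) -> float:
    # Standard v-prediction target: v = sqrt(a) * eps - sqrt(1 - a) * x0,
    # where a (alpha_bar) is the cumulative noise-schedule product for the
    # sampled timestep and must lie in [0, 1].
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0

def offset_noise(eps: float, offset_strength: float = 0.1) -> float:
    # Plain offset noise adds a shared random offset to the Gaussian noise
    # sample; "generalized" offset noise presumably varies this scheme.
    per_channel_offset = random.gauss(0.0, 1.0)  # hypothetical name
    return eps + offset_strength * per_channel_offset
```

One plausible failure mode: if the schedule value `alpha_bar` ends up outside [0, 1] (for example through a state mismatch after resuming from backup), `torch.sqrt` of a negative tensor silently yields NaN, which then propagates into the loss.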

Steps to reproduce:

  1. Train LoRA on a vpred model.
  2. Stop training.
  3. Restart OneTrainer.
  4. Resume training from backup.
  5. Wait for sampling.

What did you expect would happen?

Training continues after resuming from the backup, without a NaN error.

Relevant log output

Continuing training from backup 'C:/Users/yamat/Documents/OneTrainer/vpred\backup\2026-03-24_19-05-55-backup-270-8-6'...
Fetching 17 files: 100%|███████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 245028.07it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:08<00:00,  1.16s/it]
Selected layers: 722
Deselected layers: 72
Note: Enable Debug mode to see the full list of layer names
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 23.07it/s]
enumerating sample paths:   0%|                                                                  | 0/1 [00:00<?, ?it/s]W0324 20:56:29.563000 14008 venv\Lib\site-packages\torch\_inductor\utils.py:1613] [0/0] Not enough SMs to use max_autotune_gemm mode
step: 100%|█████████████████████████████████████████████| 33/33 [04:05<00:00,  9.08s/it, loss=0.139, smooth loss=0.155]
caching: 100%|███████████████████████████████████████████████████████████████████████| 145/145 [00:07<00:00, 18.35it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 30/30 [00:25<00:00,  1.19it/s]
Creating Backup C:/Users/yamat/Documents/OneTrainer/vpred\backup\2026-03-24_21-01-29-backup-298-9-1
step:  12%|█████▌                                        | 4/33 [01:24<10:10, 21.06s/it, loss=0.186, smooth loss=0.157]
epoch:   8%|██████                                                                   | 1/12 [05:45<1:03:23, 345.78s/it]
Traceback (most recent call last):
  File "C:\Users\yamat\Desktop\OneTrainer\modules\ui\TrainUI.py", line 719, in __training_thread_function
    trainer.train()
    ~~~~~~~~~~~~~^^
  File "C:\Users\yamat\Desktop\OneTrainer\modules\trainer\GenericTrainer.py", line 796, in train
    raise RuntimeError("Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.")
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
Creating Backup C:/Users/yamat/Documents/OneTrainer/vpred\backup\2026-03-24_21-02-08-backup-301-9-4
Saving C:/Users/yamat/Documents/OneTrainer/vpred/vpred.safetensors
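The guard that produces this RuntimeError in `GenericTrainer.py` can be sketched as follows. Function and variable names here are hypothetical; only the error message is taken from the traceback above, and the real code would check a torch tensor rather than a Python float.

```python
import math

def check_loss_finite(loss_value: float) -> None:
    # Sketch of the NaN guard seen in the traceback: abort training the
    # moment the scalar training loss becomes NaN.
    if math.isnan(loss_value):
        raise RuntimeError(
            "Training loss became NaN. This may be due to invalid parameters, "
            "precision issues, or a bug in the loss computation."
        )

check_loss_finite(0.139)  # a healthy loss value passes silently
```

Note that the guard fires on the first NaN step, so the underlying cause (here, something in the vpred + generalized offset noise path after resuming) happens at or before step 4/33 in the log above.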

Generate and upload debug_report.log

=== System Information ===
OS: Windows 11
Version: 10.0.26200

=== Hardware Information ===
CPU: 12th Gen Intel(R) Core(TM) i7-12700F (Cores: 12)
Total RAM: 15.76 GB

=== GPU Information ===
NVIDIA GPU (Index 0): NVIDIA GeForce RTX 3060 [NVIDIA]
Driver version: 595.79
Power Limit: 170.00 W

=== Python Environment ===
Global Python Version: 3.13.12
Python Executable Path: C:\Users\anonymous\Desktop\OneTrainer\venv\Scripts\python.exe
PyTorch Info: torch==2.9.1+cu128
pip freeze output:
absl-py==2.4.0
accelerate==1.12.0
adv_optm==2.2.3
aiodns==4.0.0
aiohappyeyeballs==2.6.1
aiohttp==3.13.3
aiohttp-retry==2.9.1
aiosignal==1.4.0
annotated-doc==0.0.4
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.12.1
attrs==26.1.0
av==16.1.0
backoff==2.2.1
backports.zstd==1.3.0
bcrypt==5.0.0
bitsandbytes==0.49.1
boto3==1.42.72
botocore==1.42.72
brotli==1.2.0
certifi==2026.2.25
cffi==2.0.0
charset-normalizer==3.4.6
click==8.2.1
cloudpickle==3.1.2
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.3
cryptography==45.0.7
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.2.1
deepdiff==8.6.1
Deprecated==1.3.1
-e git+https://github.com/huggingface/diffusers.git@99daaa802da01ef4cff5141f4f3c0329a57fb591#egg=diffusers
dnspython==2.8.0
email-validator==2.3.0
fabric==3.2.2
fastapi==0.135.1
fastapi-cli==0.0.24
fastapi-cloud-cli==0.15.0
fastar==0.8.0
filelock==3.25.2
flatbuffers==25.12.19
fonttools==4.62.1
frozenlist==1.8.0
fsspec==2026.2.0
ftfy==6.3.1
gguf==0.17.1
grpcio==1.78.1
h11==0.16.0
httpcore==1.0.9
httptools==0.7.1
httpx==0.28.1
huggingface-hub==0.34.4
humanfriendly==10.0
idna==3.11
imagesize==1.4.1
importlib_metadata==9.0.0
inquirerpy==0.3.4
invisible-watermark==0.2.0
invoke==2.2.1
itsdangerous==2.2.0
Jinja2==3.1.6
jmespath==1.1.0
kiwisolver==1.5.0
lightning-utilities==0.15.3
lion-pytorch==0.2.3
Markdown==3.10.2
markdown-it-py==4.0.0
MarkupSafe==3.0.3
matplotlib==3.10.3
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@a25b59f7619da99fdc6f8e8d5a0d89be519a4671#egg=mgds
mpmath==1.3.0
multidict==6.7.1
-e git+https://github.com/KellerJordan/Muon.git@f90a42b28e00b8d9d2d05865fe90d9f39abcbcbd#egg=muon_optimizer
networkx==3.6.1
numpy==2.2.6
nvidia-ml-py==13.595.45
omegaconf==2.3.0
onnxruntime-gpu==1.23.2
open_clip_torch==2.32.0
opencv-python==4.11.0.86
orderly-set==5.5.0
orjson==3.11.7
packaging==26.0
paramiko==4.0.0
parse==1.20.2
pfzy==0.3.4
pillow==12.1.1
platformdirs==4.9.4
pooch==1.8.2
prettytable==3.17.0
prodigy-plus-schedule-free==2.0.1
prodigyopt==1.1.2
prompt_toolkit==3.0.52
propcache==0.4.1
protobuf==7.34.0
psutil==7.0.0
py-cpuinfo==9.0.0
pycares==5.0.1
pycparser==3.0
pydantic==2.12.5
pydantic-extra-types==2.11.1
pydantic-settings==2.13.1
pydantic_core==2.41.5
Pygments==2.19.2
PyNaCl==1.6.2
pyparsing==3.3.2
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
python-dotenv==1.2.2
python-multipart==0.0.22
pytorch-lightning==2.6.1
pytorch_optimizer==3.6.0
PyWavelets==1.9.0
PyYAML==6.0.2
regex==2026.2.28
requests==2.32.5
rich==14.3.3
rich-toolkit==0.19.7
rignore==0.7.6
runpod==1.7.10
s3transfer==0.16.0
safetensors==0.7.0
scalene==1.5.51
scenedetect==0.6.7.1
schedulefree==1.4.1
scipy==1.15.3
sentencepiece==0.2.1
sentry-sdk==2.55.0
setuptools==81.0.0
shellingham==1.5.4
six==1.17.0
starlette==0.52.1
sympy==1.14.0
tensorboard==2.20.0
tensorboard-data-server==0.7.2
timm==1.0.25
tokenizers==0.22.2
tomli==2.4.0
tomlkit==0.14.0
torch==2.9.1+cu128
torchmetrics==1.9.0
torchvision==0.24.1+cu128
tqdm==4.67.1
tqdm-loggable==0.4.1
transformers==4.57.6
triton-windows==3.5.1.post24
typer==0.24.1
typing-inspection==0.4.2
typing_extensions==4.15.0
ujson==5.11.0
urllib3==2.6.3
uvicorn==0.42.0
watchdog==6.0.0
watchfiles==1.1.1
wcwidth==0.6.0
websockets==16.0
Werkzeug==3.1.6
wheel==0.46.3
wrapt==2.1.2
yarl==1.23.0
yt-dlp==2026.3.17
zipp==3.23.0

=== Git Information ===
Repo: Nerogar/OneTrainer
Branch: master
Commit: cb6cab2
No deleted, unmerged, or modified files relative to origin/master.

=== Network Connectivity ===
PyPI (https://pypi.org/): Failure: expected string or bytes-like object, got 'NoneType'
HuggingFace (https://huggingface.co): Failure: expected string or bytes-like object, got 'NoneType'
Google (https://www.google.com): Failure: expected string or bytes-like object, got 'NoneType'

=== Intel Microcode Information ===
CPU is not detected as 13th or 14th Gen Intel - microcode info not applicable.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working) · invalid (This doesn't seem right)
