
Switch torch dependency from ~=2.9.1 to ~=2.10.0 (silent bfloat16 memory regression) #118

Open
hanaol wants to merge 3 commits into main from hanaol/torch-version-upgrade

Conversation


@hanaol hanaol commented Apr 9, 2026

Summary

This PR updates the torch dependency from ~=2.9.1 to ~=2.10.0 to fix a silent bfloat16 memory regression introduced in torch 2.9.0.

The problem

torch 2.9.0 and 2.9.1 contain a cuDNN regression that inflates the nn.Conv3d bfloat16 forward-pass workspace by 26x -- from ~77 MB to ~2,053 MB -- relative to both the preceding (2.8.0) and following (2.10.0) releases. These numbers were measured on a fixed tensor of shape [1, 32, 64, 64, 64] with a Conv3d(in=32, out=32, k=5, padding=2) layer. float32 memory is completely unaffected (stable at ~123 MB across all versions), confirming the bug is specific to the bfloat16 cuDNN kernel selection path.

This matters because we use (or plan to use) bf16-mixed precision training. This regression would silently consume an extra ~2 GB per Conv3d layer, directly undermining the memory savings that bf16 is supposed to provide -- without any crash or warning.
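For context, here is a minimal sketch of how a peak-memory figure like this can be measured (shapes match the benchmark below; the function names are illustrative and not taken from the actual scripts/benchmark_conv3d_memory.py):

```python
def bytes_to_mb(n: int) -> float:
    """Convert a byte count to MB (MiB), matching the units in the tables below."""
    return n / (1024 ** 2)


def conv3d_fwd_peak_mb(dtype_name: str = "bfloat16") -> float:
    """Peak forward-pass GPU memory (MB) for the Conv3d case described above.

    torch is imported lazily so bytes_to_mb stays usable without it.
    Requires a CUDA device.
    """
    import torch  # deferred: only needed when actually measuring

    dtype = getattr(torch, dtype_name)
    conv = torch.nn.Conv3d(32, 32, kernel_size=5, padding=2).to("cuda", dtype)
    x = torch.randn(1, 32, 64, 64, 64, device="cuda", dtype=dtype)

    with torch.no_grad():
        conv(x)  # warmup, lets cuDNN pick its algorithm before measuring
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        conv(x)
        torch.cuda.synchronize()
    return bytes_to_mb(torch.cuda.max_memory_allocated())
```

On a torch 2.9.x install this sketch should report roughly the inflated ~2 GB figure for bfloat16 while leaving float32 untouched.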

This issue has been raised in the PyTorch community as pytorch/pytorch#166643.

Benchmark results (A100-SXM4-80GB, CUDA 12.8, input shape [1, 32, 64, 64, 64])

Peak GPU memory — float32

| torch  | cuDNN   | Fwd peak (MB) | Bwd peak (MB) |
|--------|---------|---------------|---------------|
| 2.8.0  | 9.1.0.2 | 123           | 212           |
| 2.9.0  | 9.1.0.2 | 123           | 212           |
| 2.9.1  | 9.1.0.2 | 123           | 212           |
| 2.10.0 | 9.1.0.2 | 123           | 212           |
| 2.11.0 | 9.1.9.0 | 123           | 209           |

Peak GPU memory — bfloat16

| torch  | cuDNN   | Fwd peak (MB) | Bwd peak (MB) |
|--------|---------|---------------|---------------|
| 2.8.0  | 9.1.0.2 | 77            | 111           |
| 2.9.0  | 9.1.0.2 | 2053          | 2081          |
| 2.9.1  | 9.1.0.2 | 2053          | 2081          |
| 2.10.0 | 9.1.0.2 | 77            | 111           |
| 2.11.0 | 9.1.9.0 | 77            | 111           |

The benchmark script is included at scripts/benchmark_conv3d_memory.py and can be run standalone on any CUDA node.

Decision: 2.10.0 vs 2.11.0

Both 2.10.0 and 2.11.0 are clean. This PR pins to 2.10.0 for now. Upgrading to 2.11.0 is possible, but it introduces a CUDA 13.0 dependency (vs 12.8 for all prior versions), which pulls in a new set of nvidia-*-cu13 libraries that we have not yet tested against our full stack (lightning, etc.). Once our ecosystem catches up to CUDA 13.0, bumping to ~=2.11.0 is worth revisiting.

Files changed

  • pyproject.toml -- torch~=2.9.1 to torch~=2.10.0
  • uv.lock -- regenerated
  • scripts/benchmark_conv3d_memory.py -- standalone benchmark used to produce the results above


hanaol commented Apr 9, 2026

@claude review this PR 👀 👀 👀

claude bot commented Apr 9, 2026

Claude finished @hanaol's task in 2m 33s


PR Review: Switch torch ~=2.9.1 → ~=2.10.0

  • Gather context and read changed files
  • Review pyproject.toml change
  • Review benchmark script
  • Review uv.lock changes
  • Post final review

Summary: The fix is correct and well-motivated. The bfloat16 Conv3d regression in torch 2.9.x is real (pytorch/pytorch#166643), and 2.10.0 cleanly resolves it. The benchmark methodology is solid. A few things worth flagging below.


pyproject.toml

The one-line change is correct:

```diff
-"torch~=2.9.1",
+"torch~=2.10.0",
```

~=2.10.0 (compatible release) expands to >=2.10.0, <2.11.0, which is exactly the right constraint: narrow enough to avoid 2.11's CUDA 13.0 dependency, wide enough to pick up any 2.10.x patch release. No issues here.
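As a quick sanity check of that expansion, here is a sketch using plain version tuples (not the full PEP 440 logic from the packaging library):

```python
def satisfies_torch_pin(version: tuple[int, int, int]) -> bool:
    """True iff `version` satisfies torch~=2.10.0, i.e. >=2.10.0, <2.11.0."""
    return (2, 10, 0) <= version < (2, 11)
```

Tuple comparison handles the upper bound cleanly because the shorter prefix (2, 11) sorts before any (2, 11, x), so every 2.11.x release is excluded.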


benchmark script (scripts/benchmark_conv3d_memory.py)

Overall well-written. A few observations:

Good:

  • Clear module docstring and inline comments
  • Warmup iterations before measurement — correct practice
  • Averages over BENCH_ITERS=5 for stable timing
  • cuDNN version decoding handles both the pre-9 and post-9 integer formats correctly
  • Uses Path.open() and proper JSON output
  • Structured results with full GPU metadata (SM count, VRAM)
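For reference, the two integer formats mentioned above can be decoded roughly like this (a sketch; it assumes the common packing used by torch.backends.cudnn.version(), e.g. 8902 for 8.9.2 and 90100 for 9.1.0):

```python
def decode_cudnn_version(v: int) -> tuple[int, int, int]:
    """Decode the torch.backends.cudnn.version() integer into (major, minor, patch).

    Assumed packings: pre-9 releases use major*1000 + minor*100 + patch;
    9.x releases use major*10000 + minor*100 + patch.
    """
    if v >= 90000:  # cuDNN 9+ format
        return v // 10000, (v % 10000) // 100, v % 100
    return v // 1000, (v % 1000) // 100, v % 100
```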

Minor issues:

  1. --output is required (line 106) — the script exits with code 1 when CUDA is unavailable (correct), but fails with an argparse error if --output is omitted. Consider a sensible default like results.json so it can be run with just python benchmark_conv3d_memory.py. Low severity.

  2. Backward memory semantics (lines 79–88): reset_peak_memory_stats() resets the peak counter to the current allocation level, so bwd_peak_mb captures the total peak memory during the backward pass, including the activations already allocated by the fresh out = conv(x) forward. It is therefore not purely the backward-specific allocation. The table in the PR description labels the columns "Bwd peak (MB)", which is slightly ambiguous, but since the methodology is consistent across all torch versions, the relative comparison is valid. Worth adding a comment in the script clarifying this.

  3. out from forward loop leaks into backward loop (line 81): The last out tensor from the forward benchmark loop is still live when the backward loop starts — but it gets immediately overwritten by out = conv(x), so there's no actual memory leak. Just a readability nit.


uv.lock

The changes look exactly right: torch 2.9.1 wheel hashes replaced with torch 2.10.0 equivalents across all platform/Python combinations. The build suffix changed from plain torch-2.9.1-... to torch-2.10.0-2-... and torch-2.10.0-3-... (PyPI build numbers), which is expected.


Commit history

The branch has three commits but two of them are noisy:

  • 45f70f6 — "modules to check later": adds cudnn.benchmark = False + cudnn.deterministic = True to main.py, plus the actual pyproject.toml/uv.lock change
  • b11cc87 — "modified main.py": immediately reverts those two main.py lines

The net effect on main.py is zero, but the history carries exploratory cruft with unhelpful messages. This is fine for a feature branch, but squashing to a single clean commit before merge would make the history easier to read. Not a blocker.


Verdict

Approve with minor suggestions. The core change is correct, minimal, and backed by reproducible evidence. The benchmark script is a valuable addition. The commit history and --output default are polish items, not blockers.

