Optuna integration for hyperparameter optimization #84

forklady42 wants to merge 8 commits into main
Conversation
@claude review this PR
Claude finished @forklady42's task in 2m 4s

PR Review: Optuna Integration for Hyperparameter Optimization
Overall this is a solid and well-structured HPO integration. The design choices (ddp_spawn over ddp, per-trial W&B runs, optional dependency group) are all sensible. A few issues worth addressing before merge:

Bugs / Correctness
self.log("val_loss", loss, on_step=True, on_epoch=True, sync_dist=True)

In PyTorch Lightning, this creates both a val_loss_step and a val_loss_epoch key, so the plain val_loss key that ModelCheckpoint, EarlyStopping, and the trial objective read from callback_metrics becomes unreliable.

Issues
# train_workers: 8
# val_workers: 2

The OOM heuristic ignores batch size: the memory check guards against large channel counts but doesn't account for batch size or the input's spatial dimensions.

No installation hint in the docstring.
Nits

- WandB: in WandB SDK ≥0.18, ...
- Checkpoint accumulation (...)
- The search space includes ...
- SQLite is fine for sequential trials but can have contention with parallel workers.
Force-pushed 65514bb to 23a0aa4
We're not currently experimenting with

@claude re-review this PR
Claude finished @forklady42's task in 1m 56s

PR Re-Review: Optuna Integration for Hyperparameter Optimization
Good progress — the previous round of comments has been well addressed. One new bug surfaced during this pass.

Addressed Since Previous Review
Bug: ...
Introduces HPO infrastructure for ResUNet model tuning:

- New hpo.py entrypoint with Optuna objective function and pruning
- config_hpo.yaml defining search space for model/training params
- W&B logging for trial tracking with per-trial runs grouped by study
- Memory-safe pruning for high channel count configurations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Separate config for initial hyperparameter search focusing on critical parameters: depth, n_channels, n_residual_blocks, kernel_size, and lr. Fixes Tier 2 params (warmup, weight_decay, etc.) at sensible defaults. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HPO changes:

- Use ddp_spawn instead of ddp to avoid a subprocess re-launching issue where each GPU process created different Optuna trials
- Add --devices flag to control GPU count per trial

ResUNet fix:

- Fix decoder channel mismatch when n_residual_blocks > 1: the first decoder block takes 2*ch (concatenated), subsequent blocks take ch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
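The decoder rule in that fix can be sanity-checked with a small helper; the function name is made up for illustration, but the channel bookkeeping follows the commit's description:

```python
# Hypothetical helper mirroring the fix: at each decoder level, the first
# residual block consumes the skip-concatenated 2*ch channels, and every
# subsequent block consumes ch.
def decoder_block_channels(ch, n_residual_blocks):
    """Return (in_channels, out_channels) for each residual block."""
    blocks = [(2 * ch, ch)]
    blocks.extend((ch, ch) for _ in range(n_residual_blocks - 1))
    return blocks

print(decoder_block_channels(64, 3))
# → [(128, 64), (64, 64), (64, 64)]
```

With n_residual_blocks == 1 only the 2*ch block exists, which is why the mismatch only surfaced for n_residual_blocks > 1.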
With on_step=True and on_epoch=True, Lightning creates val_loss_step and val_loss_epoch keys, making the plain val_loss key unreliable in callback_metrics. This caused ModelCheckpoint, EarlyStopping, and the HPO trial result to silently fail. Setting on_step=False ensures the metric is logged only as val_loss at epoch end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 2a91af7 to 0482924
Rebased and fixed the val_loss logging bug.
Thanks for looking into this Betsy. Did you happen to check larger kernel sizes, like 7?
@hanaol I initially tested 3 and 5 for kernel size. Since there are architectural changes still in flight, I've been limiting some of the sweeps until those land. To be clear, this PR focuses on the Optuna integration for running hyperparameter sweeps, rather than the output of the sweep itself.
I understand. Since you reported the best trial, I was curious which kernel sizes you had explored. I can now see the [3, 5] case in your code.
@hanaol anything you'd like addressed on this branch, or should I go ahead and merge so we can keep iterating on the hyperparameter sweep code?
@claude review once more
Claude encountered an error

I'll analyze this and get back to you.
@forklady42 I had missed this one, reviewing now...
## Summary - Pin `anthropics/claude-code-action` to `v1.0.88` in `.github/workflows/claude.yml` - Works around anthropics/claude-code-action#1187 — v1.0.89's `restoreConfigFromBase` crashes with `ENOENT: no such file or directory, symlink` because our `CLAUDE.md` is a symlink to `AGENTS.md` (from #91) - Every `@claude` trigger has failed since v1.0.89 went out on 2026-04-05; most recent failure: https://github.com/Quantum-Accelerators/electrai/actions/runs/24089849192 ## Why v1.0.88 - Last release before the regression (#1172 / #1174 landed in v1.0.89) - Upstream fix is anthropics/claude-code-action#1186 but it is still open; once that ships we can revert the pin back to `@v1` ## Test plan Note: `@claude`-ing this PR won't test the fix — `issue_comment` workflows are always resolved from the default branch, so the broken `@v1` on `main` would still run. The test has to happen post-merge. - [ ] Merge to `main` - [ ] Retrigger `@claude review once more` on #84 (it's already failing on v1.0.89) - [ ] Confirm the workflow run gets past `Restoring ... from origin/main (PR head is untrusted)` without an `ENOENT: no such file or directory, symlink` error - [ ] Once anthropics/claude-code-action#1186 ships, revert the pin back to `@v1`
return params

def apply_hyperparameters(cfg_dict: dict, params: dict) -> dict:
If a new parameter is added to search_space in the YAML but not added here, it will be silently dropped. This could produce a study with no error or warning.
There could be a guard at the top:
known_params = {"depth", "n_channels", "n_residual_blocks", "kernel_size",
                "lr", "weight_decay", "warmup_length", "gradient_clip_value",
                "batch_size", "augmentation"}
unknown = set(params) - known_params
if unknown:
    raise ValueError(f"Unknown hyperparameters with no config mapping: {unknown}")

depth = trial_cfg_dict["model"]["depth"]
n_channels = trial_cfg_dict["model"]["n_channels"]
max_channels_at_bottleneck = n_channels * (2**depth)
if max_channels_at_bottleneck > 512:
Curious how you arrived at the heuristic. Do you think the spatial dimensions of the input grid and precision should also factor in?
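One way to fold those factors in, sketched under stated assumptions: the function name, the cubic 3D grid halved at each level, and the 512-channel analogue are all made up for illustration, not the PR's code.

```python
# Hypothetical refinement of the pruning heuristic: estimate bottleneck
# activation memory from batch size, spatial grid size, and precision,
# rather than channel count alone. Assumes a cubic 3D input grid whose
# side is halved at each encoder level.
def bottleneck_activation_bytes(n_channels, depth, batch_size,
                                grid_size, bytes_per_elem=4):
    ch = n_channels * 2 ** depth                 # channels at bottleneck
    spatial = (grid_size // 2 ** depth) ** 3     # voxels at bottleneck
    return batch_size * ch * spatial * bytes_per_elem

# e.g. n_channels=32, depth=4 → 512 channels over a 4³ bottleneck
print(bottleneck_activation_bytes(32, 4, batch_size=2, grid_size=64))
# → 262144
```

A trial could then be pruned when this estimate exceeds a byte budget, with bytes_per_elem=2 under mixed precision.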
self.log(
-    "val_loss", loss, prog_bar=True, on_step=True, on_epoch=True, sync_dist=True
+    "val_loss",
+    loss,
Makes sense for HPO since on_step=True creates two metrics (val_loss and val_loss_epoch) which can confuse the monitoring callbacks. That said, per-step validation tracking is still useful during early model development to catch instabilities mid-epoch. Would it be worth making this configurable (e.g. a flag in the config) so we can track it closely while we're still iterating on the architecture?
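A sketch of the suggested flag, using a minimal stand-in rather than a real LightningModule so the shape is visible in isolation; the class and the log_val_on_step config key are hypothetical names:

```python
# Minimal stand-in for a LightningModule, showing only the logging call.
# Real code would subclass lightning.LightningModule and read the flag
# from the run config.
class HpoLitModuleSketch:
    def __init__(self, log_val_on_step=False):
        self.log_val_on_step = log_val_on_step
        self.logged = {}

    def log(self, name, value, **kwargs):
        self.logged[name] = kwargs  # record how each metric was logged

    def validation_step(self, loss):
        self.log(
            "val_loss",
            loss,
            prog_bar=True,
            on_step=self.log_val_on_step,  # False for HPO, True while iterating
            on_epoch=True,
            sync_dist=True,
        )

m = HpoLitModuleSketch(log_val_on_step=False)
m.validation_step(0.42)
print(m.logged["val_loss"]["on_step"])
# → False
```

With the flag off, only the epoch-level val_loss key exists, so ModelCheckpoint, EarlyStopping, and the trial objective all monitor the same metric.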
hanaol left a comment

Minor comments. Feel free to merge.
Optuna setup for initial coarse hyperparameter search. WandB runs are under betsy-mp-hpo-tier1. Best trial over 15 epochs: