Optuna integration for hyperparameter optimization #84

forklady42 wants to merge 8 commits into main
Conversation
@claude review this PR
Claude finished @forklady42's task in 2m 4s

PR Review: Optuna Integration for Hyperparameter Optimization
Overall this is a solid and well-structured HPO integration. The design choices (ddp_spawn over ddp, per-trial W&B runs, optional dependency group) are all sensible. A few issues worth addressing before merge:

Bugs / Correctness
self.log("val_loss", loss, on_step=True, on_epoch=True, sync_dist=True)

In PyTorch Lightning, this creates both a val_loss_step and a val_loss_epoch key, so the plain val_loss key that ModelCheckpoint, EarlyStopping, and the trial objective read from callback_metrics becomes unreliable.

Issues
# train_workers: 8
# val_workers: 2

The OOM heuristic ignores batch size: the memory check guards against large channel counts but doesn't account for batch size or the input's spatial dimensions.

No installation hint in the docstring.
Nits

- WandB: in WandB SDK ≥0.18, ...
- Checkpoint accumulation (...)
- The search space includes ...
- SQLite is fine for sequential trials but can have contention with parallel workers.
Force-pushed 65514bb to 23a0aa4
We're not currently experimenting with

@claude re-review this PR
Claude finished @forklady42's task in 1m 56s

PR Re-Review: Optuna Integration for Hyperparameter Optimization
Good progress — the previous round of comments has been well addressed. One new bug surfaced during this pass.

Addressed Since Previous Review
Bug: ...
Introduces HPO infrastructure for ResUNet model tuning:

- New hpo.py entrypoint with Optuna objective function and pruning
- config_hpo.yaml defining search space for model/training params
- W&B logging for trial tracking with per-trial runs grouped by study
- Memory-safe pruning for high channel count configurations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Separate config for initial hyperparameter search focusing on critical parameters: depth, n_channels, n_residual_blocks, kernel_size, and lr. Fixes Tier 2 params (warmup, weight_decay, etc.) at sensible defaults. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HPO changes:

- Use ddp_spawn instead of ddp to avoid a subprocess re-launching issue where each GPU process created different Optuna trials
- Add --devices flag to control GPU count per trial

ResUNet fix:

- Fix decoder channel mismatch when n_residual_blocks > 1: the first decoder block takes 2*ch (concatenated), subsequent blocks take ch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
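The decoder rule in that fix can be sanity-checked with a small helper; the function name is made up for illustration, but the channel bookkeeping follows the commit's description:

```python
# Hypothetical helper mirroring the fix: at each decoder level, the first
# residual block consumes the skip-concatenated 2*ch channels, and every
# subsequent block consumes ch.
def decoder_block_channels(ch, n_residual_blocks):
    """Return (in_channels, out_channels) for each residual block."""
    blocks = [(2 * ch, ch)]
    blocks.extend((ch, ch) for _ in range(n_residual_blocks - 1))
    return blocks

print(decoder_block_channels(64, 3))
# → [(128, 64), (64, 64), (64, 64)]
```

With n_residual_blocks == 1 only the 2*ch block exists, which is why the mismatch only surfaced for n_residual_blocks > 1.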
With on_step=True and on_epoch=True, Lightning creates val_loss_step and val_loss_epoch keys, making the plain val_loss key unreliable in callback_metrics. This caused ModelCheckpoint, EarlyStopping, and the HPO trial result to silently fail. Setting on_step=False ensures the metric is logged only as val_loss at epoch end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 2a91af7 to 0482924
Rebased and fixed the val_loss logging bug.
Thanks for looking into this Betsy. Did you happen to check larger kernel sizes, like 7?
@hanaol I initially tested 3 and 5 for kernel size. Since there are architectural changes still in flight, I've been limiting some of the sweeps until those land. To be clear, this PR focuses on the Optuna integration for running hyperparameter sweeps, rather than the output of the sweep itself.
I understand. Since you reported the best trial, I was curious which kernel sizes you had explored. I can now see the [3, 5] case in your code.
@hanaol anything you'd like addressed on this branch, or should I go ahead and merge so we can keep iterating on the hyperparameter sweep code?
@claude review once more
Claude encountered an error

I'll analyze this and get back to you.
@forklady42 I had missed this one, reviewing now...
## Summary - Pin `anthropics/claude-code-action` to `v1.0.88` in `.github/workflows/claude.yml` - Works around anthropics/claude-code-action#1187 — v1.0.89's `restoreConfigFromBase` crashes with `ENOENT: no such file or directory, symlink` because our `CLAUDE.md` is a symlink to `AGENTS.md` (from #91) - Every `@claude` trigger has failed since v1.0.89 went out on 2026-04-05; most recent failure: https://github.com/Quantum-Accelerators/electrai/actions/runs/24089849192 ## Why v1.0.88 - Last release before the regression (#1172 / #1174 landed in v1.0.89) - Upstream fix is anthropics/claude-code-action#1186 but it is still open; once that ships we can revert the pin back to `@v1` ## Test plan Note: `@claude`-ing this PR won't test the fix — `issue_comment` workflows are always resolved from the default branch, so the broken `@v1` on `main` would still run. The test has to happen post-merge. - [ ] Merge to `main` - [ ] Retrigger `@claude review once more` on #84 (it's already failing on v1.0.89) - [ ] Confirm the workflow run gets past `Restoring ... from origin/main (PR head is untrusted)` without an `ENOENT: no such file or directory, symlink` error - [ ] Once anthropics/claude-code-action#1186 ships, revert the pin back to `@v1`
return params

def apply_hyperparameters(cfg_dict: dict, params: dict) -> dict:
If a new parameter is added to search_space in the YAML but not added here, it will be silently dropped. This could produce a study with no error or warning.
There could be a guard at the top:
known_params = {"depth", "n_channels", "n_residual_blocks", "kernel_size",
                "lr", "weight_decay", "warmup_length", "gradient_clip_value",
                "batch_size", "augmentation"}
unknown = set(params) - known_params
if unknown:
    raise ValueError(f"Unknown hyperparameters with no config mapping: {unknown}")

depth = trial_cfg_dict["model"]["depth"]
n_channels = trial_cfg_dict["model"]["n_channels"]
max_channels_at_bottleneck = n_channels * (2**depth)
if max_channels_at_bottleneck > 512:
Curious how you arrived at the heuristic. Do you think the spatial dimensions of the input grid and precision should also factor in?
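One way to fold those factors in, sketched under stated assumptions: the function name, the cubic 3D grid halved at each level, and the 512-channel analogue are all made up for illustration, not the PR's code.

```python
# Hypothetical refinement of the pruning heuristic: estimate bottleneck
# activation memory from batch size, spatial grid size, and precision,
# rather than channel count alone. Assumes a cubic 3D input grid whose
# side is halved at each encoder level.
def bottleneck_activation_bytes(n_channels, depth, batch_size,
                                grid_size, bytes_per_elem=4):
    ch = n_channels * 2 ** depth                 # channels at bottleneck
    spatial = (grid_size // 2 ** depth) ** 3     # voxels at bottleneck
    return batch_size * ch * spatial * bytes_per_elem

# e.g. n_channels=32, depth=4 → 512 channels over a 4³ bottleneck
print(bottleneck_activation_bytes(32, 4, batch_size=2, grid_size=64))
# → 262144
```

A trial could then be pruned when this estimate exceeds a byte budget, with bytes_per_elem=2 under mixed precision.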
self.log(
-    "val_loss", loss, prog_bar=True, on_step=True, on_epoch=True, sync_dist=True
+    "val_loss",
+    loss,
Makes sense for HPO since on_step=True creates two metrics (val_loss and val_loss_epoch) which can confuse the monitoring callbacks. That said, per-step validation tracking is still useful during early model development to catch instabilities mid-epoch. Would it be worth making this configurable (e.g. a flag in the config) so we can track it closely while we're still iterating on the architecture?
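A sketch of the suggested flag, using a minimal stand-in rather than a real LightningModule so the shape is visible in isolation; the class and the log_val_on_step config key are hypothetical names:

```python
# Minimal stand-in for a LightningModule, showing only the logging call.
# Real code would subclass lightning.LightningModule and read the flag
# from the run config.
class HpoLitModuleSketch:
    def __init__(self, log_val_on_step=False):
        self.log_val_on_step = log_val_on_step
        self.logged = {}

    def log(self, name, value, **kwargs):
        self.logged[name] = kwargs  # record how each metric was logged

    def validation_step(self, loss):
        self.log(
            "val_loss",
            loss,
            prog_bar=True,
            on_step=self.log_val_on_step,  # False for HPO, True while iterating
            on_epoch=True,
            sync_dist=True,
        )

m = HpoLitModuleSketch(log_val_on_step=False)
m.validation_step(0.42)
print(m.logged["val_loss"]["on_step"])
# → False
```

With the flag off, only the epoch-level val_loss key exists, so ModelCheckpoint, EarlyStopping, and the trial objective all monitor the same metric.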
hanaol left a comment

Minor comments. Feel free to merge.
Optuna setup for initial coarse hyperparameter search. WandB runs are under betsy-mp-hpo-tier1. Best trial over 15 epochs: