optimizer: allow resume of short-completed runs#143
optimizer: allow resume of short-completed runs#143ZhengyaoJiang wants to merge 1 commit intodevfrom
Conversation
…_steps
A transient `Failed to submit result` on the CLI side can race with a
successful backend ack: the backend records the result and marks the run
"completed" (its own step counter), but the CLI exits with submit_failed
well short of the configured step budget. Subsequent `weco resume`
returns "Run cannot be resumed (status: completed)" and the only
recovery is to start a new run from scratch — losing the step history.
resume_optimization() now treats `status='completed' AND current_step <
total_steps` ("short-completed") as resumable, alongside the existing
"error"/"terminated" cases. The run status is bumped back to "running"
by resume_optimization_run() the same way it always was, and the queue
loop picks up at the prepared start_step.
The branch where this matters in practice: a multi-day Weco run on
fraud-detection IEEE-CIS, where the network blip during submit caused
~5 of 18 cells to "complete" at 60-130 steps out of 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49bf182a0b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| current_step = int(status.get("current_step", 0)) | ||
| steps_remaining = int(total_steps) - current_step |
There was a problem hiding this comment.
Guard step parsing before checking resumable status
This change moved the resumable-status gate below current_step/total_steps parsing, so weco resume now attempts int(...) conversion even for runs that should be rejected immediately. If the API returns current_step or optimizer.steps as null (which can happen on older/incomplete run metadata), int(None) raises and the command crashes instead of printing the non-resumable status message; previously completed runs never hit this path because they were rejected first.
Useful? React with 👍 / 👎.
|
Hi @ZhengyaoJiang, do you have an example run ID here? |
Summary
Allow `weco resume` to recover runs that the backend marked `completed` short of their step budget. Previously these were stuck — the only escape was a fresh run.
Why
A transient `Failed to submit result` on the CLI side can race with a successful backend ack: the backend records the result and marks the run `completed` (its own step counter), but the CLI exits with `submit_failed` well short of `total_steps`. Subsequent `weco resume` returns `Run cannot be resumed (status: completed)` and the user has to start over.
Discovered during a multi-day fraud-detection IEEE-CIS rerun: ~5 of 18 cells "completed" at 60–130 steps out of 200 because of intermittent network blips.
Change
`resume_optimization()` now treats `status='completed' AND current_step < total_steps` ("short-completed") as resumable. The run is flipped back to `running` by `resume_optimization_run()` the same way it always was, and the queue loop picks up at the prepared `start_step`.
Test plan
🤖 Generated with Claude Code