optimizer: allow resume of short-completed runs #143

Open
ZhengyaoJiang wants to merge 1 commit into dev from vk/resume-short-completed

@ZhengyaoJiang
Contributor

Summary

Allow `weco resume` to recover runs that the backend marked `completed` short of their step budget. Previously these were stuck — the only escape was a fresh run.

Why

A transient `Failed to submit result` on the CLI side can race with a successful backend ack: the backend records the result and marks the run `completed` (its own step counter), but the CLI exits with `submit_failed` well short of `total_steps`. Subsequent `weco resume` returns `Run cannot be resumed (status: completed)` and the user has to start over.

Discovered during a multi-day fraud-detection IEEE-CIS rerun: ~5 of 18 cells "completed" at 60–130 steps out of 200 because of intermittent network blips.

Change

`resume_optimization()` now treats `status='completed' AND current_step < total_steps` ("short-completed") as resumable. The run is flipped back to `running` by `resume_optimization_run()` the same way it always was, and the queue loop picks up at the prepared `start_step`.
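The rule above can be sketched as a small predicate. This is a minimal illustration, not the code in `weco/optimizer.py`: the helper name `is_resumable` and the constant `RESUMABLE_STATUSES` are hypothetical, and the sketch assumes the step counts have already been parsed to integers.

```python
# Hypothetical helper for illustration only; names are not taken from weco.
RESUMABLE_STATUSES = {"error", "terminated"}


def is_resumable(status: str, current_step: int, total_steps: int) -> bool:
    """Return True if a run may be resumed under the rule described above."""
    # Existing behaviour: failed or terminated runs are always resumable.
    if status in RESUMABLE_STATUSES:
        return True
    # "Short-completed": marked completed before the step budget was used up.
    return status == "completed" and current_step < total_steps
```

With the values from the test plan, `is_resumable("completed", 60, 200)` is true and `is_resumable("completed", 200, 200)` is false.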

Test plan

  • Existing `error`/`terminated` resume paths still work (no regression).
  • On a run with `status='completed', current_step=60, steps=200`, `weco resume` prompts and continues at step 60.
  • On a run with `status='completed', current_step=200, steps=200` (truly done), resume is still rejected with the `completed/total_steps` message — no spurious re-runs.

🤖 Generated with Claude Code

…_steps

A transient `Failed to submit result` on the CLI side can race with a
successful backend ack: the backend records the result and marks the run
"completed" (its own step counter), but the CLI exits with submit_failed
well short of the configured step budget. Subsequent `weco resume`
returns "Run cannot be resumed (status: completed)" and the only
recovery is to start a new run from scratch — losing the step history.

resume_optimization() now treats `status='completed' AND current_step <
total_steps` ("short-completed") as resumable, alongside the existing
"error"/"terminated" cases. The run status is bumped back to "running"
by resume_optimization_run() the same way it always was, and the queue
loop picks up at the prepared start_step.

The branch where this matters in practice: a multi-day Weco run on
fraud-detection IEEE-CIS, where the network blip during submit caused
~5 of 18 cells to "complete" at 60-130 steps out of 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49bf182a0b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread weco/optimizer.py
Comment on lines 447 to 448
current_step = int(status.get("current_step", 0))
steps_remaining = int(total_steps) - current_step

P2: Guard step parsing before checking resumable status

This change moved the resumable-status gate below the `current_step`/`total_steps` parsing, so `weco resume` now attempts the `int(...)` conversion even for runs that should be rejected immediately. If the API returns `current_step` or `optimizer.steps` as `null` (which can happen on older or incomplete run metadata), `int(None)` raises a `TypeError` and the command crashes instead of printing the non-resumable status message. Previously, completed runs never hit this path because they were rejected first.
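One way to address this could look like the sketch below: reject statuses that are never resumable before touching the step fields, and parse the step fields defensively. This is only an illustration under stated assumptions. The names `parse_step` and `check_resumable` are hypothetical, the `"status"` dict key is assumed, and treating an unparseable step count as non-resumable is a design choice of this sketch, not something the PR specifies.

```python
from typing import Optional, Tuple

# Hypothetical names for illustration; not the actual weco/optimizer.py code.
ALWAYS_RESUMABLE = {"error", "terminated"}


def parse_step(value) -> Optional[int]:
    """Convert a step field to int, returning None if it is missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):  # int(None) raises TypeError
        return None


def check_resumable(run: dict, total_steps) -> Tuple[bool, str]:
    """Return (resumable, message); message is empty when the run is resumable."""
    run_status = run.get("status")  # assumed key name
    rejection = f"Run cannot be resumed (status: {run_status})"

    if run_status in ALWAYS_RESUMABLE:
        return True, ""
    # Reject everything except "completed" before parsing any step fields.
    if run_status != "completed":
        return False, rejection

    current = parse_step(run.get("current_step"))
    total = parse_step(total_steps)
    # Assumption: if either count is unusable, treat the run as truly completed.
    if current is None or total is None:
        return False, rejection
    if current < total:
        return True, ""  # short-completed
    return False, rejection
```

Under this sketch, a completed run whose `current_step` is `null` is rejected with the usual status message rather than crashing.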

Useful? React with 👍 / 👎.

@aliroberts
Contributor

Hi @ZhengyaoJiang, do you have an example run ID here?
