
[WIP] Integrate PostTrainBench #254

Open

lewtun wants to merge 50 commits into main from post-train-bench-lewis


Conversation

@lewtun
Member

@lewtun lewtun commented May 11, 2026

No description provided.

lewtun and others added 30 commits April 28, 2026 08:18
Use Codex CLI judging, split solve and eval images, derive solve budgets from num_hours, isolate per-task HF caches, and add prompt/logging/model-validation safeguards.

Co-authored-by: Codex <codex@openai.com>
Preinstall CUDA PyTorch before vLLM and add wheel/setup tooling before flash-attn so the eval container resolves and builds successfully.

Co-authored-by: Codex <codex@openai.com>
Pass global Codex options before exec so the PostTrainBench judge runs with the installed CLI, and create CODEX_HOME before invocation.

Co-authored-by: Codex <codex@openai.com>
Run integrity checks and artifact collection from a trusted runner copy, preserve failed-run evidence, record reproducibility hashes, and add factor-weighted aggregation with variance reporting.

Co-authored-by: Codex <codex@openai.com>
Keep the solve source mount read-only while installing a non-editable wheel before the measured solve phase.

Co-authored-by: Codex <codex@openai.com>
After moving installation outside the solve measurement, default the solve timeout to the declared task duration instead of duration plus runner grace time.

Co-authored-by: Codex <codex@openai.com>
Use a 10-minute smoke budget now that the measured solve timeout no longer includes hidden runner grace time.

Co-authored-by: Codex <codex@openai.com>
lewtun and others added 19 commits May 1, 2026 17:00
Add a smoke-only baseline final_model fallback after protected-file verification so smoke runs exercise judge, validation, evaluation, artifact, and hash paths without relying on the short-budget agent producing a model. Full runs remain unchanged.

Co-authored-by: Codex <codex@openai.com>
Add strict artifact validation mode and explicit reprompt labeling for PostTrainBench runs.

Co-authored-by: OpenAI Codex <codex@openai.com>
Add a small one-task-per-model validation matrix for checking final_model artifact creation before larger runs.

Co-authored-by: OpenAI Codex <codex@openai.com>
Stop running the local post-run secret scanner in the PTB wrapper so benchmark rows are judged by PTB integrity, final-model precheck, and metrics only. Remove secret-scan artifact collection/docs and obsolete tests.

Co-authored-by: OpenAI Codex <codex@openai.com>
@lewtun lewtun marked this pull request as ready for review May 14, 2026 11:03
@github-actions

github-actions Bot commented May 14, 2026

Claude finished @lewtun's task in 6m 20s


No blocking issues — 2 P1

Verdict: ready to merge

What I checked

  • post_train_bench/integrity.py — judge classification, protected-files snapshot/verify, final-model precheck, manifest path traversal handling
  • post_train_bench/run_judge.py — Codex CLI invocation, auth file mode, prompt-injection hardening, missing-output detection
  • post_train_bench/aggregate_results.py — baseline fallback wiring, expected-cell vs scored-cell accounting, variance summary
  • post_train_bench/run_task_docker.sh & submit_eval_set.sh & launch.slurm — container mounts, env propagation, immutable-image gate, dirty-worktree gate, matrix parsing
  • agent/main.py headless reprompt loop and --config plumbing
  • agent/core/agent_loop.py streaming retry (new mid-stream retry behavior) and jittered backoff
  • agent/config.py, agent/context_manager/manager.py, agent/core/{session,tools,redact}.py, agent/tools/{local_tools,sandbox_client}.py
  • Test coverage for new behavior (integrity, aggregate, judge, collect_artifacts, redact, web_search disabled tools, CLI config-forwarding)

P1 — Streaming retries can replay assistant content to the consumer

agent/core/agent_loop.py:866-1016 now wraps the entire streaming response (both acompletion setup and the async for chunk in response iteration) inside the retry loop, with a 7-step backoff up to _LLM_RETRY_DELAYS = [15, 30, 60, 120, 300, 600, 600]. tool_calls_acc/full_content are correctly reset per attempt (agent_loop.py:867-872) so the final LLMResult is consistent.

What is not rolled back on retry are the assistant_chunk events already emitted via session.send_event(Event(event_type="assistant_chunk", ...)) at agent_loop.py:902-909 before the failure. For mid-stream errors (the new test at tests/unit/test_llm_error_classification.py:73-80 documents MidStreamFallbackError ... AnthropicError - Overloaded as transient), the consumer has already written the partial response to the terminal/log. After the retry succeeds, the full reply is emitted again, so interactive CLI users will see the start of the response duplicated. For headless PTB this is just noise in solve_out.txt; for agent/main.py:1200-style interactive use it’s a visible regression.

If this is acceptable, consider at least emitting a single tool_log marker on retry so the user can spot where the rewind happened. Otherwise it may be worth resetting stream_buf from the agent loop on retry by emitting a sentinel event (e.g., assistant_stream_reset) and handling it in agent/main.py:1573-1577.
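
For concreteness, a minimal sketch of the rewind-marker idea, assuming an `assistant_stream_reset` event type and the `_LLM_RETRY_DELAYS` schedule quoted above; the `Event`/`send_event` plumbing and the streaming callable are illustrative stand-ins, not the repository code:

```python
import asyncio
from dataclasses import dataclass, field

# Sketch only: Event, send_event, assistant_stream_reset, and _LLM_RETRY_DELAYS
# mirror the names discussed above, but the plumbing here is illustrative.
_LLM_RETRY_DELAYS = [15, 30, 60, 120, 300, 600, 600]

@dataclass
class Event:
    event_type: str
    data: dict = field(default_factory=dict)

async def stream_with_retry(session, start_stream, is_transient):
    """Stream chunks, retrying transient mid-stream failures.

    Before each retry the consumer receives an assistant_stream_reset event so
    it can drop partial text it has already rendered, instead of showing the
    start of the reply twice once the retry succeeds.
    """
    for attempt, delay in enumerate([0] + _LLM_RETRY_DELAYS):
        if delay:
            await asyncio.sleep(delay)  # the real loop adds jitter
        full_content, tool_calls = "", []  # reset accumulators per attempt
        try:
            async for chunk in start_stream():
                full_content += chunk
                session.send_event(Event("assistant_chunk", {"text": chunk}))
            return full_content, tool_calls
        except Exception as err:
            if attempt == len(_LLM_RETRY_DELAYS) or not is_transient(err):
                raise
            session.send_event(Event("assistant_stream_reset"))
```

A consumer handling the sentinel would simply discard whatever partial text it has buffered but not yet flushed.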


P1 — Redact regex no longer scrubs bare/lowercase secret-y env names

agent/core/redact.py:38-41 changed from an explicit (?i)\b(HF_TOKEN|...|PASSWORD|SECRET|API_KEY)\s*[:=]\s*([^\s\"']+) to a case-sensitive prefix-required pattern: \b([A-Z][A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY|ACCESS_KEY)[A-Z0-9_]*)\s*[:=]\s*([^\s\"']+). The leading [A-Z][A-Z0-9_]* is mandatory ([A-Z] requires one char), so:

  • SECRET=hunter2, PASSWORD=hunter2, TOKEN=abc, API_KEY=abc (bare uppercase) — no longer matched
  • password=hunter2, secret=hunter2 (lowercase, old (?i) flag is gone) — no longer matched
  • MY_SECRET=hunter2, INFERENCE_TOKEN=..., AWS_ACCESS_KEY_ID=... — still matched (covered by new tests in tests/unit/test_redact.py:54-63)

The value-side patterns (hf_…, sk-ant-…, gh[pousr]_…, AKIA…) still fire if the secret matches one of those formats, but a free-form PASSWORD=plain_text_password would now ship into the public smolagents/ml-intern-sessions upload unredacted. The PR description for c1778a8 ("Avoid false secret scan hits on redacted env logs") suggests this was a deliberate narrowing; if so, adding (?i) back and re-anchoring \b(...|PASSWORD|SECRET|TOKEN|API_KEY)\b as a second alternation would preserve the old bare-name coverage without re-introducing the max_tokens=4000 false positive: the word-boundary anchoring never matches inside max_tokens (the underscore is a word character), even case-insensitively.
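
As a sanity check, a small demo of the two pattern shapes discussed, using abbreviated alternations (the full name lists are elided in the diff, so the ones below are illustrative):

```python
import re

# Abbreviated stand-ins for the two patterns discussed above; the real
# alternations contain more names than listed here.
prefix_required = re.compile(
    r"\b([A-Z][A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY|ACCESS_KEY)[A-Z0-9_]*)\s*[:=]\s*([^\s\"']+)"
)
name_alternation = re.compile(
    r"(?i)\b(HF_TOKEN|TOKEN|PASSWORD|SECRET|API_KEY)\s*[:=]\s*([^\s\"']+)"
)

for sample in ["PASSWORD=hunter2", "password=hunter2", "MY_SECRET=hunter2", "max_tokens=4000"]:
    print(sample, bool(prefix_required.search(sample)), bool(name_alternation.search(sample)))

# PASSWORD=hunter2   prefix-required: False  name alternation: True
# password=hunter2   prefix-required: False  name alternation: True   ((?i) restores lowercase coverage)
# MY_SECRET=hunter2  prefix-required: True   name alternation: False  (no \b after the underscore)
# max_tokens=4000    prefix-required: False  name alternation: False  (no \b inside max_tokens either)
```

Running both patterns (the "second alternation" suggested above) would flag the first three rows while still leaving max_tokens=4000 alone.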


P2 — aggregate_results.summarize_run silently degrades when matrix.jsonl is missing

post_train_bench/aggregate_results.py:223-258: if matrix.jsonl doesn't exist, expected_cells is set(), and the per-method loop falls back to expected_cells or set(cells_by_method[method]). Each method then only counts cells that actually produced results — there's no fallback for "method ran but every cell crashed" because nothing populates cells_by_method[method] in that case. Coverage looks 100% when in fact the whole method silently dropped. Not a regression (new code), but worth either erroring out when both matrix.jsonl and a --expected-cells arg are absent, or recording the degraded case in the summary.
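
A sketch of the stricter behavior being suggested; the file layout, field names, and the expected_cells override are assumptions, not the module's actual signatures:

```python
import json
from pathlib import Path

def load_expected_cells(run_dir: Path, expected_cells: set[tuple[str, str]] | None = None) -> set:
    """Refuse to compute coverage when the expected-cell matrix is unavailable.

    Without this guard, a missing matrix.jsonl leaves expected_cells empty, each
    method is compared only against the cells it actually produced, and a method
    whose every cell crashed still reports 100% coverage.
    """
    matrix_path = run_dir / "matrix.jsonl"
    if not matrix_path.exists():
        if expected_cells:
            return expected_cells  # explicit override, e.g. an --expected-cells argument
        raise FileNotFoundError(
            f"{matrix_path} is missing and no expected cells were supplied; "
            "refusing to compute coverage against observed results only"
        )
    cells = set()
    with matrix_path.open() as fh:
        for line in fh:
            row = json.loads(line)
            cells.add((row["method"], row["task"]))  # assumed field names
    return cells
```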


Nits (skip if you disagree)

  • post_train_bench/run_judge.py:58-74 ensure_codex_auth writes auth.json non-atomically; if two judges race on the same CODEX_HOME they could clobber. Per-task CODEX_HOME would be safer, but JOB_TMP already isolates per task so this is moot in practice.
  • agent/context_manager/manager.py:242-246 treats system_prompt_file as a path if it has a non-. parent. Relative paths like prompts/foo.yaml are resolved against the current working directory rather than agent/prompts/, which differs from the bare-filename case. Worth a one-line docstring on Config.system_prompt_file describing the lookup rules (a sketch of the rule follows this list).
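
On the second nit, a minimal sketch of the lookup rule as described; the PROMPTS_DIR constant and function name are hypothetical:

```python
from pathlib import Path

# Assumed location of the bundled prompts directory; illustrative only.
PROMPTS_DIR = Path("agent/prompts")

def resolve_system_prompt(system_prompt_file: str) -> Path:
    """Bare filenames resolve under agent/prompts/; any path with a parent
    component is used as given, so a relative path such as "prompts/foo.yaml"
    resolves against the current working directory."""
    p = Path(system_prompt_file)
    if str(p.parent) == ".":  # bare filename, no directory part
        return PROMPTS_DIR / p
    return p  # absolute path or path with a parent component
```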

Comment thread: agent/context_manager/manager.py
Comment thread: agent/core/redact.py (outdated)
@github-actions

github-actions Bot commented May 14, 2026

Claude finished @lewtun's task in 3m 2s


Re-review after 7a7c1b7 / 4c05cd5 — No blocking issues

Verdict: ready to merge.

Status of prior P1 / P2 findings

  • P1 — streaming retry replays content: addressed. agent/core/agent_loop.py:884 now emits an assistant_stream_reset event before each retry path (effort-config, thinking-signature, transient), and both consumers handle it: agent/main.py:419-421 (interactive) and agent/main.py:1565-1566 (headless) call stream_buf.discard(). Caveat: _StreamBuffer.discard() (agent/main.py:371) only clears the unflushed tail — paragraphs already flushed past a \n\n boundary are still visible after retry. That's a structural terminal limitation; the fix handles the common case (failure before the first paragraph break) cleanly. Not re-flagging.
  • P1 — redact regex narrowing: reverted at agent/core/redact.py:38-42. Pattern is back to (?i)\b(HF_TOKEN|…|PASSWORD|SECRET|API_KEY)\s*[:=]…, so bare PASSWORD=, secret=, etc. are scrubbed again. max_tokens=4000 still won't false-match — none of the alternation names appear as a prefix of max_tokens. The two tests in tests/unit/test_redact.py that asserted the new prefix-anchored behavior were removed in the same commit; losing test_lowercase_token_word_is_not_env_secret removes the regression guard against the old max_tokens false positive, but the behavior is still correct under the reverted regex.
  • P2 — aggregate_results silent degradation when matrix.jsonl is missing: addressed at post_train_bench/aggregate_results.py:148-152 (raises FileNotFoundError) and :262 no longer falls back to the empty-cells set. Test added at tests/unit/test_post_train_bench_aggregate_results.py:111-122.
  • Nit — system_prompt_file path-vs-name lookup: addressed. Docstring at agent/context_manager/manager.py:241-245 and config comment at agent/config.py:46-47 now document the rule (bare filename → agent/prompts/; absolute or any path with a parent → used as-is). This is the answer to the inline review comment on manager.py:246 — the change is the rule, the new comments are the rationale.

New commits scanned: 7a7c1b7, 4c05cd5. No new P0/P1 findings introduced.


Tasks
  • Pull latest commits and diff against origin/main
  • Check what changed since previous review
  • Verify previously-flagged P1s addressed
  • Look for new P0 issues introduced
  • Post re-review
• Branch: [`post-train-bench-lewis`](https://github.com/huggingface/ml-intern/tree/post-train-bench-lewis)
