
[WIP] Integrate PostTrainBench #254

Open

lewtun wants to merge 50 commits into main from post-train-bench-lewis


Conversation

@lewtun
Member

@lewtun lewtun commented May 11, 2026

No description provided.

lewtun and others added 30 commits April 28, 2026 08:18
Use Codex CLI judging, split solve and eval images, derive solve budgets from num_hours, isolate per-task HF caches, and add prompt/logging/model-validation safeguards.

Co-authored-by: Codex <codex@openai.com>
Preinstall CUDA PyTorch before vLLM and add wheel/setup tooling before flash-attn so the eval container resolves and builds successfully.

Co-authored-by: Codex <codex@openai.com>
Pass global Codex options before exec so the PostTrainBench judge runs with the installed CLI, and create CODEX_HOME before invocation.

Co-authored-by: Codex <codex@openai.com>
Run integrity checks and artifact collection from a trusted runner copy, preserve failed-run evidence, record reproducibility hashes, and add factor-weighted aggregation with variance reporting.

Co-authored-by: Codex <codex@openai.com>
Keep the solve source mount read-only while installing a non-editable wheel before the measured solve phase.

Co-authored-by: Codex <codex@openai.com>
After moving installation outside the solve measurement, default the solve timeout to the declared task duration instead of duration plus runner grace time.

Co-authored-by: Codex <codex@openai.com>
Use a 10-minute smoke budget now that the measured solve timeout no longer includes hidden runner grace time.

Co-authored-by: Codex <codex@openai.com>
lewtun and others added 19 commits May 1, 2026 17:00
Add a smoke-only baseline final_model fallback after protected-file verification so smoke runs exercise judge, validation, evaluation, artifact, and hash paths without relying on the short-budget agent producing a model. Full runs remain unchanged.

Co-authored-by: Codex <codex@openai.com>
Add strict artifact validation mode and explicit reprompt labeling for PostTrainBench runs.

Co-authored-by: OpenAI Codex <codex@openai.com>
Add a small one-task-per-model validation matrix for checking final_model artifact creation before larger runs.

Co-authored-by: OpenAI Codex <codex@openai.com>
Stop running the local post-run secret scanner in the PTB wrapper so benchmark rows are judged by PTB integrity, final-model precheck, and metrics only. Remove secret-scan artifact collection/docs and obsolete tests.

Co-authored-by: OpenAI Codex <codex@openai.com>
@lewtun lewtun marked this pull request as ready for review May 14, 2026 11:03
@github-actions

github-actions Bot commented May 14, 2026

Claude finished @lewtun's task in 6m 20s


No blocking issues — 2 P1

Verdict: ready to merge

What I checked

  • post_train_bench/integrity.py — judge classification, protected-files snapshot/verify, final-model precheck, manifest path traversal handling
  • post_train_bench/run_judge.py — Codex CLI invocation, auth file mode, prompt-injection hardening, missing-output detection
  • post_train_bench/aggregate_results.py — baseline fallback wiring, expected-cell vs scored-cell accounting, variance summary
  • post_train_bench/run_task_docker.sh & submit_eval_set.sh & launch.slurm — container mounts, env propagation, immutable-image gate, dirty-worktree gate, matrix parsing
  • agent/main.py headless reprompt loop and --config plumbing
  • agent/core/agent_loop.py streaming retry (new mid-stream retry behavior) and jittered backoff
  • agent/config.py, agent/context_manager/manager.py, agent/core/{session,tools,redact}.py, agent/tools/{local_tools,sandbox_client}.py
  • Test coverage for new behavior (integrity, aggregate, judge, collect_artifacts, redact, web_search disabled tools, CLI config-forwarding)

P1 — Streaming retries can replay assistant content to the consumer

agent/core/agent_loop.py:866-1016 now wraps the entire streaming response (both acompletion setup and the async for chunk in response iteration) inside the retry loop, with a 7-step backoff up to _LLM_RETRY_DELAYS = [15, 30, 60, 120, 300, 600, 600]. tool_calls_acc/full_content are correctly reset per attempt (agent_loop.py:867-872) so the final LLMResult is consistent.

What is not rolled back on retry are the assistant_chunk events already emitted via session.send_event(Event(event_type="assistant_chunk", ...)) at agent_loop.py:902-909 before the failure. For mid-stream errors (the new test at tests/unit/test_llm_error_classification.py:73-80 documents MidStreamFallbackError ... AnthropicError - Overloaded as transient), the consumer has already written the partial response to the terminal/log. After the retry succeeds, the full reply is emitted again, so interactive CLI users will see the start of the response duplicated. For headless PTB this is just noise in solve_out.txt; for agent/main.py:1200-style interactive use it’s a visible regression.

If this is acceptable, consider at least emitting a single tool_log marker on retry so the user can spot where the rewind happened. Otherwise it may be worth resetting stream_buf from the agent loop on retry by emitting a sentinel event (e.g., assistant_stream_reset) and handling it in agent/main.py:1573-1577.
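
For concreteness, a minimal sketch of the rewind-marker idea, assuming an `assistant_stream_reset` event type and the `_LLM_RETRY_DELAYS` schedule quoted above; the `Event`/`send_event` plumbing and the streaming callable are illustrative stand-ins, not the repository code:

```python
import asyncio
from dataclasses import dataclass, field

# Sketch only: Event, send_event, assistant_stream_reset, and _LLM_RETRY_DELAYS
# mirror the names discussed above, but the plumbing here is illustrative.
_LLM_RETRY_DELAYS = [15, 30, 60, 120, 300, 600, 600]

@dataclass
class Event:
    event_type: str
    data: dict = field(default_factory=dict)

async def stream_with_retry(session, start_stream, is_transient):
    """Stream chunks, retrying transient mid-stream failures.

    Before each retry the consumer receives an assistant_stream_reset event so
    it can drop partial text it has already rendered, instead of showing the
    start of the reply twice once the retry succeeds.
    """
    for attempt, delay in enumerate([0] + _LLM_RETRY_DELAYS):
        if delay:
            await asyncio.sleep(delay)  # the real loop adds jitter
        full_content, tool_calls = "", []  # reset accumulators per attempt
        try:
            async for chunk in start_stream():
                full_content += chunk
                session.send_event(Event("assistant_chunk", {"text": chunk}))
            return full_content, tool_calls
        except Exception as err:
            if attempt == len(_LLM_RETRY_DELAYS) or not is_transient(err):
                raise
            session.send_event(Event("assistant_stream_reset"))
```

A consumer handling the sentinel would simply discard whatever partial text it has buffered but not yet flushed.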


P1 — Redact regex no longer scrubs bare/lowercase secret-y env names

agent/core/redact.py:38-41 changed from an explicit (?i)\b(HF_TOKEN|...|PASSWORD|SECRET|API_KEY)\s*[:=]\s*([^\s\"']+) to a case-sensitive prefix-required pattern: \b([A-Z][A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY|ACCESS_KEY)[A-Z0-9_]*)\s*[:=]\s*([^\s\"']+). The leading [A-Z][A-Z0-9_]* is mandatory ([A-Z] requires one char), so:

  • SECRET=hunter2, PASSWORD=hunter2, TOKEN=abc, API_KEY=abc (bare uppercase) — no longer matched
  • password=hunter2, secret=hunter2 (lowercase, old (?i) flag is gone) — no longer matched
  • MY_SECRET=hunter2, INFERENCE_TOKEN=..., AWS_ACCESS_KEY_ID=... — still matched (covered by new tests in tests/unit/test_redact.py:54-63)

The value-side patterns (hf_…, sk-ant-…, gh[pousr]_…, AKIA…) still fire if the secret matches one of those formats, but a free-form PASSWORD=plain_text_password would now ship into the public smolagents/ml-intern-sessions upload unredacted. The PR description for c1778a8 ("Avoid false secret scan hits on redacted env logs") suggests this was a deliberate narrowing; if so, adding (?i) back and re-anchoring \b(...|PASSWORD|SECRET|TOKEN|API_KEY)\b as a second alternation would preserve the old bare-name coverage without re-introducing the max_tokens=4000 false positive: the word-boundary anchoring never matches inside max_tokens (the underscore is a word character), even case-insensitively.
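
As a sanity check, a small demo of the two pattern shapes discussed, using abbreviated alternations (the full name lists are elided in the diff, so the ones below are illustrative):

```python
import re

# Abbreviated stand-ins for the two patterns discussed above; the real
# alternations contain more names than listed here.
prefix_required = re.compile(
    r"\b([A-Z][A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY|ACCESS_KEY)[A-Z0-9_]*)\s*[:=]\s*([^\s\"']+)"
)
name_alternation = re.compile(
    r"(?i)\b(HF_TOKEN|TOKEN|PASSWORD|SECRET|API_KEY)\s*[:=]\s*([^\s\"']+)"
)

for sample in ["PASSWORD=hunter2", "password=hunter2", "MY_SECRET=hunter2", "max_tokens=4000"]:
    print(sample, bool(prefix_required.search(sample)), bool(name_alternation.search(sample)))

# PASSWORD=hunter2   prefix-required: False  name alternation: True
# password=hunter2   prefix-required: False  name alternation: True   ((?i) restores lowercase coverage)
# MY_SECRET=hunter2  prefix-required: True   name alternation: False  (no \b after the underscore)
# max_tokens=4000    prefix-required: False  name alternation: False  (no \b inside max_tokens either)
```

Running both patterns (the "second alternation" suggested above) would flag the first three rows while still leaving max_tokens=4000 alone.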


P2 — aggregate_results.summarize_run silently degrades when matrix.jsonl is missing

post_train_bench/aggregate_results.py:223-258: if matrix.jsonl doesn't exist, expected_cells is set(), and the per-method loop falls back to expected_cells or set(cells_by_method[method]). Each method then only counts cells that actually produced results — there's no fallback for "method ran but every cell crashed" because nothing populates cells_by_method[method] in that case. Coverage looks 100% when in fact the whole method silently dropped. Not a regression (new code), but worth either erroring out when both matrix.jsonl and a --expected-cells arg are absent, or recording the degraded case in the summary.
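
A sketch of the stricter behavior being suggested; the file layout, field names, and the expected_cells override are assumptions, not the module's actual signatures:

```python
import json
from pathlib import Path

def load_expected_cells(run_dir: Path, expected_cells: set[tuple[str, str]] | None = None) -> set:
    """Refuse to compute coverage when the expected-cell matrix is unavailable.

    Without this guard, a missing matrix.jsonl leaves expected_cells empty, each
    method is compared only against the cells it actually produced, and a method
    whose every cell crashed still reports 100% coverage.
    """
    matrix_path = run_dir / "matrix.jsonl"
    if not matrix_path.exists():
        if expected_cells:
            return expected_cells  # explicit override, e.g. an --expected-cells argument
        raise FileNotFoundError(
            f"{matrix_path} is missing and no expected cells were supplied; "
            "refusing to compute coverage against observed results only"
        )
    cells = set()
    with matrix_path.open() as fh:
        for line in fh:
            row = json.loads(line)
            cells.add((row["method"], row["task"]))  # assumed field names
    return cells
```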


Nits (skip if you disagree)

  • post_train_bench/run_judge.py:58-74 ensure_codex_auth writes auth.json non-atomically; if two judges race on the same CODEX_HOME they could clobber. Per-task CODEX_HOME would be safer, but JOB_TMP already isolates per task so this is moot in practice.
  • agent/context_manager/manager.py:242-246 treats system_prompt_file as a path if it has a non-. parent. Relative paths like prompts/foo.yaml are resolved against the current working directory rather than agent/prompts/, which differs from the bare-filename case. Worth a one-line docstring on Config.system_prompt_file describing the lookup rules (a sketch of the rule follows this list).
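
On the second nit, a minimal sketch of the lookup rule as described; the PROMPTS_DIR constant and function name are hypothetical:

```python
from pathlib import Path

# Assumed location of the bundled prompts directory; illustrative only.
PROMPTS_DIR = Path("agent/prompts")

def resolve_system_prompt(system_prompt_file: str) -> Path:
    """Bare filenames resolve under agent/prompts/; any path with a parent
    component is used as given, so a relative path such as "prompts/foo.yaml"
    resolves against the current working directory."""
    p = Path(system_prompt_file)
    if str(p.parent) == ".":  # bare filename, no directory part
        return PROMPTS_DIR / p
    return p  # absolute path or path with a parent component
```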

Comment thread: agent/context_manager/manager.py
Comment thread: agent/core/redact.py (outdated)
@github-actions

github-actions Bot commented May 14, 2026

Claude finished @lewtun's task in 3m 2s


Re-review after 7a7c1b7 / 4c05cd5 — No blocking issues

Verdict: ready to merge.

Status of prior P1 / P2 findings

  • P1 — streaming retry replays content: addressed. agent/core/agent_loop.py:884 now emits an assistant_stream_reset event before each retry path (effort-config, thinking-signature, transient), and both consumers handle it: agent/main.py:419-421 (interactive) and agent/main.py:1565-1566 (headless) call stream_buf.discard(). Caveat: _StreamBuffer.discard() (agent/main.py:371) only clears the unflushed tail — paragraphs already flushed past a \n\n boundary are still visible after retry. That's a structural terminal limitation; the fix handles the common case (failure before the first paragraph break) cleanly. Not re-flagging.
  • P1 — redact regex narrowing: reverted at agent/core/redact.py:38-42. Pattern is back to (?i)\b(HF_TOKEN|…|PASSWORD|SECRET|API_KEY)\s*[:=]…, so bare PASSWORD=, secret=, etc. are scrubbed again. max_tokens=4000 still won't false-match — none of the alternation names appear as a prefix of max_tokens. The two tests in tests/unit/test_redact.py that asserted the new prefix-anchored behavior were removed in the same commit; losing test_lowercase_token_word_is_not_env_secret removes the regression guard against the old max_tokens false positive, but the behavior is still correct under the reverted regex.
  • P2 — aggregate_results silent degradation when matrix.jsonl is missing: addressed at post_train_bench/aggregate_results.py:148-152 (raises FileNotFoundError) and :262 no longer falls back to the empty-cells set. Test added at tests/unit/test_post_train_bench_aggregate_results.py:111-122.
  • Nit — system_prompt_file path-vs-name lookup: addressed. Docstring at agent/context_manager/manager.py:241-245 and config comment at agent/config.py:46-47 now document the rule (bare filename → agent/prompts/; absolute or any path with a parent → used as-is). This is the answer to the inline review comment on manager.py:246 — the change is the rule, the new comments are the rationale.

New commits scanned: 7a7c1b7, 4c05cd5. No new P0/P1 findings introduced.


Tasks
  • Pull latest commits and diff against origin/main
  • Check what changed since previous review
  • Verify previously-flagged P1s addressed
  • Look for new P0 issues introduced
  • Post re-review
• Branch: [`post-train-bench-lewis`](https://github.com/huggingface/ml-intern/tree/post-train-bench-lewis)
