[WIP] Integrate PostTrainBench #254
Co-authored-by: Codex <codex@openai.com>
Use Codex CLI judging, split solve and eval images, derive solve budgets from num_hours, isolate per-task HF caches, and add prompt/logging/model-validation safeguards.
Co-authored-by: Codex <codex@openai.com>
Preinstall CUDA PyTorch before vLLM and add wheel/setup tooling before flash-attn so the eval container resolves and builds successfully.
Co-authored-by: Codex <codex@openai.com>
Pass global Codex options before exec so the PostTrainBench judge runs with the installed CLI, and create CODEX_HOME before invocation.
Co-authored-by: Codex <codex@openai.com>
Run integrity checks and artifact collection from a trusted runner copy, preserve failed-run evidence, record reproducibility hashes, and add factor-weighted aggregation with variance reporting.
Co-authored-by: Codex <codex@openai.com>
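To illustrate what recording reproducibility hashes could look like, here is a minimal sketch that computes a SHA-256 digest per file under an artifact directory. The function names `hash_file` and `hash_tree` are hypothetical and are not taken from this PR; the actual runner may hash different inputs or use a different layout.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large model files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its digest (hypothetical helper)."""
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))  # sorted for a stable, diffable order
        if path.is_file()
    }
```

Writing the resulting mapping to a JSON file next to the run's artifacts would let two runs be compared file by file.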
Keep the solve source mount read-only while installing a non-editable wheel before the measured solve phase.
Co-authored-by: Codex <codex@openai.com>
After moving installation outside the solve measurement, default the solve timeout to the declared task duration instead of duration plus runner grace time.
Co-authored-by: Codex <codex@openai.com>
Use a 10-minute smoke budget now that the measured solve timeout no longer includes hidden runner grace time.
Co-authored-by: Codex <codex@openai.com>
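The timeout behavior described in the two commits above can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `solve_timeout_seconds` and its parameters are hypothetical names, and only `num_hours` and the 10-minute smoke budget come from the commit messages.

```python
def solve_timeout_seconds(
    num_hours: float,
    smoke: bool = False,
    smoke_minutes: int = 10,
) -> int:
    """Return the measured solve timeout in seconds (hypothetical helper).

    The timeout equals the declared task duration with no runner grace
    time added; smoke runs use a fixed short budget instead.
    """
    if smoke:
        return smoke_minutes * 60
    if num_hours <= 0:
        raise ValueError("num_hours must be positive")
    return int(round(num_hours * 3600))
```

For example, `solve_timeout_seconds(10)` gives 36000 seconds, and `solve_timeout_seconds(10, smoke=True)` gives 600 seconds.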
Add a smoke-only baseline final_model fallback after protected-file verification so smoke runs exercise judge, validation, evaluation, artifact, and hash paths without relying on the short-budget agent producing a model. Full runs remain unchanged.
Co-authored-by: Codex <codex@openai.com>
Add strict artifact validation mode and explicit reprompt labeling for PostTrainBench runs.
Co-authored-by: OpenAI Codex <codex@openai.com>
Add a small one-task-per-model validation matrix for checking final_model artifact creation before larger runs.
Co-authored-by: OpenAI Codex <codex@openai.com>
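As a rough sketch of what a final_model artifact check with a strict mode might involve, the example below looks for a directory containing `config.json` and at least one weight file. The Hugging Face-style layout, the `WEIGHT_SUFFIXES` tuple, and the `check_final_model` name are all assumptions made for illustration; the checks the PR actually performs are not shown in this conversation.

```python
from pathlib import Path

# Assumed weight-file extensions for a Hugging Face-style checkpoint.
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def check_final_model(model_dir: Path, strict: bool = False) -> list[str]:
    """Return a list of problems found in model_dir (hypothetical helper).

    In strict mode any problem raises ValueError instead of being returned.
    """
    problems: list[str] = []
    if not model_dir.is_dir():
        problems.append(f"missing directory: {model_dir}")
    else:
        if not (model_dir / "config.json").is_file():
            problems.append("missing config.json")
        has_weights = any(
            p.is_file() and p.suffix in WEIGHT_SUFFIXES
            for p in model_dir.iterdir()
        )
        if not has_weights:
            problems.append("no weight files found")
    if strict and problems:
        raise ValueError("; ".join(problems))
    return problems
```

Returning the problem list in non-strict mode lets a small validation matrix report every failing model in one pass, while strict mode fails fast before a larger run.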
Stop running the local post-run secret scanner in the PTB wrapper so benchmark rows are judged by PTB integrity, final-model precheck, and metrics only. Remove secret-scan artifact collection/docs and obsolete tests.
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Claude finished @lewtun's task in 6m 20s
No blocking issues (2 P1)
Verdict: ready to merge
What I checked
P1: Streaming retries can replay assistant content to the consumer
What is not rolled back on retry are the
If this is acceptable, consider at least emitting a single
P1: Redact regex no longer scrubs bare/lowercase secret-y env names
The value-side patterns (
P2:
Claude finished @lewtun's task in 3m 2s
Re-review after 7a7c1b7 / 4c05cd5: no blocking issues
Verdict: ready to merge.
Status of prior P1 / P2 findings
New commits scanned: 7a7c1b7, 4c05cd5. No new P0/P1 findings introduced.
Tasks
No description provided.