[WIP] Integrate PostTrainBench #254
Open
lewtun wants to merge 50 commits into main from post-train-bench-lewis
50 commits
9b3f93a Add PostTrainBench Docker evaluation runner (lewtun)
dc63bec Default PostTrainBench agent model (lewtun)
a985dc5 Merge branch 'main' into post-train-bench-lewis (lewtun)
92c3d38 Fix PostTrainBench container agent launch (lewtun)
3c33e73 Include Slurm job id in PostTrainBench run ids (lewtun)
e8aa2a9 Document PostTrainBench run artifact tree (lewtun)
23094cc Make smoke PostTrainBench runs five minutes (lewtun)
9f49a5d Use shorter PostTrainBench config names (lewtun)
e908c47 Add MCP (lewtun)
17ee55c Export PostTrainBench source snapshot path (lewtun)
12b3cef Set descriptive PostTrainBench Slurm task names (lewtun)
3fe6a74 Set PostTrainBench Slurm time by mode (lewtun)
6a8d353 Limit PostTrainBench smoke evaluation (lewtun)
f025068 Shorten PostTrainBench Slurm job names (lewtun)
d339d32 Reference final model files in PostTrainBench artifacts (lewtun)
0204244 Add PTB prompt (lewtun)
f5fd9f2 Merge branch 'main' into post-train-bench-lewis (lewtun)
103a834 Amend system prompt (lewtun)
48d96da Harden PostTrainBench runner isolation (lewtun)
1478bb2 Fix PostTrainBench eval image build (lewtun)
84e0373 Fix Codex judge CLI option ordering (lewtun)
066cc15 Fix Codex judge API key auth (lewtun)
0e49b36 Make PostTrainBench model validation strict (lewtun)
f2c4e43 Fix config (lewtun)
bbc6055 Harden PostTrainBench integrity checks (lewtun)
c2cc788 Detect PostTrainBench harness tampering (lewtun)
7904837 Harden PostTrainBench runner finalization (lewtun)
abd1a76 Install benchmark agent from writable build copy (lewtun)
3c8dc8b Use task budget for measured solve timeout (lewtun)
2f90b2b Make smoke budget strict and realistic (lewtun)
746c3df Make PostTrainBench smoke deterministic (lewtun)
0a7b23d Add PostTrainBench validation reprompt variant (lewtun)
aa6e5d4 Add per-model PostTrainBench validation mode (lewtun)
d4fe09d Make PostTrainBench integrity runner Python-compatible (lewtun)
d0b39be Tighten PostTrainBench final model prompt contract (lewtun)
2f7db44 Add ten-job PostTrainBench smoke mode (lewtun)
8ceb9d9 Add PostTrainBench array throttle option (lewtun)
4619b63 Strengthen PostTrainBench reprompt recovery (lewtun)
c1778a8 Avoid false secret scan hits on redacted env logs (lewtun)
e14612e Guard PostTrainBench against broad process kills (lewtun)
39fb2f3 Retry transient streaming LLM failures (lewtun)
e88bac7 Ignore PTB bytecode caches in integrity checks (lewtun)
f2246b6 Merge origin/main into post-train-bench-lewis (lewtun)
6f947e4 Fix PTB bytecode cleanup find command (lewtun)
0067b15 Remove PTB secret scan gate (lewtun)
84c96d7 Match PTB baseline fallback scoring (lewtun)
e754864 Increase PTB full run walltime (lewtun)
d8b4915 Merge origin/main into post-train-bench-lewis (lewtun)
4c05cd5 Update CLI rendering tests for config forwarding (lewtun)
7a7c1b7 Address review feedback on streaming and PTB aggregation (lewtun)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -71,5 +71,6 @@ datasets/ |
| | | models/ |
| | | checkpoint-*/ |
| | | runs/ |
| | | post_train_bench/runs/ |
| | | wandb/ |
| | | frontend/tsconfig.tsbuildinfo |