feat(agents): add RunManifest, --seed, --deterministic, and replay stub #54
Draft
thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main
Status: DRAFT
Opening as a draft: PRs #52 and #53 are still under review, so I'm not asking for review attention here yet. Pushing this up so the design is visible if it becomes useful.
Summary
Adds reproducibility primitives to the agent runner so trajectory runs can be inspected and compared. The motivation is open issues #4 (Kimi K2 Thinking Law domain) and #8 (GLM4.7 mean score), which both report reproduction discrepancies that are hard to root-cause without a record of the inputs that produced a run. Capturing the full input set is the cheapest possible fix and unblocks community verification of leaderboard scores.
What's added
1. **`RunManifest` sidecar (`runner/manifest.py`).** A Pydantic model written next to `trajectory.json` as `*.manifest.json`. Captures git SHA, git dirty state, Python version, agent config, model string, full `extra_args`, seed, deterministic flag, MCP server config hash, and initial/final snapshot SHA-256 digests. Schema-versioned (`schema_version: 1`) so future fields don't break consumers.
2. **`--seed` and `--deterministic` flags.** Both added to `runner.main`. `--deterministic` requires `--seed` and forces `temperature=0` in `orchestrator_extra_args` (preserving any existing args, not clobbering). Both values are captured in the manifest regardless of whether the LLM provider honors them.
3. **`runner.replay` subcommand (stub).** Validates a manifest and warns if `git_sha` has drifted from current HEAD. Does not attempt snapshot restoration; that requires environment-side coordination (snapshot upload/download with proper SHA matching) and is out of scope for this PR. The manifest itself is the primary contribution; the stub is the affordance for future work.
4. **Determinism caveats (also in README).** `seed` is best-effort. Anthropic and Gemini accept `seed` via LiteLLM's translation layer but do not guarantee bitwise reproducibility.
Why this design

I considered three approaches and went with option 3.
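The flag semantics from "What's added" (item 2) can be sketched as follows; everything beyond `--seed` and `--deterministic` themselves, including how `orchestrator_extra_args` is passed, is an assumption for illustration:

```python
import argparse
import json

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="runner")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--deterministic", action="store_true")
    # Assumed shape: extra LLM args arrive as a JSON dict string.
    parser.add_argument("--orchestrator-extra-args", default="{}")
    args = parser.parse_args(argv)

    if args.deterministic and args.seed is None:
        parser.error("--deterministic requires --seed")

    extra = json.loads(args.orchestrator_extra_args)
    if args.deterministic:
        # Force temperature=0 while preserving any keys the caller
        # already passed (merge, don't clobber).
        extra["temperature"] = 0
    args.orchestrator_extra_args = extra
    return args
```

The key design point is the merge: a caller who already set, say, `top_p` keeps it, and only `temperature` is overridden when `--deterministic` is on.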
Testing
`tests/test_manifest.py` adds 6 unit tests:

- manifest serialization
- `mcp_server_configs_hash` is deterministic across builds with identical input
- hash changes when the config changes
- `attach_snapshots` computes the correct SHA-256
- missing-snapshot handling
- graceful behavior when not in a git repo

All pass. No regressions in existing tests.
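The two hashing properties in that list look roughly like this as tests (test and helper names here are hypothetical, not copied from the PR):

```python
import hashlib
import json

def hash_mcp_configs(configs):
    # Same canonicalization as the manifest: sorted keys, fixed separators.
    canonical = json.dumps(configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def test_hash_is_deterministic():
    a = hash_mcp_configs([{"name": "fs", "port": 8080}])
    b = hash_mcp_configs([{"port": 8080, "name": "fs"}])
    assert a == b  # key order must not affect the digest

def test_hash_changes_on_config_change():
    a = hash_mcp_configs([{"name": "fs", "port": 8080}])
    b = hash_mcp_configs([{"name": "fs", "port": 8081}])
    assert a != b  # any config change must change the digest
```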
Out of scope (follow-ups)

- Snapshot restoration in replay (requires environment changes)
- Mocked end-to-end runner test (covered by existing harness)
- Provider-side determinism guarantees (caveats documented in README)
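For reference, the in-scope part of the replay stub (the `git_sha` drift warning) amounts to something like the sketch below. Function names are hypothetical; the comparison is split out from the git call so it can be exercised without a repository:

```python
import subprocess

def current_head() -> str:
    # Resolve the repo's current commit; raises if not inside a git repo.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def has_drifted(manifest_git_sha: str, head: str) -> bool:
    # Pure comparison, testable without git.
    return head != manifest_git_sha

def warn_on_drift(manifest_git_sha: str) -> None:
    head = current_head()
    if has_drifted(manifest_git_sha, head):
        print(f"warning: manifest git_sha {manifest_git_sha[:12]} "
              f"!= HEAD {head[:12]}; results may not reproduce")
```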
Files changed

- `agents/runner/manifest.py` (new, ~120 lines)
- `agents/runner/replay.py` (new, ~50 lines, stub)
- `agents/runner/main.py` (modified: 2 new flags + manifest wiring)
- `agents/tests/__init__.py` (new, empty)
- `agents/tests/test_manifest.py` (new, 6 tests)
- `agents/README.md` (new Reproducibility section)

Refs #4, #8.