
feat(agents): add RunManifest, --seed, --deterministic, and replay stub #54

Draft
thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main from thisisvk45:feat/run-manifest-and-replay

Conversation

@thisisvk45

Status: DRAFT

Opening as a draft. PRs #52 and #53 are still under review, so I'm not asking for review attention here yet. Pushing this up so the design is visible if it becomes useful.

Summary

Adds reproducibility primitives to the agent runner so trajectory runs can be inspected and compared. The motivation is open issues #4 (Kimi K2 Thinking Law domain) and #8 (GLM4.7 mean score), both of which report reproduction discrepancies that are very hard to root-cause without a record of the inputs that produced a run. Capturing the full input set is the cheapest possible fix and unblocks community verification of leaderboard scores.

What's added

RunManifest sidecar (runner/manifest.py)

A Pydantic model written next to trajectory.json as *.manifest.json. Captures:

  • Code provenance: git SHA, git dirty state, Python version
  • Run config: trajectory ID, agent config ID and values, orchestrator model, full extra_args, seed, deterministic flag
  • Inputs/outputs: MCP server config hash, initial/final snapshot SHA-256 digests
  • Schema-versioned (schema_version: 1) so future fields don't break consumers
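As a rough sketch of the shape (field names here are illustrative, not the PR's actual definitions, and this uses a stdlib dataclass instead of the PR's Pydantic model to stay dependency-free):

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunManifest:
    """Sidecar record of the inputs that produced a trajectory run."""
    # Code provenance / run config (no defaults first, per dataclass rules)
    python_version: str
    trajectory_id: str
    orchestrator_model: str
    schema_version: int = 1          # bump when fields change shape
    git_sha: Optional[str] = None
    git_dirty: Optional[bool] = None
    extra_args: dict = field(default_factory=dict)
    seed: Optional[int] = None
    deterministic: bool = False
    # Inputs/outputs
    mcp_server_configs_hash: Optional[str] = None
    initial_snapshot_sha256: Optional[str] = None
    final_snapshot_sha256: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def hash_configs(configs) -> str:
    """Stable digest of MCP server configs: canonical JSON (sorted keys)."""
    payload = json.dumps(configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()
```

Canonicalizing the JSON before hashing is what makes the config hash deterministic across builds with identical input.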

--seed and --deterministic flags

Both added to runner.main. --deterministic requires --seed and forces temperature=0 in orchestrator_extra_args (preserving any existing args, not clobbering). Both values are captured in the manifest regardless of whether the LLM provider honors them.
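The flag semantics can be sketched like this; the function names and wiring are hypothetical, only the flag names, the "requires --seed" rule, and the non-clobbering merge come from the PR:

```python
import argparse


def parse_run_args(argv):
    """Parse the two reproducibility flags added to runner.main."""
    parser = argparse.ArgumentParser(prog="runner.main")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--deterministic", action="store_true")
    args = parser.parse_args(argv)
    if args.deterministic and args.seed is None:
        parser.error("--deterministic requires --seed")
    return args


def apply_determinism(extra_args: dict, args) -> dict:
    """Merge determinism settings into orchestrator extra args."""
    merged = dict(extra_args)  # preserve existing args, don't clobber
    if args.seed is not None:
        merged["seed"] = args.seed
    if args.deterministic:
        merged["temperature"] = 0
    return merged
```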

runner.replay subcommand (stub)

Validates a manifest and warns if git_sha has drifted from the current HEAD. It does not attempt snapshot restoration; that requires environment-side coordination (snapshot upload/download with proper SHA matching) and is out of scope for this PR. The manifest itself is the primary contribution; the stub is the affordance for future work.

Determinism caveats (also in README)

  • LLM-side determinism via seed is best-effort. Anthropic and Gemini accept seed via LiteLLM's translation layer but do not guarantee bitwise reproducibility.
  • MCP servers with non-deterministic tools (network, time) remain a source of variance. The manifest captures the config hash so divergence is at least localizable.
  • Snapshot SHAs let downstream graders verify they're scoring the same artifacts the original run produced.

Why this design

I considered three approaches:

  1. Full deterministic replay with snapshot upload/download. Out of scope: it touches environment/ and grading/ and would be a much larger PR.
  2. Provider-locked seed enforcement, rejecting runs where the provider doesn't natively support seed. Too restrictive given LiteLLM's coverage.
  3. Capture-only manifest with replay stub. The smallest surface area that meaningfully addresses #4 (Kimi K2.5 Thinking scores lower than reported on the Law domain) and #8 (inconsistency in the GLM4.7 official reported mean score), while leaving room for follow-up PRs to add real replay.

Went with option 3.

Testing

tests/test_manifest.py adds 6 unit tests:

  • Serialization to JSON with all expected fields
  • mcp_server_configs_hash is deterministic across builds with identical input
  • Hash changes when config changes
  • attach_snapshots computes correct SHA-256
  • Missing snapshot path is handled silently (returns None, doesn't raise)
  • Manifest builds successfully when git is unavailable (CI without git)

All pass. No regressions in existing tests.
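The two hashing tests can be sketched along these lines (config_hash here is a self-contained stand-in for the PR's mcp_server_configs_hash logic):

```python
import hashlib
import json


def config_hash(configs) -> str:
    # Same canonicalization idea: sorted keys, compact separators.
    payload = json.dumps(configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()


def test_hash_is_deterministic():
    cfg = [{"server": "fs", "args": ["--root", "/tmp"]}]
    # A JSON round-trip simulates rebuilding the config in a fresh process.
    assert config_hash(cfg) == config_hash(json.loads(json.dumps(cfg)))


def test_hash_changes_with_config():
    assert config_hash([{"server": "fs"}]) != config_hash([{"server": "web"}])
```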

Out of scope (follow-ups)

  • Snapshot restoration in replay (requires environment-side changes)
  • LiteLLM seed verification per-provider (provider matrix work)
  • Mocked end-to-end runner test exercising the full flag set

Files changed

  • agents/runner/manifest.py (new, ~120 lines)
  • agents/runner/replay.py (new, ~50 lines, stub)
  • agents/runner/main.py (modified: 2 new flags + manifest wiring)
  • agents/tests/__init__.py (new, empty)
  • agents/tests/test_manifest.py (new, 6 tests)
  • agents/README.md (new Reproducibility section)

Refs #4, #8.
