
feat(agents): add RunManifest, --seed, --deterministic, and replay stub #54

Draft
thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main from thisisvk45:feat/run-manifest-and-replay

Conversation

@thisisvk45

Status: DRAFT

Opening as a draft. PRs #52 and #53 are still under review, so I'm not asking for review attention here yet. Pushing this up so the design is visible if it becomes useful.

Summary

Adds reproducibility primitives to the agent runner so trajectory runs can be inspected and compared. The motivation is open issues #4 (Kimi K2 Thinking Law domain) and #8 (GLM4.7 mean score), both of which report reproduction discrepancies that are very hard to root-cause without a record of the inputs that produced a run. Capturing the full input set is the cheapest possible fix and unblocks community verification of leaderboard scores.

What's added

RunManifest sidecar (runner/manifest.py)

A Pydantic model written next to trajectory.json as *.manifest.json. Captures:

  • Code provenance: git SHA, git dirty state, Python version
  • Run config: trajectory ID, agent config ID and values, orchestrator model, full extra_args, seed, deterministic flag
  • Inputs/outputs: MCP server config hash, initial/final snapshot SHA-256 digests
  • Schema-versioned (schema_version: 1) so future fields don't break consumers
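As a rough sketch of the shape (field names here are illustrative, not the PR's actual definitions, and this uses a stdlib dataclass instead of the PR's Pydantic model to stay dependency-free):

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunManifest:
    """Sidecar record of the inputs that produced a trajectory run."""
    # Code provenance / run config (no defaults first, per dataclass rules)
    python_version: str
    trajectory_id: str
    orchestrator_model: str
    schema_version: int = 1          # bump when fields change shape
    git_sha: Optional[str] = None
    git_dirty: Optional[bool] = None
    extra_args: dict = field(default_factory=dict)
    seed: Optional[int] = None
    deterministic: bool = False
    # Inputs/outputs
    mcp_server_configs_hash: Optional[str] = None
    initial_snapshot_sha256: Optional[str] = None
    final_snapshot_sha256: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def hash_configs(configs) -> str:
    """Stable digest of MCP server configs: canonical JSON (sorted keys)."""
    payload = json.dumps(configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()
```

Canonicalizing the JSON before hashing is what makes the config hash deterministic across builds with identical input.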

--seed and --deterministic flags

Both added to runner.main. --deterministic requires --seed and forces temperature=0 in orchestrator_extra_args (preserving any existing args, not clobbering). Both values are captured in the manifest regardless of whether the LLM provider honors them.
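The flag semantics can be sketched like this; the function names and wiring are hypothetical, only the flag names, the "requires --seed" rule, and the non-clobbering merge come from the PR:

```python
import argparse


def parse_run_args(argv):
    """Parse the two reproducibility flags added to runner.main."""
    parser = argparse.ArgumentParser(prog="runner.main")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--deterministic", action="store_true")
    args = parser.parse_args(argv)
    if args.deterministic and args.seed is None:
        parser.error("--deterministic requires --seed")
    return args


def apply_determinism(extra_args: dict, args) -> dict:
    """Merge determinism settings into orchestrator extra args."""
    merged = dict(extra_args)  # preserve existing args, don't clobber
    if args.seed is not None:
        merged["seed"] = args.seed
    if args.deterministic:
        merged["temperature"] = 0
    return merged
```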

runner.replay subcommand (stub)

Validates a manifest and warns if git_sha has drifted from the current HEAD. It does not attempt snapshot restoration; that requires environment-side coordination (snapshot upload/download with proper SHA matching) and is out of scope for this PR. The manifest itself is the primary contribution; the stub is the affordance for future work.

Determinism caveats (also in README)

  • LLM-side determinism via seed is best-effort. Anthropic and Gemini accept seed via LiteLLM's translation layer but do not guarantee bitwise reproducibility.
  • MCP servers with non-deterministic tools (network, time) remain a source of variance. The manifest captures the config hash so divergence is at least localizable.
  • Snapshot SHAs let downstream graders verify they're scoring the same artifacts the original run produced.

Why this design

I considered three approaches:

  1. Full deterministic replay with snapshot upload/download. Out of scope: it touches environment/ and grading/ and would be a much larger PR.
  2. Provider-locked seed enforcement, rejecting runs where the provider doesn't natively support seed. Too restrictive given LiteLLM's coverage.
  3. Capture-only manifest with replay stub. The smallest surface area that meaningfully addresses #4 (Kimi K2.5 Thinking scores lower than reported on the Law domain) and #8 (inconsistency in the GLM4.7 official reported mean score), while leaving room for follow-up PRs to add real replay.

Went with option 3.

Testing

tests/test_manifest.py adds 6 unit tests:

  • Serialization to JSON with all expected fields
  • mcp_server_configs_hash is deterministic across builds with identical input
  • Hash changes when config changes
  • attach_snapshots computes correct SHA-256
  • Missing snapshot path is handled silently (returns None, doesn't raise)
  • Manifest builds successfully when git is unavailable (CI without git)

All pass. No regressions in existing tests.
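The two hashing tests can be sketched along these lines (config_hash here is a self-contained stand-in for the PR's mcp_server_configs_hash logic):

```python
import hashlib
import json


def config_hash(configs) -> str:
    # Same canonicalization idea: sorted keys, compact separators.
    payload = json.dumps(configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()


def test_hash_is_deterministic():
    cfg = [{"server": "fs", "args": ["--root", "/tmp"]}]
    # A JSON round-trip simulates rebuilding the config in a fresh process.
    assert config_hash(cfg) == config_hash(json.loads(json.dumps(cfg)))


def test_hash_changes_with_config():
    assert config_hash([{"server": "fs"}]) != config_hash([{"server": "web"}])
```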

Out of scope (follow-ups)

  • Snapshot restoration in replay (requires environment-side changes)
  • LiteLLM seed verification per-provider (provider matrix work)
  • Mocked end-to-end runner test exercising the full flag set

Files changed

  • agents/runner/manifest.py (new, ~120 lines)
  • agents/runner/replay.py (new, ~50 lines, stub)
  • agents/runner/main.py (modified: 2 new flags + manifest wiring)
  • agents/tests/__init__.py (new, empty)
  • agents/tests/test_manifest.py (new, 6 tests)
  • agents/README.md (new Reproducibility section)

Refs #4, #8.
