fix(grading): strip markdown code fences from judge JSON responses #52
Open
thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main from
Conversation
The grading judge LLM call sets response_format to json_object, which works natively for OpenAI and Gemini. For Anthropic, LiteLLM simulates this via a system prompt hint rather than provider-side enforcement, and Claude models still wrap structured JSON in markdown fences. The parser at output_llm/main.py and output_llm/negative_criteria.py calls json.loads / model_validate_json on the raw response and fails after 10 retries.

This adds a strip_json_fences helper applied at both parse sites. The helper is idempotent: bare JSON passes through unchanged, so Gemini and OpenAI behavior is unaffected.

Reproduction: configure grading_settings.json to use any anthropic/claude-* model and run any task. Without this fix, every trajectory grades as 0.0 with 'Invalid JSON after 10 attempts'. Tested with anthropic/claude-sonnet-4-5 and anthropic/claude-haiku-4-5 on examples/simple_task end-to-end.
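For illustration (not taken from the repo's code or logs), this is the shape of the failing response: the JSON inside the fences is valid, but json.loads sees the backticks first.

```python
import json

# Illustrative only: a judge response wrapped in markdown fences, as Claude
# tends to return it, versus the bare JSON the parser expects.
fenced = '```json\n{"final_score": 1.0}\n```'
bare = '{"final_score": 1.0}'

try:
    json.loads(fenced)
except json.JSONDecodeError as exc:
    print("fenced response fails to parse:", exc)

print(json.loads(bare))  # bare JSON parses fine
```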
Force-pushed from 33b1f90 to 8da7683
thisisvk45 added a commit to thisisvk45/archipelago that referenced this pull request on Apr 27, 2026
…wer log test

The agents/README.md had drifted from the codebase. The 'Available agent IDs' section listed loop_agent, toolbelt_agent, and singleshot_agent, but runner/agents/registry.py only registers loop_agent and react_toolbelt_agent. The 'Creating a New Agent' snippet used the wrong parameter name (input vs run_input) and a truncated registry snippet that implied overwriting existing entries rather than adding to them. The README also referenced tests/test_final_answer_log.py as enforcing the agent contract, but that file did not exist.

This PR:
1. Resyncs agents/README.md with current symbol names. Removes references to phantom agents. Fixes the 'Creating a New Agent' snippet to use run_input and the dict-update registration style (sketched below). Updates the stale anthropic/claude-3-5-sonnet-20241022 model string to anthropic/claude-opus-4-5 to match paper Table 9.
2. Adds runner/agents/echo_agent as a 60-line reference implementation. It does not call any LLM or connect to MCP and is intended as the canonical hello-world for new contributors.
3. Adds tests/test_final_answer_log.py with three tight tests: every AgentConfigIds value has an AGENT_REGISTRY entry, every registered agent_impl is an async callable, and echo_agent emits exactly one final_answer log when run end-to-end. A follow-up PR can mock LiteLLM to extend end-to-end coverage to loop_agent and react_toolbelt_agent.
4. Adds CONTRIBUTING-AGENTS.md codifying the agent contract as a one-page checklist.

Out of scope: typed final_answer field on AgentTrajectoryOutput, RunManifest replay harness for reproducibility issues Mercor-Intelligence#4 and Mercor-Intelligence#8. The grading-side fence-stripping fix is in Mercor-Intelligence#52.
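The dict-update registration style referenced in item 1 might look roughly like the sketch below. Apart from the AGENT_REGISTRY, run_input, and echo_agent names taken from the commit message, everything here is hypothetical and may not match runner/agents/registry.py.

```python
import asyncio

# Hypothetical stand-in for the registry in runner/agents/registry.py.
AGENT_REGISTRY: dict = {}

async def echo_agent(run_input: str) -> dict:
    # Reference agent per the commit message: no LLM call, no MCP
    # connection, exactly one final_answer log entry.
    return {"logs": [{"type": "final_answer", "content": run_input}]}

# dict-update style: add to the registry rather than overwrite it.
AGENT_REGISTRY.update({"echo_agent": echo_agent})

if __name__ == "__main__":
    print(asyncio.run(AGENT_REGISTRY["echo_agent"]("hello world")))
```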
Author
Verification update: ran the fix across the Claude judge family on examples/simple_task.

[table: runs with each of the three Claude judge models, with and without the fix]

All three judges return fenced JSON without the fix and parse cleanly with it. The 8 unit tests in this PR cover the helper, and these runs cover the integration path end-to-end across the Claude family. Without the fix, every run in this table would have produced final_score: 0.0.
Summary
The grading judge call sets response_format={"type": "json_object"}, which is honored natively by OpenAI and Gemini providers. For Anthropic, LiteLLM simulates this via prompt-level instruction rather than provider-side enforcement, and Claude models still wrap structured output in markdown code fences. The result: every grading attempt with a Claude judge fails parsing, retries 10 times producing identical fenced output, then raises ValueError("Invalid JSON after 10 attempts"). Trajectories grade as 0.0 regardless of agent quality.

This patch adds a small, idempotent strip_json_fences() helper applied at both judge-response parse sites. Bare JSON passes through unchanged, so behavior for Gemini and OpenAI users is unaffected.

Why this matters
Anyone running Archipelago with their own Anthropic key (the most common case for external researchers verifying leaderboard scores) hits this on the first task and assumes the agent failed when actually only the judge failed. The agent trajectory is correct; only the parse step breaks. This is plausibly a contributor to reproducibility discrepancies reported in #4 and #8 when reviewers attempt verification with different judge models.
Reproduction (before this fix)
1. Set examples/simple_task/grading_settings.json to any anthropic/claude-* model.
2. Run examples/simple_task/run.sh.
3. Observe grades.json showing final_score: 0.0 with the log line "Invalid JSON after 10 attempts".

The JSON itself is valid; only the markdown wrapping breaks the parser.
Diagnosis
Two affected call sites in grading/runner/evals/output_llm/:

- main.py:326 calls json.loads(raw_content) inside a 10-retry loop. Falls through to model_validate_json at line 339, which also fails.
- negative_criteria.py:196 calls model_validate_json(neg_raw_content) with no retry loop. The first fenced response is fatal.

The prompt at utils/prompts.py:253 says "Respond with a JSON object" but does not constrain markdown formatting. Anthropic models interpret this as "format JSON readably" and fence it. Gemini happens to comply with bare output.
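To make the second failure mode concrete, here is a minimal stand-in (not the repo's code; the model and field names are invented) showing that Pydantic's model_validate_json also rejects a fenced payload even though the embedded JSON is valid:

```python
from pydantic import BaseModel, ValidationError

# Stand-in schema for illustration only; not the schema used in
# negative_criteria.py.
class JudgeVerdict(BaseModel):
    final_score: float

fenced = '```json\n{"final_score": 1.0}\n```'

try:
    JudgeVerdict.model_validate_json(fenced)
except ValidationError as exc:
    # Pydantic reports invalid JSON because the payload starts with the
    # markdown fence rather than '{'.
    print(exc.errors()[0]["type"])  # typically "json_invalid"
```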
Fix

New helper at grading/runner/evals/output_llm/utils/json_utils.py, applied at both parse sites. ~5 production lines. No changes to the LLM call interface, retry logic, or prompts.
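A minimal sketch consistent with the behavior described here (strip a surrounding markdown fence, leave bare JSON untouched, stay idempotent); the actual 26-line json_utils.py may differ:

```python
import re

def strip_json_fences(raw: str) -> str:
    """Strip a surrounding markdown code fence (with or without a json tag).

    Sketch only: bare JSON passes through unchanged, and applying the
    function twice yields the same result (idempotent).
    """
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1).strip() if match else text

# At the parse sites, the call wraps the raw judge response, e.g.:
#   parsed = json.loads(strip_json_fences(raw_content))
```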
Testing
- grading/tests/test_json_utils.py with 8 unit tests covering: bare JSON, fenced JSON (with and without the json language tag), trailing whitespace, empty input, whitespace-only input, idempotency, and the negative case where backticks appear inside legitimate JSON values.
- End-to-end run of examples/simple_task with an anthropic/claude-opus-4-5 agent and an anthropic/claude-haiku-4-5-20251001 judge: trajectory.status: completed, final_score: 1.0.
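A hypothetical excerpt of the kind of unit tests listed above (pytest style); the import path and exact cases are illustrative, not copied from grading/tests/test_json_utils.py:

```python
# Illustrative tests matching the described behavior; not the PR's file.
from json_utils import strip_json_fences  # import path is an assumption

def test_bare_json_passes_through():
    assert strip_json_fences('{"final_score": 1.0}') == '{"final_score": 1.0}'

def test_fenced_json_is_unwrapped():
    fenced = '```json\n{"final_score": 1.0}\n```'
    assert strip_json_fences(fenced) == '{"final_score": 1.0}'

def test_idempotent():
    once = strip_json_fences('```json\n{"final_score": 1.0}\n```')
    assert strip_json_fences(once) == once
```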
Scope

Intentionally minimal. Does not change the prompt to ask Claude not to fence (provider-dependent and fragile), does not migrate to Pydantic structured output mode (more invasive, and would conflict with negative_criteria.py's existing schema-based path), and does not touch the retry loop. The defensive parse helper is the smallest surface area that fixes the actual observed failure across all providers.

Files changed
- grading/runner/evals/output_llm/utils/json_utils.py (new, 26 lines)
- grading/runner/evals/output_llm/main.py (2 line changes + 1 import)
- grading/runner/evals/output_llm/negative_criteria.py (1 line change + 1 import)
- grading/tests/__init__.py (new, empty)
- grading/tests/test_json_utils.py (new, 8 tests)