fix(grading): strip markdown code fences from judge JSON responses #52

Open

thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main from thisisvk45:fix/grading-strip-json-fences

Conversation

@thisisvk45

Summary

The grading judge call sets response_format={"type": "json_object"}, which is honored natively by OpenAI and Gemini providers. For Anthropic, LiteLLM simulates this via prompt-level instruction rather than provider-side enforcement, and Claude models still wrap structured output in markdown code fences. The result: every grading attempt with a Claude judge fails parsing, retries 10 times producing identical fenced output, then raises ValueError("Invalid JSON after 10 attempts"). Trajectories grade as 0.0 regardless of agent quality.

This patch adds a small, idempotent strip_json_fences() helper applied at both judge-response parse sites. Bare JSON passes through unchanged, so behavior for Gemini and OpenAI users is unaffected.
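For context, the judge call has roughly this shape (a schematic sketch, not the repo's exact code; the model string and message content are placeholders):

import litellm

# Schematic of the judge call. OpenAI and Gemini enforce response_format
# provider-side; for Anthropic, LiteLLM falls back to a prompt-level
# instruction, so the model may still return fenced output.
response = litellm.completion(
    model="anthropic/claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Respond with a JSON object ..."}],
    response_format={"type": "json_object"},
)
raw_content = response.choices[0].message.content  # may be '```json\n{...}\n```'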

Why this matters

Anyone running Archipelago with their own Anthropic key (the most common case for external researchers verifying leaderboard scores) hits this on the first task and concludes the agent failed, when in fact the agent trajectory is correct and only the judge's parse step broke. This is plausibly a contributor to the reproducibility discrepancies reported in #4 and #8 when reviewers attempt verification with different judge models.

Reproduction (before this fix)

  1. Set examples/simple_task/grading_settings.json to any anthropic/claude-* model.
  2. Run examples/simple_task/run.sh.
  3. Observe grades.json showing final_score: 0.0 with the log line:
[JUDGE] JSON retry 1/10: 1 validation error for GradingResponseSchema
  Invalid JSON: expected value at line 1 column 1
  [type=json_invalid, input_value='```json\n{\n  "rationale"...,
  "is_criteria_true": true\n}\n```', input_type=str]

The JSON itself is valid; only the markdown wrapping breaks the parser.

Diagnosis

Two affected call sites in grading/runner/evals/output_llm/:

  1. main.py:326 calls json.loads(raw_content) inside a 10-retry loop; on failure it falls through to model_validate_json at line 339, which fails on the same fenced input.
  2. negative_criteria.py:196 calls model_validate_json(neg_raw_content) with no retry loop, so the first fenced response is fatal.

The prompt at utils/prompts.py:253 says "Respond with a JSON object" but does not constrain markdown formatting. Anthropic models interpret this as "format JSON readably" and fence it. Gemini happens to comply with bare output.
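The failure is reproducible outside the harness in two lines; json.loads rejects the fenced string at the first backtick:

import json

fenced = '```json\n{"rationale": "ok", "is_criteria_true": true}\n```'
json.loads(fenced)
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This matches the "expected value at line 1 column 1" error in the log above.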

Fix

New helper at grading/runner/evals/output_llm/utils/json_utils.py:

def strip_json_fences(text: str) -> str:
    """Strip Markdown code fences from an LLM JSON response.

    Idempotent: bare JSON passes through unchanged.
    """

Applied at both parse sites. ~5 production lines. No changes to LLM call interface, retry logic, or prompts.
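For reference, a minimal sketch of what such a helper can look like (illustrative, not the PR's verbatim implementation; it assumes the fence wraps the entire payload):

def strip_json_fences(text: str) -> str:
    """Strip Markdown code fences from an LLM JSON response.

    Idempotent: bare JSON passes through unchanged.
    """
    stripped = text.strip()
    if stripped.startswith("```"):
        # Drop the opening fence line ("```" or "```json").
        first_newline = stripped.find("\n")
        stripped = stripped[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if stripped.rstrip().endswith("```"):
            stripped = stripped.rstrip()[:-3]
    return stripped.strip()

At the call sites the change is then a one-liner along the lines of json.loads(strip_json_fences(raw_content)).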

Testing

  • Added grading/tests/test_json_utils.py with 8 unit tests covering: bare JSON, fenced JSON (json-tagged and plain fences), trailing whitespace, empty input, whitespace-only input, idempotency, and the negative case where backticks appear inside legitimate JSON values. A condensed sketch follows this list.
  • All 8 pass under pytest.
  • Verified end-to-end on examples/simple_task with anthropic/claude-opus-4-5 agent and anthropic/claude-haiku-4-5-20251001 judge: trajectory.status: completed, final_score: 1.0.
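A condensed sketch of tests in that spirit (names, cases, and the import path are illustrative; the PR's actual test file may differ):

import pytest
# Import path assumed from the file layout described below.
from grading.runner.evals.output_llm.utils.json_utils import strip_json_fences

BARE = '{"is_criteria_true": true}'

@pytest.mark.parametrize("raw", [
    BARE,                           # bare JSON passes through
    f"```json\n{BARE}\n```",        # json-tagged fence
    f"```\n{BARE}\n```",            # plain fence
    f"```json\n{BARE}\n```  \n",    # trailing whitespace
])
def test_fences_stripped(raw):
    assert strip_json_fences(raw) == BARE

def test_idempotent():
    once = strip_json_fences(f"```json\n{BARE}\n```")
    assert strip_json_fences(once) == once

def test_backticks_inside_values_untouched():
    raw = '{"rationale": "use ``` for code blocks"}'
    assert strip_json_fences(raw) == raw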

Scope

Intentionally minimal. Does not change the prompt to ask Claude not to fence (provider-dependent and fragile), does not migrate to Pydantic structured output mode (more invasive, would conflict with negative_criteria.py's existing schema-based path), and does not touch the retry loop. The defensive parse helper is the smallest surface area that fixes the actual observed failure across all providers.

Files changed

  • grading/runner/evals/output_llm/utils/json_utils.py (new, 26 lines)
  • grading/runner/evals/output_llm/main.py (2 line changes + 1 import)
  • grading/runner/evals/output_llm/negative_criteria.py (1 line change + 1 import)
  • grading/tests/__init__.py (new, empty)
  • grading/tests/test_json_utils.py (new, 8 tests)

Commit message

The grading judge LLM call sets response_format to json_object, which
works natively for OpenAI and Gemini. For Anthropic, LiteLLM simulates
this via a system prompt hint rather than provider-side enforcement,
and Claude models still wrap structured JSON in markdown fences. The
parser at output_llm/main.py and output_llm/negative_criteria.py calls
json.loads / model_validate_json on the raw response and fails after
10 retries.

This adds a strip_json_fences helper applied at both parse sites. The
helper is idempotent: bare JSON passes through unchanged, so Gemini
and OpenAI behavior is unaffected.

Reproduction: configure grading_settings.json to use any
anthropic/claude-* model and run any task. Without this fix every
trajectory grades as 0.0 with 'Invalid JSON after 10 attempts'.

Tested with anthropic/claude-sonnet-4-5 and anthropic/claude-haiku-4-5
on examples/simple_task end-to-end.
thisisvk45 force-pushed the fix/grading-strip-json-fences branch from 33b1f90 to 8da7683 on April 27, 2026 at 19:18
thisisvk45 added a commit to thisisvk45/archipelago that referenced this pull request Apr 27, 2026
…wer log test

The agents/README.md had drifted from the codebase. The 'Available
agent IDs' section listed loop_agent, toolbelt_agent, and
singleshot_agent, but runner/agents/registry.py only registers
loop_agent and react_toolbelt_agent. The 'Creating a New Agent'
snippet used the wrong parameter name (input vs run_input) and a
truncated registry snippet that implied overwriting existing entries
rather than adding to them. The README also referenced
tests/test_final_answer_log.py as enforcing the agent contract, but
that file did not exist.

This PR:

1. Resyncs agents/README.md with current symbol names. Removes
   references to phantom agents. Fixes the 'Creating a New Agent'
   snippet to use run_input and the dict-update registration style.
   Updates the stale anthropic/claude-3-5-sonnet-20241022 model
   string to anthropic/claude-opus-4-5 to match paper Table 9.

2. Adds runner/agents/echo_agent as a 60-line reference
   implementation. It does not call any LLM or connect to MCP and
   is intended as the canonical hello-world for new contributors.

3. Adds tests/test_final_answer_log.py with three tight tests:
   every AgentConfigIds value has an AGENT_REGISTRY entry, every
   registered agent_impl is an async callable, and echo_agent
   emits exactly one final_answer log when run end-to-end. A
   follow-up PR can mock LiteLLM to extend end-to-end coverage to
   loop_agent and react_toolbelt_agent.

4. Adds CONTRIBUTING-AGENTS.md codifying the agent contract as a
   one-page checklist.

Out of scope: typed final_answer field on AgentTrajectoryOutput,
RunManifest replay harness for reproducibility issues Mercor-Intelligence#4 and Mercor-Intelligence#8.
The grading-side fence-stripping fix is in Mercor-Intelligence#52.
@thisisvk45
Author

Verification update: ran the fix across the Claude judge family on examples/simple_task to confirm it is not Haiku-specific. Same agent (anthropic/claude-opus-4-5), same task, same harness build, only the judge model varies.

| Judge model | final_score | Runtime | Trajectory status | Notes |
| --- | --- | --- | --- | --- |
| anthropic/claude-opus-4-5 | 1.0 | 38s | completed | clean parse |
| anthropic/claude-sonnet-4-5 | 1.0 | 41s | completed | clean parse |
| anthropic/claude-haiku-4-5-20251001 | 1.0 | 32s | completed | clean parse |

All three judges return fenced JSON without the fix and parse cleanly with it. The 8 unit tests in this PR cover the helper, and these runs cover the integration path end-to-end across the Claude family.

Without the fix, every run in this table would have produced final_score: 0.0 with ValueError("Invalid JSON after 10 attempts"). The agent trajectory itself was identical across runs (same answer, same files touched), confirming the failure mode is judge-side, not agent-side.
