fix(grading): strip markdown code fences from judge JSON responses #52
Open
thisisvk45 wants to merge 1 commit into Mercor-Intelligence:main from
Conversation
The grading judge LLM call sets response_format to json_object, which works natively for OpenAI and Gemini. For Anthropic, LiteLLM simulates this via a system prompt hint rather than provider-side enforcement, and Claude models still wrap structured JSON in markdown fences. The parser at output_llm/main.py and output_llm/negative_criteria.py calls json.loads / model_validate_json on the raw response and fails after 10 retries.

This adds a strip_json_fences helper applied at both parse sites. The helper is idempotent: bare JSON passes through unchanged, so Gemini and OpenAI behavior is unaffected.

Reproduction: configure grading_settings.json to use any anthropic/claude-* model and run any task. Without this fix, every trajectory grades as 0.0 with 'Invalid JSON after 10 attempts'. Tested with anthropic/claude-sonnet-4-5 and anthropic/claude-haiku-4-5 on examples/simple_task end-to-end.
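For illustration (not taken from the repo's code or logs), this is the shape of the failing response: the JSON inside the fences is valid, but json.loads sees the backticks first.

```python
import json

# Illustrative only: a judge response wrapped in markdown fences, as Claude
# tends to return it, versus the bare JSON the parser expects.
fenced = '```json\n{"final_score": 1.0}\n```'
bare = '{"final_score": 1.0}'

try:
    json.loads(fenced)
except json.JSONDecodeError as exc:
    print("fenced response fails to parse:", exc)

print(json.loads(bare))  # bare JSON parses fine
```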
Force-pushed from 33b1f90 to 8da7683
thisisvk45 added a commit to thisisvk45/archipelago that referenced this pull request on Apr 27, 2026
…wer log test

The agents/README.md had drifted from the codebase. The 'Available agent IDs' section listed loop_agent, toolbelt_agent, and singleshot_agent, but runner/agents/registry.py only registers loop_agent and react_toolbelt_agent. The 'Creating a New Agent' snippet used the wrong parameter name (input vs run_input) and a truncated registry snippet that implied overwriting existing entries rather than adding to them. The README also referenced tests/test_final_answer_log.py as enforcing the agent contract, but that file did not exist.

This PR:
1. Resyncs agents/README.md with current symbol names. Removes references to phantom agents. Fixes the 'Creating a New Agent' snippet to use run_input and the dict-update registration style (sketched below). Updates the stale anthropic/claude-3-5-sonnet-20241022 model string to anthropic/claude-opus-4-5 to match paper Table 9.
2. Adds runner/agents/echo_agent as a 60-line reference implementation. It does not call any LLM or connect to MCP and is intended as the canonical hello-world for new contributors.
3. Adds tests/test_final_answer_log.py with three tight tests: every AgentConfigIds value has an AGENT_REGISTRY entry, every registered agent_impl is an async callable, and echo_agent emits exactly one final_answer log when run end-to-end. A follow-up PR can mock LiteLLM to extend end-to-end coverage to loop_agent and react_toolbelt_agent.
4. Adds CONTRIBUTING-AGENTS.md codifying the agent contract as a one-page checklist.

Out of scope: typed final_answer field on AgentTrajectoryOutput, RunManifest replay harness for reproducibility issues Mercor-Intelligence#4 and Mercor-Intelligence#8. The grading-side fence-stripping fix is in Mercor-Intelligence#52.
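The dict-update registration style referenced in item 1 might look roughly like the sketch below. Apart from the AGENT_REGISTRY, run_input, and echo_agent names taken from the commit message, everything here is hypothetical and may not match runner/agents/registry.py.

```python
import asyncio

# Hypothetical stand-in for the registry in runner/agents/registry.py.
AGENT_REGISTRY: dict = {}

async def echo_agent(run_input: str) -> dict:
    # Reference agent per the commit message: no LLM call, no MCP
    # connection, exactly one final_answer log entry.
    return {"logs": [{"type": "final_answer", "content": run_input}]}

# dict-update style: add to the registry rather than overwrite it.
AGENT_REGISTRY.update({"echo_agent": echo_agent})

if __name__ == "__main__":
    print(asyncio.run(AGENT_REGISTRY["echo_agent"]("hello world")))
```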
Author
Verification update: ran the fix across the Claude judge family on examples/simple_task.

[table: runs with each of the three Claude judge models, with and without the fix]

All three judges return fenced JSON without the fix and parse cleanly with it. The 8 unit tests in this PR cover the helper, and these runs cover the integration path end-to-end across the Claude family. Without the fix, every run in this table would have produced final_score: 0.0.
Summary
The grading judge call sets response_format={"type": "json_object"}, which is honored natively by OpenAI and Gemini providers. For Anthropic, LiteLLM simulates this via prompt-level instruction rather than provider-side enforcement, and Claude models still wrap structured output in markdown code fences. The result: every grading attempt with a Claude judge fails parsing, retries 10 times producing identical fenced output, then raises ValueError("Invalid JSON after 10 attempts"). Trajectories grade as 0.0 regardless of agent quality.

This patch adds a small, idempotent strip_json_fences() helper applied at both judge-response parse sites. Bare JSON passes through unchanged, so behavior for Gemini and OpenAI users is unaffected.

Why this matters
Anyone running Archipelago with their own Anthropic key (the most common case for external researchers verifying leaderboard scores) hits this on the first task and assumes the agent failed when actually only the judge failed. The agent trajectory is correct; only the parse step breaks. This is plausibly a contributor to reproducibility discrepancies reported in #4 and #8 when reviewers attempt verification with different judge models.
Reproduction (before this fix)
1. Set examples/simple_task/grading_settings.json to any anthropic/claude-* model.
2. Run examples/simple_task/run.sh.
3. Observe grades.json showing final_score: 0.0 with the log line "Invalid JSON after 10 attempts".

The JSON itself is valid; only the markdown wrapping breaks the parser.
Diagnosis
Two affected call sites in grading/runner/evals/output_llm/:

- main.py:326 calls json.loads(raw_content) inside a 10-retry loop. Falls through to model_validate_json at line 339, which also fails.
- negative_criteria.py:196 calls model_validate_json(neg_raw_content) with no retry loop. The first fenced response is fatal.

The prompt at utils/prompts.py:253 says "Respond with a JSON object" but does not constrain markdown formatting. Anthropic models interpret this as "format JSON readably" and fence it. Gemini happens to comply with bare output.
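To make the second failure mode concrete, here is a minimal stand-in (not the repo's code; the model and field names are invented) showing that Pydantic's model_validate_json also rejects a fenced payload even though the embedded JSON is valid:

```python
from pydantic import BaseModel, ValidationError

# Stand-in schema for illustration only; not the schema used in
# negative_criteria.py.
class JudgeVerdict(BaseModel):
    final_score: float

fenced = '```json\n{"final_score": 1.0}\n```'

try:
    JudgeVerdict.model_validate_json(fenced)
except ValidationError as exc:
    # Pydantic reports invalid JSON because the payload starts with the
    # markdown fence rather than '{'.
    print(exc.errors()[0]["type"])  # typically "json_invalid"
```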
Fix

New helper at grading/runner/evals/output_llm/utils/json_utils.py, applied at both parse sites. ~5 production lines. No changes to the LLM call interface, retry logic, or prompts.
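A minimal sketch consistent with the behavior described here (strip a surrounding markdown fence, leave bare JSON untouched, stay idempotent); the actual 26-line json_utils.py may differ:

```python
import re

def strip_json_fences(raw: str) -> str:
    """Strip a surrounding markdown code fence (with or without a json tag).

    Sketch only: bare JSON passes through unchanged, and applying the
    function twice yields the same result (idempotent).
    """
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1).strip() if match else text

# At the parse sites, the call wraps the raw judge response, e.g.:
#   parsed = json.loads(strip_json_fences(raw_content))
```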
Testing
- grading/tests/test_json_utils.py with 8 unit tests covering: bare JSON, fenced JSON (with and without the json language tag), trailing whitespace, empty input, whitespace-only input, idempotency, and the negative case where backticks appear inside legitimate JSON values.
- End-to-end run of examples/simple_task with an anthropic/claude-opus-4-5 agent and an anthropic/claude-haiku-4-5-20251001 judge: trajectory.status: completed, final_score: 1.0.
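A hypothetical excerpt of the kind of unit tests listed above (pytest style); the import path and exact cases are illustrative, not copied from grading/tests/test_json_utils.py:

```python
# Illustrative tests matching the described behavior; not the PR's file.
from json_utils import strip_json_fences  # import path is an assumption

def test_bare_json_passes_through():
    assert strip_json_fences('{"final_score": 1.0}') == '{"final_score": 1.0}'

def test_fenced_json_is_unwrapped():
    fenced = '```json\n{"final_score": 1.0}\n```'
    assert strip_json_fences(fenced) == '{"final_score": 1.0}'

def test_idempotent():
    once = strip_json_fences('```json\n{"final_score": 1.0}\n```')
    assert strip_json_fences(once) == once
```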
Scope

Intentionally minimal. Does not change the prompt to ask Claude not to fence (provider-dependent and fragile), does not migrate to Pydantic structured output mode (more invasive, and would conflict with negative_criteria.py's existing schema-based path), and does not touch the retry loop. The defensive parse helper is the smallest surface area that fixes the actual observed failure across all providers.

Files changed
- grading/runner/evals/output_llm/utils/json_utils.py (new, 26 lines)
- grading/runner/evals/output_llm/main.py (2 line changes + 1 import)
- grading/runner/evals/output_llm/negative_criteria.py (1 line change + 1 import)
- grading/tests/__init__.py (new, empty)
- grading/tests/test_json_utils.py (new, 8 tests)