docs(agents): sync README with codebase, add echo_agent and final_answer log test#53
Open
thisisvk45 wants to merge 1 commit intoMercor-Intelligence:mainfrom
Open
Conversation
…wer log test The agents/README.md had drifted from the codebase. The 'Available agent IDs' section listed loop_agent, toolbelt_agent, and singleshot_agent, but runner/agents/registry.py only registers loop_agent and react_toolbelt_agent. The 'Creating a New Agent' snippet used the wrong parameter name (input vs run_input) and a truncated registry snippet that implied overwriting existing entries rather than adding to them. The README also referenced tests/test_final_answer_log.py as enforcing the agent contract, but that file did not exist. This PR: 1. Resyncs agents/README.md with current symbol names. Removes references to phantom agents. Fixes the 'Creating a New Agent' snippet to use run_input and the dict-update registration style. Updates the stale anthropic/claude-3-5-sonnet-20241022 model string to anthropic/claude-opus-4-5 to match paper Table 9. 2. Adds runner/agents/echo_agent as a 60-line reference implementation. It does not call any LLM or connect to MCP and is intended as the canonical hello-world for new contributors. 3. Adds tests/test_final_answer_log.py with three tight tests: every AgentConfigIds value has an AGENT_REGISTRY entry, every registered agent_impl is an async callable, and echo_agent emits exactly one final_answer log when run end-to-end. A follow-up PR can mock LiteLLM to extend end-to-end coverage to loop_agent and react_toolbelt_agent. 4. Adds CONTRIBUTING-AGENTS.md codifying the agent contract as a one-page checklist. Out of scope: typed final_answer field on AgentTrajectoryOutput, RunManifest replay harness for reproducibility issues Mercor-Intelligence#4 and Mercor-Intelligence#8. The grading-side fence-stripping fix is in Mercor-Intelligence#52.
7dfc387 to
6845c96
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The
agents/README.mdhad drifted from the actual codebase. The "Available agent IDs" section listedloop_agent,toolbelt_agent, andsingleshot_agent, butrunner/agents/registry.pyonly registersloop_agentandreact_toolbelt_agent. The "Creating a New Agent" snippet usedinputfor the run parameter (actual code usesrun_input) and a truncatedAGENT_REGISTRYsnippet that implied overwriting existing entries rather than adding to them. The README also referencedtests/test_final_answer_log.pyas enforcing the agent contract, but that file did not exist in the repo.This PR resyncs the README with code, adds a minimal
echo_agentas a canonical reference implementation, creates the missing test file (tight scope, see below), and codifies the agent contract that the README claimed was enforced.Why this matters
agents/README.mdis the primary entry point for anyone adding a new agent to the registry. Doc drift here makes the contributor floor higher than it needs to be, and the existing snippet produced non-functional code on first try. The missing test file meant the contract was claimed but not actually checked.Changes
1.
agents/README.mdresyncedSymbol names match
registry.pyandmodels.py. References to phantom agents removed. The "Creating a New Agent" snippet now usesrun_inputand shows registration via dict update rather than a truncated literal. Stale model stringanthropic/claude-3-5-sonnet-20241022updated toanthropic/claude-opus-4-5to match the model used in the paper's Table 9. Added explicit "Agent contract" section listing the three guarantees every registered agent must satisfy.2.
runner/agents/echo_agent/addedAbout 60 lines. Does not call any LLM and does not connect to MCP. Reads the last user message, echoes it back as an assistant message, emits the
final_answerlog vialogger.bind(message_type="final_answer").info(answer), returns a validAgentTrajectoryOutputwithstatus=COMPLETED. Intended as the simplest possible reference implementation for the contract. Also serves as the only agent that can be exercised end-to-end in tests without mocking LiteLLM.3.
tests/test_final_answer_log.pyadded (tight scope)Three tests:
AgentConfigIdsenum value has a correspondingAGENT_REGISTRYentryagent_implis an async callableecho_agentend-to-end run emits exactly onefinal_answerlog with the correctmessage_typebindingThe third test is the only one that exercises an agent end-to-end. A follow-up PR can add mocked LiteLLM coverage so
loop_agentandreact_toolbelt_agentget the same end-to-end check. Tight scope here keeps this PR focused on closing the doc-drift gap rather than introducing a test framework.4.
agents/CONTRIBUTING-AGENTS.mdaddedOne-page checklist for adding a new agent: enum entry, run signature, registry entry, final_answer log requirement, verification step. Mirrors the contract documented in the README so contributors have a single page to consult.
Testing
tests/test_final_answer_log.pypassecho_agentruns end-to-end inexamples/simple_taskstyle invocationOut of scope
final_answerfield onAgentTrajectoryOutput(currently a log-channel convention, worth promoting to a return field but a larger refactor)loop_agentandreact_toolbelt_agent(follow-up PR; this PR keeps test scope tight)RunManifestand replay subcommand for reproducibility (separate PR, will reference open issues Reproduction discrepancy: Kimi K2.5 Thinking scores lower than reported on Law domain #4 and Inconsistency in GLM4.7 Official Reported Mean Score #8)Files changed
agents/README.md(rewrite)agents/CONTRIBUTING-AGENTS.md(new)agents/runner/agents/echo_agent/__init__.py(new, empty)agents/runner/agents/echo_agent/main.py(new, ~60 lines)agents/runner/agents/models.py(1 line: enum entry)agents/runner/agents/registry.py(3 lines: import plus AgentDefn entry)agents/tests/__init__.py(new, empty)agents/tests/test_final_answer_log.py(new, 3 tests)