Add docs eval framework for A/B testing documentation variants #1521
devin-ai-integration[bot] wants to merge 10 commits into main
Conversation
- Framework for evaluating which docs/llms.txt variants best help LLMs write Modal code
- Configurable agents: Claude (Anthropic) and Codex (OpenAI)
- Multi-dimensional evaluation: syntax checking, Modal API pattern matching, LLM-as-judge
- Task generation script to convert Modal examples to platform-agnostic descriptions
- 11 initial eval tasks covering getting_started, web_endpoints, scaling, scheduling, sandboxes, GPU/ML
- Parallel execution via Modal starmap for fast evaluation runs
- Comparison reporting across doc variants

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
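The first two evaluation dimensions listed above (syntax checking and Modal API pattern matching) can be sketched as follows. This is a minimal illustration, not the PR's actual `evaluator.py`; the function names and the specific regex patterns are hypothetical:

```python
import ast
import re

# Hypothetical patterns; the PR's evaluator.py may check a different set.
MODAL_PATTERNS = [
    r"import\s+modal",
    r"modal\.App\(",
    r"@app\.function\(",
]

def check_syntax(code: str) -> bool:
    """AST-based syntax check: does the generated code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def count_modal_patterns(code: str) -> int:
    """Regex-based check for common Modal API usage patterns."""
    return sum(1 for pattern in MODAL_PATTERNS if re.search(pattern, code))
```

The third dimension (LLM-as-judge) would layer an API call on top of these deterministic checks.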
…oxes

- Replace direct API calls with actual coding agent CLIs running in sandboxes
- Claude Code: installed via curl, runs with 'claude -p' in sandbox
- Codex CLI: installed via npm, runs with 'codex --prompt' in sandbox
- Docs mounted as files in sandbox at /workspace/docs.txt
- Agent writes solution to /workspace/solution.py
- Fallback: searches for .py files or extracts code from stdout
- Runner creates sandbox per task, evaluates generated code after extraction
- Evaluator (LLM-as-judge) unchanged — uses direct API calls for scoring

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
…points

- Sandboxes now block all outbound network by default
- DNS resolution at runtime builds /24 CIDR allowlist for LLM API hosts
- Claude sandboxes allow api.anthropic.com, Codex sandboxes allow api.openai.com
- New --no-network-isolation CLI flag to disable for debugging
- AgentConfig gains block_network field (default True)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
block_network and cidr_allowlist are mutually exclusive in the Modal SDK; cidr_allowlist alone restricts outbound traffic to only the listed CIDRs. Tested and verified:

- Isolated sandbox: modal.com BLOCKED, docs.modal.com BLOCKED, api.anthropic.com ALLOWED
- Open sandbox: modal.com ALLOWED, api.anthropic.com ALLOWED

Renamed block_network to network_isolated for clarity.

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
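The runtime DNS-to-CIDR construction described in these commits might look roughly like the following sketch. It assumes the `resolve_cidr_allowlist()` helper named in the PR description works along these lines; the actual implementation may differ:

```python
import ipaddress
import socket

def resolve_cidr_allowlist(hostnames: list[str]) -> list[str]:
    """Resolve each hostname via DNS and widen each address to a /24 CIDR.

    The /24 widening guards against CDN IP rotation: successive requests
    may land on different addresses within the same subnet.
    """
    cidrs: set[str] = set()
    for host in hostnames:
        # gethostbyname_ex returns (canonical_name, aliases, [ip, ip, ...])
        _, _, addresses = socket.gethostbyname_ex(host)
        for addr in addresses:
            # strict=False lets a host address like 1.2.3.4/24 normalize
            # to its network address, 1.2.3.0/24
            network = ipaddress.ip_network(f"{addr}/24", strict=False)
            cidrs.add(str(network))
    return sorted(cidrs)
```

The resulting list would then be passed as `cidr_allowlist` to `Sandbox.create()`, with `block_network` left unset since the two are mutually exclusive.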
…o -m syntax

- Change docs_variant from list[str] | None to str | None (the Modal CLI can't parse union types)
- Support comma-separated variant names: --docs-variant v1,v2
- Update usage examples to use 'modal run -m internal.eval.runner'
- Simplify secrets to only require openai-secret (agents bring their own secrets)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
The Codex CLI v0.111.0 requires OAuth login, which is incompatible with headless Modal Sandboxes. Replace it with a Python script that calls the OpenAI Responses API directly using the OPENAI_API_KEY env var. Also fix code extraction regex escaping, and only init the git repo for Claude Code (it is not needed for the OpenAI API approach).

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
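The fence-extraction step this commit touches can be sketched as below. This is an illustration, not the PR's embedded script; the `extract_code` name is hypothetical:

```python
import re

def extract_code(text: str) -> str:
    """Pull Python source out of a model reply.

    Prefers a fenced ```python block (or any fenced block), and falls
    back to returning the raw text, since the model may omit fences.
    """
    fenced = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    return text.strip()
```

Keeping the pattern in a raw string avoids the escaping pitfalls the commit mentions when such a script is embedded inside another string literal.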
```python
secrets=[
    modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
],
```
🔴 Missing anthropic-secret causes default judge to fail at runtime
Both run_single_task and run_single_task_devin only mount openai-secret, but the default judge_agent is "claude" (set at internal/eval/runner.py:284). When evaluation reaches judge_with_llm in internal/eval/evaluator.py:262-265, it creates an anthropic.Anthropic() client that reads ANTHROPIC_API_KEY from the environment. Since anthropic-secret is never mounted on these functions, the API key is missing and the judge call will fail with an authentication error. This affects every default invocation of the eval runner.
```diff
 secrets=[
+    modal.Secret.from_name("anthropic-secret", required_keys=["ANTHROPIC_API_KEY"]),
     modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
 ],
```
Good catch. Rather than adding anthropic-secret as a required dependency (it doesn't exist in the workspace), I changed the default judge_agent from "claude" to "codex" (OpenAI) since openai-secret is the only available secret. Users can still pass --judge-agent claude if they have anthropic-secret configured. Fixed in commit b8acc05.
```python
secrets=[
    modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
    modal.Secret.from_name("devin-secret", required_keys=["DEVIN_API_KEY"]),
],
```
🔴 Missing anthropic-secret in run_single_task_devin causes default judge to fail
Same issue as run_single_task: run_single_task_devin mounts openai-secret and devin-secret but not anthropic-secret. The default judge_agent="claude" will fail when judge_with_llm (internal/eval/evaluator.py:262-265) tries to use the Anthropic client without ANTHROPIC_API_KEY in the environment.
```diff
 secrets=[
+    modal.Secret.from_name("anthropic-secret", required_keys=["ANTHROPIC_API_KEY"]),
     modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
     modal.Secret.from_name("devin-secret", required_keys=["DEVIN_API_KEY"]),
 ],
```
Same fix applied — changed the default judge_agent from "claude" to "codex" so it uses openai-secret (which is available) by default. Users can still opt into --judge-agent claude if they configure anthropic-secret. See commit b8acc05.
…es missing anthropic-secret)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
This PR is stale because it has been open 30 days with no activity.

This PR was closed because it has been stalled for 5 days with no activity.
Adds an evaluation framework under `internal/eval/` for A/B testing different documentation variants (e.g., different versions of `llms.txt`) to measure which docs best help LLMs write correct Modal code.

Requested by: Rahul Chalamala (rahul@modal.com)
Link to Devin Session
Summary
Core flow:

- `cidr_allowlist` restricts sandbox outbound traffic to only the LLM API endpoints (blocks `modal.com`, `docs.modal.com`, etc.). The Devin agent relies on prompt-based isolation since it runs in its own environment.

Files added:

- `agents.py` — Three agent backends: (1) Claude Code CLI in sandbox, (2) OpenAI Responses API in sandbox, (3) Devin via REST API. Includes sandbox image definitions, network isolation via `cidr_allowlist`, code extraction from solution files/stdout, and Devin session polling.
- `evaluator.py` — Multi-dimensional evaluation: AST syntax check, regex-based Modal pattern matching, LLM-as-judge (uses direct API calls, separate from the sandbox agents)
- `runner.py` — Modal app orchestrator that parallelizes eval runs via `starmap`. Dispatches to `run_single_task` (sandbox agents) or `run_single_task_devin` (Devin API) based on agent type, each with appropriate secrets.
- `tasks.py` — `EvalTask`/`EvalResult` data models, YAML loading, docs variant loading
- `generate_tasks.py` — Script to auto-generate task descriptions from examples using an LLM
- `tasks/*.yaml` — 11 hand-written eval tasks spanning getting_started, web_endpoints, scaling, scheduling, sandboxes, and GPU/ML
- `docs/README.md` — Instructions for adding new documentation variants

Network isolation approach:
- `resolve_cidr_allowlist()` resolves LLM API hostnames (e.g. `api.anthropic.com`, `api.openai.com`) via DNS at runtime
- Resolved addresses are expanded to `/24` CIDR ranges to handle CDN IP rotation
- The `cidr_allowlist` is passed to `Sandbox.create()` — this restricts all outbound traffic to only those CIDRs
- `block_network` and `cidr_allowlist` are mutually exclusive in the Modal SDK; `cidr_allowlist` alone provides the restriction
- `--no-network-isolation` flag for debugging
- The Devin agent does not use `cidr_allowlist` — it runs in Devin's own environment, so isolation is prompt-based only

Tested and verified (e2e with codex agent):

- Agent writes `/workspace/solution.py` ✅
- `modal.com` ❌ BLOCKED, `docs.modal.com` ❌ BLOCKED, `api.openai.com` ✅ ALLOWED

Type of Change
Important review notes
Codex agent uses OpenAI Responses API, not Codex CLI — The Codex CLI v0.111.0 requires OAuth login, which is incompatible with headless Modal Sandboxes. Instead, the codex agent uses a Python script (`_OPENAI_AGENT_SCRIPT`) that calls the OpenAI Responses API directly using the `OPENAI_API_KEY` env var. This is a single API call rather than iterative tool use. Review question: Is this approach acceptable, or should we add multi-turn tool-use behavior?

Claude Code CLI not tested end-to-end — The Claude Code path has structural correctness (correct CLI flags, `pty=True` per the `sandbox_agent.py` example, git repo initialization) but hasn't been tested e2e due to the missing `anthropic-secret` in the workspace. The codex/OpenAI path has been fully tested and verified.

Devin agent not tested end-to-end — The Devin API integration is structurally complete (session creation, polling, message extraction) but has not been tested due to the missing `devin-secret` in the workspace. Key risks:

- Message extraction assumes certain response fields (`role`, `text`, `content`) — not verified against actual API responses
- Uses `len(code) > 20` as the minimum viable length — arbitrary and could miss short valid solutions
- Session ID construction (`devin_id = f"devin-{session_id}"`) follows the docs pattern but hasn't been verified
- The `requests` library is imported at module level in `agents.py`, which should be fine since it's pip-installed in `eval_image`, but could cause issues if the module is imported in unexpected contexts

Code extraction relies on LLM cooperation — The OpenAI Responses API approach uses an embedded Python script (`_OPENAI_AGENT_SCRIPT`) that tries to extract code from markdown fences. If the LLM returns code without fences, the raw text is written. This has been tested and works with o3-mini, but edge cases may occur with different models or prompt variations.

Network isolation limitations:
- Uses `/24` CIDR ranges from DNS resolution — if an LLM API uses different subnets across regions, this could cause intermittent failures

Nested sandbox creation — The runner creates sandboxes inside `@app.function()`. This is a supported pattern (see the `codelangchain/agent.py` example), but adds nesting complexity.

Hardcoded secret names — Requires Modal secrets named `anthropic-secret`, `openai-secret`, and `devin-secret` to be configured in the workspace.

CLI installation reliability — Sandbox images install tools via `curl https://claude.ai/install.sh | bash`. This external URL could change or break.

Human review checklist

- Verify the Claude Code CLI flags (`-p`, `--output-format text`)
- Add `pty=True` to the Claude Code `sandbox.exec()` call (✅ added in commit ca600ff)
- Test the Claude Code path with `anthropic-secret` configured
- Test the Devin path with `devin-secret` configured — Critical since this path is completely untested
- Verify the `/24` CIDR ranges are broad enough for CDN-backed LLM APIs

Updates since last revision
Latest (commit b4474e6):

- Added Devin as a third agent backend via its REST API
- Creates a session with `POST /v3/organizations/sessions` with task + docs in the prompt
- Polls `GET /v3/organizations/sessions/{devin_id}` until terminal status (`exit`, `error`, `suspended`)
- Extracts code from `GET /v3/organizations/sessions/{devin_id}/messages`
- Requires `devin-secret` with `DEVIN_API_KEY`
- Refactored shared logic into `_evaluate_and_build_result` and `_error_result` helper functions. Split into two Modal functions: `run_single_task` (sandbox agents) and `run_single_task_devin` (Devin API) with separate secret requirements.
- Added `requests` to `eval_image` — Required for Devin API HTTP calls

Previous (commit ca600ff):
- Added `pty=True` for the Claude Code CLI — Per the `sandbox_agent.py` example, Claude Code requires a PTY. This was previously flagged as missing and has now been added to the `sandbox.exec()` call for Claude Code.

Earlier (commit 8fc7088):
- Replaced the Codex CLI with `client.responses.create()` via the OpenAI SDK using the `OPENAI_API_KEY` env var.
- Verified e2e: agent writes `/workspace/solution.py` → evaluator runs syntax/pattern checks. Network isolation confirmed working (`modal.com` blocked, `api.openai.com` allowed).
- Used a raw string (`r'''`) for `_OPENAI_AGENT_SCRIPT` to ensure correct escaping of regex patterns.

Initial implementation:
- Uses `cidr_allowlist` to restrict sandbox outbound traffic to only LLM API endpoints
- DNS resolution at runtime builds a `/24` CIDR allowlist for `api.anthropic.com` and `api.openai.com`
- Blocks `modal.com` and `docs.modal.com` while allowing LLM APIs
- Renamed the `block_network` field to `network_isolated` for clarity (avoids confusion with the Modal SDK's `block_network` parameter)
- Added a `--no-network-isolation` CLI flag for debugging
- Docs mounted at `/workspace/docs.txt`
- Code extraction prefers `/workspace/solution.py`, then searches for `.py` files, then parses stdout
- Cleaned up `runner.py` to pass CI ruff checks