
Add docs eval framework for A/B testing documentation variants#1521

Closed
devin-ai-integration[bot] wants to merge 10 commits into main from
devin/1772862463-docs-eval-framework

Conversation


@devin-ai-integration devin-ai-integration Bot commented Mar 7, 2026

Adds an evaluation framework under internal/eval/ for A/B testing different documentation variants (e.g., different versions of llms.txt) to measure which docs best help LLMs write correct Modal code.

Requested by: Rahul Chalamala (rahul@modal.com)
Link to Devin Session

Summary

Core flow:

  1. Each Modal example is converted into a platform-agnostic task description (stored as YAML)
  2. A coding agent (Claude Code CLI, OpenAI Responses API, or Devin API — configurable) generates code based on a docs variant. Sandbox-based agents (Claude/Codex) run inside Modal Sandboxes with docs mounted as files; the Devin agent creates a session via the REST API with docs included in the prompt.
  3. Network isolation via cidr_allowlist restricts sandbox outbound traffic to only the LLM API endpoints (blocks modal.com, docs.modal.com, etc.). Devin agent relies on prompt-based isolation since it runs in its own environment.
  4. The generated code is extracted and evaluated on syntax validity, Modal API pattern usage, and LLM-as-judge scoring
  5. Results are aggregated into a comparison report across doc variants
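The five steps above can be sketched as a minimal orchestration loop. All names here are illustrative stand-ins, not the actual internal/eval API; the real logic is split across runner.py, agents.py, and evaluator.py:

```python
# Hypothetical sketch of the core flow: every (docs variant, task) pair is
# run through an agent, scored, and aggregated into a per-variant report.

def run_eval(tasks, docs_variants, generate_code, evaluate_code):
    """Run every (task, variant) pair and return mean score per variant."""
    results = {variant: [] for variant in docs_variants}
    for variant in docs_variants:
        for task in tasks:
            code = generate_code(task, variant)   # step 2: agent writes code
            score = evaluate_code(task, code)     # step 4: syntax/pattern/judge
            results[variant].append(score)
    # step 5: aggregate into a comparison report across doc variants
    return {v: sum(s) / len(s) for v, s in results.items() if s}
```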

Files added:

  • agents.py — Three agent backends: (1) Claude Code CLI in sandbox, (2) OpenAI Responses API in sandbox, (3) Devin via REST API. Includes sandbox image definitions, network isolation via cidr_allowlist, code extraction from solution files/stdout, and Devin session polling.
  • evaluator.py — Multi-dimensional evaluation: AST syntax check, regex-based Modal pattern matching, LLM-as-judge (uses direct API calls, separate from the sandbox agents)
  • runner.py — Modal app orchestrator that parallelizes eval runs via starmap. Dispatches to run_single_task (sandbox agents) or run_single_task_devin (Devin API) based on agent type, each with appropriate secrets.
  • tasks.py — EvalTask / EvalResult data models, YAML loading, docs variant loading
  • generate_tasks.py — Script to auto-generate task descriptions from examples using an LLM
  • tasks/*.yaml — 11 hand-written eval tasks spanning getting_started, web_endpoints, scaling, scheduling, sandboxes, and GPU/ML
  • docs/README.md — Instructions for adding new documentation variants
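The two deterministic checks in evaluator.py can be sketched roughly as follows. The specific regex patterns and the equal weighting are assumptions for illustration, not the actual pattern list:

```python
import ast
import re

# Illustrative Modal API patterns; evaluator.py's real list may differ.
MODAL_PATTERNS = [
    r"modal\.App\(",          # app definition
    r"@app\.function\(",      # function decorator
    r"modal\.Image\.",        # image construction
]

def syntax_ok(code: str) -> bool:
    """AST-based syntax validity check."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def pattern_score(code: str) -> float:
    """Fraction of expected Modal API patterns found via regex."""
    hits = sum(bool(re.search(p, code)) for p in MODAL_PATTERNS)
    return hits / len(MODAL_PATTERNS)
```

The LLM-as-judge dimension sits on top of these and uses direct API calls, separate from the sandbox agents.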

Network isolation approach:

  • resolve_cidr_allowlist() resolves LLM API hostnames (e.g. api.anthropic.com, api.openai.com) via DNS at runtime
  • Resolved IPs are broadened to /24 CIDR ranges to handle CDN IP rotation
  • cidr_allowlist is passed to Sandbox.create() — this restricts all outbound traffic to only those CIDRs
  • Note: block_network and cidr_allowlist are mutually exclusive in the Modal SDK; cidr_allowlist alone provides the restriction
  • Enabled by default; can be disabled with --no-network-isolation for debugging
  • Devin agent does not use cidr_allowlist — it runs in Devin's own environment, so isolation is prompt-based only
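A minimal sketch of the resolution-and-broadening step, assuming the hostnames named above; the actual helper in agents.py may differ in details:

```python
import ipaddress
import socket

def widen_to_cidr(ip: str) -> str:
    """Broaden a single resolved IP to its containing /24 network."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

def resolve_cidr_allowlist(hosts=("api.anthropic.com", "api.openai.com")):
    """Resolve each LLM API hostname via DNS and collect /24 CIDRs."""
    cidrs = set()
    for host in hosts:
        # a CDN-backed hostname may return several A records
        for info in socket.getaddrinfo(host, 443, socket.AF_INET):
            cidrs.add(widen_to_cidr(info[4][0]))
    return sorted(cidrs)
```

The resulting list is what gets passed as `cidr_allowlist` to `Sandbox.create()`; anything outside those ranges is unreachable from the sandbox.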

Tested and verified (e2e with codex agent):

  • Agent runs in sandbox, calls OpenAI API, generates code ✅
  • Code is extracted and written to /workspace/solution.py
  • Evaluator runs syntax checks and pattern matching ✅
  • Isolated sandbox: modal.com ❌ BLOCKED, docs.modal.com ❌ BLOCKED, api.openai.com ✅ ALLOWED

Type of Change

  • Other (Changes to the codebase, but not to examples)

Important review notes

  1. Codex agent uses OpenAI Responses API, not Codex CLI — The Codex CLI v0.111.0 requires OAuth login which is incompatible with headless Modal Sandboxes. Instead, the codex agent uses a Python script (_OPENAI_AGENT_SCRIPT) that calls the OpenAI Responses API directly using OPENAI_API_KEY env var. This is a single API call rather than iterative tool use. Review question: Is this approach acceptable, or should we add multi-turn tool-use behavior?

  2. Claude Code CLI not tested end-to-end — The Claude Code path has structural correctness (correct CLI flags, pty=True per sandbox_agent.py example, git repo initialization) but hasn't been tested e2e due to missing anthropic-secret in the workspace. The codex/OpenAI path has been fully tested and verified.

  3. Devin agent not tested end-to-end — The Devin API integration is structurally complete (session creation, polling, message extraction) but has not been tested due to missing devin-secret in the workspace. Key risks:

    • Message extraction logic relies on guessing response format (role, text, content fields) — not verified against actual API responses
    • Code extraction uses len(code) > 20 as minimum viable length — arbitrary and could miss short valid solutions
    • Session ID handling (devin_id = f"devin-{session_id}") follows docs pattern but hasn't been verified
    • The requests library is imported at module level in agents.py, which should be fine since it's pip-installed in eval_image, but could cause issues if the module is imported in unexpected contexts
  4. Code extraction relies on LLM cooperation — The OpenAI Responses API approach uses an embedded Python script (_OPENAI_AGENT_SCRIPT) that tries to extract code from markdown fences. If the LLM returns code without fences, the raw text is written. This has been tested and works with o3-mini, but edge cases may occur with different models or prompt variations.

  5. Network isolation limitations:

    • Uses /24 CIDR ranges from DNS resolution — if LLM API uses different subnets across regions, this could cause intermittent failures
    • DNS resolution happens at sandbox creation time — if the API uses a different IP when agent makes requests, it will be blocked
    • Modal Sandboxes may be in different regions with different IP paths to the APIs
  6. Nested sandbox creation — The runner creates sandboxes inside @app.function(). This is a supported pattern (see codelangchain/agent.py example), but adds nesting complexity.

  7. Hardcoded secret names — Requires Modal secrets named anthropic-secret, openai-secret, and devin-secret to be configured in the workspace.

  8. CLI installation reliability — Sandbox images install tools via curl https://claude.ai/install.sh | bash. This external URL could change or break.
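The fence-extraction fallback described in note 4 amounts to something like the following sketch (the exact regex in _OPENAI_AGENT_SCRIPT is an assumption): prefer a ```python or bare ``` fence, and fall back to the raw text when no fence is found.

```python
import re

# Match a ```python fence or a bare ``` fence; DOTALL lets the body span lines.
FENCE_RE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(text: str) -> str:
    match = FENCE_RE.search(text)
    # No fence: write the raw response text unchanged, as note 4 describes.
    return match.group(1).strip() if match else text.strip()
```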

Human review checklist

  • Verify the OpenAI Responses API approach meets requirements — Is a single API call acceptable, or should we add iterative agentic coding behavior?
  • Verify CLI flags for Claude Code (-p, --output-format text)
  • Add pty=True to the Claude Code sandbox.exec() call (✅ added in commit ca600ff)
  • Test Claude Code path end-to-end with anthropic-secret configured
  • Test Devin agent end-to-end with devin-secret configured — Critical since this path is completely untested
  • Verify Devin API message extraction logic works with actual responses
  • Verify network isolation works reliably across Modal regions/datacenters
  • Consider if /24 CIDR ranges are broad enough for CDN-backed LLM APIs

Updates since last revision

Latest (commit b4474e6):

  • Added Devin API as a third agent option — Alongside Claude Code CLI and OpenAI Responses API, the framework now supports using the Devin REST API (v3). The Devin agent:
    • Creates a Devin session via POST /v3/organizations/sessions with task + docs in the prompt
    • Polls GET /v3/organizations/sessions/{devin_id} until terminal status (exit, error, suspended)
    • Extracts code from session messages via GET /v3/organizations/sessions/{devin_id}/messages
    • Runs in Devin's own environment (not a Modal Sandbox), so network isolation is prompt-based
    • Requires a devin-secret with DEVIN_API_KEY
  • Refactored runner — Extracted _evaluate_and_build_result and _error_result helper functions. Split into two Modal functions: run_single_task (sandbox agents) and run_single_task_devin (Devin API) with separate secret requirements.
  • Added requests to eval_image — Required for Devin API HTTP calls
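The session-polling loop can be sketched as below. The base URL and response shape are assumptions (and, per review note 3, unverified against real API responses); the actual agents.py code uses `requests`, but urllib keeps this sketch dependency-free:

```python
import json
import time
import urllib.request

# Terminal statuses named above; polling stops on any of these.
TERMINAL_STATUSES = {"exit", "error", "suspended"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES

def poll_session(base_url: str, devin_id: str, api_key: str, interval: float = 10.0):
    """Poll GET /v3/organizations/sessions/{devin_id} until a terminal status."""
    url = f"{base_url}/v3/organizations/sessions/{devin_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    while True:
        with urllib.request.urlopen(req) as resp:
            session = json.load(resp)
        if is_terminal(session.get("status", "")):
            return session
        time.sleep(interval)
```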

Previous (commit ca600ff):

  • Added pty=True for Claude Code CLI — Per the sandbox_agent.py example, Claude Code requires a PTY. This was previously flagged as missing and has now been added to the sandbox.exec() call for Claude Code.

Earlier (commit 8fc7088):

  • Replaced Codex CLI with OpenAI Responses API — The Codex CLI v0.111.0 requires OAuth login (incompatible with headless sandboxes). Now uses a Python script that calls client.responses.create() with the OpenAI SDK using OPENAI_API_KEY env var.
  • E2E tested and verified — Full pipeline works: agent runs in sandbox → calls OpenAI API → generates code → writes to /workspace/solution.py → evaluator runs syntax/pattern checks. Network isolation confirmed working (modal.com blocked, api.openai.com allowed).
  • Git repo initialization only for Claude — No longer runs git init for codex agent (not needed for API approach).
  • Fixed code extraction regex — Uses raw string r''' for _OPENAI_AGENT_SCRIPT to ensure correct escaping of regex patterns.

Initial implementation:

  • Network isolation implemented: Uses cidr_allowlist to restrict sandbox outbound traffic to only LLM API endpoints
    • DNS resolution at runtime builds /24 CIDR allowlist for api.anthropic.com and api.openai.com
    • Tested and verified: sandboxes block modal.com and docs.modal.com while allowing LLM APIs
    • Renamed block_network field to network_isolated for clarity (avoids confusion with Modal SDK's block_network parameter)
    • Added --no-network-isolation CLI flag for debugging
  • Replaced direct LLM API calls with actual coding agent CLIs running inside Modal Sandboxes
  • Agents run in isolated sandbox environments with docs mounted as files at /workspace/docs.txt
  • Code extraction from sandbox: tries /workspace/solution.py, then searches for .py files, then parses stdout
  • Evaluator (LLM-as-judge) unchanged — still uses direct API calls for scoring
  • Fixed f-string formatting in runner.py to pass CI ruff checks
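The three-step extraction order above can be written as a pure function over the sandbox's files and stdout; the real code reads these from the sandbox filesystem and process output, and the sorted-iteration tiebreak here is an assumption:

```python
def extract_solution(files: dict[str, str], stdout: str) -> str:
    # 1. preferred location the agent is instructed to write to
    if "/workspace/solution.py" in files:
        return files["/workspace/solution.py"]
    # 2. any other .py file found in the workspace
    for path, content in sorted(files.items()):
        if path.endswith(".py"):
            return content
    # 3. last resort: whatever the agent printed to stdout
    return stdout
```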


- Framework for evaluating which docs/llms.txt variants best help LLMs write Modal code
- Configurable agents: Claude (Anthropic) and Codex (OpenAI)
- Multi-dimensional evaluation: syntax checking, Modal API pattern matching, LLM-as-judge
- Task generation script to convert Modal examples to platform-agnostic descriptions
- 11 initial eval tasks covering getting_started, web_endpoints, scaling, scheduling, sandboxes, GPU/ML
- Parallel execution via Modal starmap for fast evaluation runs
- Comparison reporting across doc variants

Co-Authored-By: Rahul Chalamala <rahul@modal.com>

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

devin-ai-integration Bot and others added 8 commits March 7, 2026 05:52
Co-Authored-By: Rahul Chalamala <rahul@modal.com>
…oxes

- Replace direct API calls with actual coding agent CLIs running in sandboxes
- Claude Code: installed via curl, runs with 'claude -p' in sandbox
- Codex CLI: installed via npm, runs with 'codex --prompt' in sandbox
- Docs mounted as files in sandbox at /workspace/docs.txt
- Agent writes solution to /workspace/solution.py
- Fallback: searches for .py files or extracts code from stdout
- Runner creates sandbox per task, evaluates generated code after extraction
- Evaluator (LLM-as-judge) unchanged - uses direct API calls for scoring

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
…points

- Sandboxes now block all outbound network by default
- DNS resolution at runtime builds /24 CIDR allowlist for LLM API hosts
- Claude sandboxes allow api.anthropic.com, Codex sandboxes allow api.openai.com
- New --no-network-isolation CLI flag to disable for debugging
- AgentConfig gains block_network field (default True)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
block_network and cidr_allowlist are mutually exclusive in Modal SDK.
cidr_allowlist alone restricts outbound traffic to only the listed CIDRs.

Tested and verified:
- Isolated sandbox: modal.com BLOCKED, docs.modal.com BLOCKED, api.anthropic.com ALLOWED
- Open sandbox: modal.com ALLOWED, api.anthropic.com ALLOWED

Renamed block_network to network_isolated for clarity.

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
…o -m syntax

- Change docs_variant from list[str]|None to str|None (Modal CLI can't parse union types)
- Support comma-separated variant names: --docs-variant v1,v2
- Update usage examples to use 'modal run -m internal.eval.runner'
- Simplify secrets to only require openai-secret (agents bring their own secrets)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
The Codex CLI v0.111.0 requires OAuth login which is incompatible with
headless Modal Sandboxes. Replace it with a Python script that calls the
OpenAI Responses API directly using OPENAI_API_KEY env var.

Also fix code extraction regex escaping and only init git repo for Claude
Code (not needed for the OpenAI API approach).

Co-Authored-By: Rahul Chalamala <rahul@modal.com>
Co-Authored-By: Rahul Chalamala <rahul@modal.com>
Co-Authored-By: Rahul Chalamala <rahul@modal.com>

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment thread internal/eval/runner.py
Comment on lines +108 to +110
secrets=[
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
],

🔴 Missing anthropic-secret causes default judge to fail at runtime

Both run_single_task and run_single_task_devin only mount openai-secret, but the default judge_agent is "claude" (set at internal/eval/runner.py:284). When evaluation reaches judge_with_llm in internal/eval/evaluator.py:262-265, it creates an anthropic.Anthropic() client that reads ANTHROPIC_API_KEY from the environment. Since anthropic-secret is never mounted on these functions, the API key is missing and the judge call will fail with an authentication error. This affects every default invocation of the eval runner.

Suggested change
secrets=[
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
],
secrets=[
modal.Secret.from_name("anthropic-secret", required_keys=["ANTHROPIC_API_KEY"]),
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
],

Good catch. Rather than adding anthropic-secret as a required dependency (it doesn't exist in the workspace), I changed the default judge_agent from "claude" to "codex" (OpenAI) since openai-secret is the only available secret. Users can still pass --judge-agent claude if they have anthropic-secret configured. Fixed in commit b8acc05.

Comment thread internal/eval/runner.py
Comment on lines +151 to +154
secrets=[
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
modal.Secret.from_name("devin-secret", required_keys=["DEVIN_API_KEY"]),
],

🔴 Missing anthropic-secret in run_single_task_devin causes default judge to fail

Same issue as run_single_task: run_single_task_devin mounts openai-secret and devin-secret but not anthropic-secret. The default judge_agent="claude" will fail when judge_with_llm (internal/eval/evaluator.py:262-265) tries to use the Anthropic client without ANTHROPIC_API_KEY in the environment.

Suggested change
secrets=[
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
modal.Secret.from_name("devin-secret", required_keys=["DEVIN_API_KEY"]),
],
secrets=[
modal.Secret.from_name("anthropic-secret", required_keys=["ANTHROPIC_API_KEY"]),
modal.Secret.from_name("openai-secret", required_keys=["OPENAI_API_KEY"]),
modal.Secret.from_name("devin-secret", required_keys=["DEVIN_API_KEY"]),
],

Same fix applied — changed the default judge_agent from "claude" to "codex" so it uses openai-secret (which is available) by default. Users can still opt into --judge-agent claude if they configure anthropic-secret. See commit b8acc05.

…es missing anthropic-secret)

Co-Authored-By: Rahul Chalamala <rahul@modal.com>

github-actions Bot commented Apr 9, 2026

This PR is stale because it has been open 30 days with no activity.
If the stale label remains and there are no comments, this will be closed in 5 days.

@github-actions github-actions Bot added the Stale label Apr 9, 2026

This PR was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions Bot closed this Apr 15, 2026
@github-actions github-actions Bot deleted the devin/1772862463-docs-eval-framework branch April 15, 2026 15:53