diff --git a/agents-multi-repo.md b/agents-multi-repo.md new file mode 100644 index 0000000..cb268da --- /dev/null +++ b/agents-multi-repo.md @@ -0,0 +1,120 @@ +# Agents for Multi-Repo Changes + +> Documents the new agents introduced to handle multi-repo PRDs, what each one does, and how they integrate into the existing eng-team workflow. + +--- + +## Updated Workflow + +``` +Orchestrator + └── Tech Lead (multi-repo plan + blast radius analysis) + └── Contract Agent (locks interface changes before anyone implements) + └── Engineer × N (parallel, one per repo, gated on contract finalization) + └── Reviewer (cross-repo, reads all diffs together) +``` + +The Orchestrator drives the full pipeline. The core sequence — Tech Lead → Engineer → Reviewer — stays the same. Two things change: the Contract Agent is inserted between planning and implementation, and the Engineer step becomes parallel across repos. + +--- + +## Existing Agents — What Changes + +### Orchestrator (extended, not replaced) + +The Orchestrator already handles sequencing and agent coordination. For multi-repo changes it needs two behavioral additions: + +- **Parallel engineer dispatch** — spins up one Engineer agent per repo rather than always one +- **Contract gate enforcement** — Engineer agents that depend on a contract change are blocked from starting until the Contract Agent has finalized and published the contract artifact + +No new agent is needed here. This is a logic and configuration extension. 
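The two Orchestrator additions above (parallel engineer dispatch and contract gate enforcement) can be sketched with `asyncio`. This is a minimal illustration, not the eng-team implementation: `RepoTask`, `needs_contract`, `run_engineer`, `run_contract_agent`, and `dispatch` are hypothetical names, and the real agents are replaced by log entries.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RepoTask:
    repo: str
    needs_contract: bool  # hypothetical flag, assumed to come from the Tech Lead's plan


async def run_engineer(task: RepoTask, contract_ready: asyncio.Event, log: list[str]) -> None:
    # Contract-dependent engineers block here until the artifact is published.
    if task.needs_contract:
        await contract_ready.wait()
    log.append(f"start:{task.repo}")  # stand-in for launching the real Engineer agent


async def run_contract_agent(contract_ready: asyncio.Event, log: list[str]) -> None:
    await asyncio.sleep(0)  # stand-in for the real contract work
    log.append("contract:published")
    contract_ready.set()  # opens the gate for every waiting engineer


async def dispatch(tasks: list[RepoTask]) -> list[str]:
    log: list[str] = []
    contract_ready = asyncio.Event()
    jobs = [run_engineer(t, contract_ready, log) for t in tasks]
    # The Contract Agent only runs when at least one repo depends on a contract.
    if any(t.needs_contract for t in tasks):
        jobs.append(run_contract_agent(contract_ready, log))
    await asyncio.gather(*jobs)
    return log
```

Calling `asyncio.run(dispatch([RepoTask("billing", True), RepoTask("docs", False)]))` lets the independent `docs` engineer start right away, while `billing` starts only after the contract is published. A single shared event is the simplest gate; a per-contract event would be needed if several unrelated contracts change in one PRD.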
+ +### Tech Lead (extended) + +In addition to its existing responsibilities, the Tech Lead must: + +- Read the **system-level architecture document** (service inventory, dependency graph, cross-cutting conventions) before writing the plan +- Identify which service boundaries the PRD crosses +- Produce a **cross-repo plan artifact** that maps each acceptance criterion to a specific repo and lists every file expected to change per repo +- Flag any public API surface, event schema, or shared type that will be modified; this is the signal that triggers the Contract Agent + +The plan artifact is the Orchestrator's input for deciding whether to invoke the Contract Agent and how many Engineer agents to spin up. + +### Engineer (extended) + +Each Engineer agent operates on a single repo, same as before. The changes are: + +- Multiple instances run in parallel, one per repo +- Instances that depend on a contract change receive the finalized contract artifact as additional context before starting +- Instances working on independent repos (no shared contract dependency) start immediately in parallel + +### Reviewer (extended) + +The Reviewer receives all diffs across all repos simultaneously and adds three cross-repo checks to its existing checklist: + +- Does Service B's usage of the new API match what Service A implemented? +- Is there any service that calls the changed interface that was not included in the plan? +- Are all contract changes backward compatible, or is there a coordinated breaking change with an explicit migration plan? + +The Reviewer requires the dependency graph from the system-level architecture document to know which services to check; it cannot discover blast radius from diffs alone. 
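The second Reviewer check (a caller of the changed interface that is missing from the plan) reduces to a set difference over the dependency graph. The sketch below assumes the graph is available as a mapping from each service to the services it calls; `unplanned_consumers` and the service names are hypothetical, and only direct callers are considered.

```python
def unplanned_consumers(
    calls: dict[str, set[str]],  # service -> services it calls (assumed graph shape)
    changed: str,                # service whose interface is changing
    planned: set[str],           # services covered by the Tech Lead's plan
) -> set[str]:
    """Return direct consumers of `changed` that the plan does not cover."""
    consumers = {svc for svc, deps in calls.items() if changed in deps}
    return consumers - planned


graph = {
    "checkout": {"billing"},
    "reports": {"billing"},
    "billing": set(),
}
missing = unplanned_consumers(graph, "billing", {"billing", "checkout"})
```

Here `missing` contains `reports`, which calls `billing` but was left out of the plan. This is why the check needs the dependency graph: nothing in the `billing` or `checkout` diffs mentions `reports`.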
+ +--- + +## New Agent: Contract Agent + +### When it is invoked + +The Orchestrator invokes the Contract Agent when either of the following is true: + +- The PRD touches more than one repo and the Tech Lead's plan flags a contract those repos share (public API surface, event schema, or shared type) +- The PRD touches a single repo but the Tech Lead's plan flags a change to a public API surface, event schema, or shared library interface + +The Contract Agent is skipped entirely for purely internal single-repo changes (business logic, UI, infra config with no public interface change) and for multi-repo changes across independent services with no shared contract. + +### What it does + +The Contract Agent owns the interface boundary between services. Its job is to produce a **contract artifact** (a precise, versioned definition of what is changing at the service boundary) and lock it before any Engineer agent starts implementing. + +Specifically, it: + +1. Reads the Tech Lead's plan artifact and the current interface definitions (OpenAPI specs, proto files, shared types, event schemas) for all affected services +2. Produces a diff of the contract change: what is being added, modified, or removed at the interface boundary +3. Checks backward compatibility: flags breaking changes and requires an explicit migration or versioning plan if any exist +4. Publishes the finalized contract artifact so downstream Engineer agents can use it as a source of truth +5. 
Blocks the Orchestrator from starting any dependent Engineer agent until the artifact is published + +### What it does not do + +- It does not write implementation code +- It does not modify business logic +- It does not review the final diffs — that is the Reviewer's job + +### Output + +A contract artifact containing: +- The precise interface change (structured diff of the API surface) +- Backward compatibility assessment (compatible / breaking + migration plan) +- A list of all services that consume the changed interface, derived from the dependency graph + +--- + +## Sequencing Rules + +| Condition | Contract Agent | Engineer agents | +|---|---|---| +| Single-repo, no public interface change | Skipped | One agent, starts immediately | +| Single-repo, public interface change | Invoked | Starts after contract is finalized | +| Multi-repo, independent services (no shared contract) | Skipped | All agents start in parallel immediately | +| Multi-repo, shared contract change | Invoked | Contract-dependent agents wait; independent agents start immediately | + +--- + +## Summary + +One new agent is introduced: the **Contract Agent**. It fills the gap that exists in the current workflow — there was no role responsible for locking interface changes before implementation begins. Without it, parallel Engineer agents make independent assumptions about the same interface, and mismatches only surface at review or integration. + +Everything else — Orchestrator, Tech Lead, Engineer, Reviewer — retains its existing role and gains scoped extensions to handle multi-repo context and parallel execution. 
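The Sequencing Rules table above can be read as one small decision function. This is a sketch under stated assumptions: `Schedule` and `plan_schedule` are hypothetical names, and the input is assumed to be a mapping from repo name to whether that repo's work depends on a contract change, as derived from the Tech Lead's plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    invoke_contract_agent: bool
    start_immediately: frozenset[str]
    wait_for_contract: frozenset[str]


def plan_schedule(repos: dict[str, bool]) -> Schedule:
    """`repos` maps repo name -> True if its work depends on a contract change."""
    waiting = frozenset(name for name, depends in repos.items() if depends)
    return Schedule(
        invoke_contract_agent=bool(waiting),  # skipped when nothing depends on a contract
        start_immediately=frozenset(repos) - waiting,
        wait_for_contract=waiting,
    )
```

All four table rows fall out of the same rule. For example, `plan_schedule({"api": True, "web": True, "docs": False})` invokes the Contract Agent, holds `api` and `web`, and lets `docs` start immediately.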
+ +--- + +*Document authored from eng-team architectural discussion — May 2026.* diff --git a/docs/code-as-agent-harness.md b/docs/code-as-agent-harness.md new file mode 100644 index 0000000..504924d --- /dev/null +++ b/docs/code-as-agent-harness.md @@ -0,0 +1,141 @@ +# Code as Agent Harness — Implications for eng-team + +Summary of how [Code as Agent Harness](https://arxiv.org/abs/2605.18747) (Ning et al., 2026) relates to **eng-team**, and a prioritized backlog for strengthening the harness. + +--- + +## Article in one paragraph + +The survey argues that in agentic systems, **code is not only output** — it is the **operational harness**: the executable substrate for reasoning, acting, environment modeling, and verification. A good harness makes behavior **executable, inspectable, stateful, and verifiable** over long horizons. Progress depends as much on harness engineering (tools, memory, oracles, control loops, multi-agent shared state) as on the base model. + +--- + +## How eng-team already fits + +eng-team is a **code-centric agent harness** for the slice PRD → spec → implementation → review → merge-ready PR: + +| Paper layer | eng-team today | +|-------------|----------------| +| **Harness interface** | `CLAUDE.md`, `technical_spec`, Engineer edits, Reviewer `git diff` | +| **Harness mechanisms** | Orchestrator phases, bounded loops, `repo_context`, test/lint gates | +| **Multi-agent over code** | Tech Lead → Engineer → Reviewer via `.eng_team/task_*.json` (orchestrator-only; no peer chat) | +| **Verifiable closure** | Tests + linter + structured review checklist | + +This aligns with **PHILOSOPHY.md**: bottom-up trust, narrow insertion point, diff-based review (output over intent). + +The article does **not** suggest replacing this design. It names what to harden next: **oracle quality**, **shared state discipline**, **harness telemetry**, and **governed iteration**. + +--- + +## Key upgrades (article → eng-team) + +### 1. 
Scratchpad as program state + +Extend `.eng_team/task_*.json` beyond narrative logging: + +- `verification_evidence` (tests run, linter result, diff stats) +- `assumptions[]` with `verified_by` (test / diff hunk / reviewer item) +- Per-phase `read_set` / `write_set` +- Commit pins: `base_commit`, `spec_version`, `impl_commit` + +*Paper: §2.3, §4.2, §5.2.4 — transactional shared program state.* + +### 2. Verification stack (not only “tests passed”) + +On approve, require an **evidence bundle** and explicit limits: + +- What was checked (unit / integration / security hints / coverage on touched files) +- `untested_regions[]` — what the oracle does **not** prove +- For `complex` tasks: runnable `acceptance_checks` or test skeletons in the spec + +*Paper: §5.2.1–5.2.2 — oracle adequacy and semantic verification beyond executable feedback.* + +### 3. Harness-level evaluation + +Log per-run **trajectory metrics** in the scratchpad: + +- Phase durations, clarification/review cycles +- Recovery: each `critical_issue` linked to a fix commit +- `oracle_strength` (trivial vs full checklist, targeted re-review scope) + +*Paper: §5.2.1 — evaluate the harness, not only final task success.* + +### 4. Failure-type routing in the orchestrator + +Route feedback by signal type: + +| Signal | Action | +|--------|--------| +| `spec_gaps` | Tech Lead (max 1 cycle — existing) | +| Test failure | Engineer fix mode | +| Lint only | Engineer, narrow scope | +| Behavior vs spec | Tech Lead, not blind Engineer patch | +| Security/perf | Reviewer targeted re-review | + +*Paper: §3.4 — plan → execute → verify with feedback-driven control.* + +### 5. Action validation (lightweight harness boundary) + +Pre-flight before Engineer acts: + +- Edits only under `files_to_modify` / `files_to_create` +- No edits on `base_branch` +- Bash allowlist from `CLAUDE.md` (no destructive or secret-leaking commands) + +*Paper: §2.2 — code mediates intent; filter invalid actions before execution.* + +### 6. 
Human gates as durable state + +Scratchpad fields: `human_gates` (`prd_approved`, `spec_approved`, `merge_approved`), `human_resolution` on escalation so later runs do not repeat the same failure. + +*Paper: §5.2.5; **PHILOSOPHY.md** — the gate that stays human.* + +### 7. Cross-task memory (optional, later) + +`.eng_team/learnings.json` for recurring reviewer findings, flaky areas, repo-specific patterns — opt-in, governed. + +*Paper: §3.2 — memory and context engineering.* + +### 8. Harness evolution with regression discipline + +Golden fixture repos + expected scratchpad phases; prompt/checklist changes only with held-out regression tasks and explicit change contracts. + +*Paper: §5.2.3 — self-evolving harnesses without regression.* + +--- + +## What to keep (already strong) + +- Bottom-up, verifiable slice (code → tests → diff review) +- Orchestrator-owned control flow; bounded loops; targeted re-review +- Role/tool separation (Tech Lead no Edit; Reviewer judges diff not intent) +- `/eng-team-context` as environment bootstrapping +- Scratchpad as audit trail + +--- + +## Prioritized backlog + +| Priority | Change | Paper reference | +|----------|--------|-----------------| +| **P0** | Evidence bundle + `untested_regions` on approve | §5.2.2 | +| **P0** | Commit pins + `spec_version` on scratchpad | §4.2, §5.2.4 | +| **P1** | Trajectory / harness metrics in every task JSON | §5.2.1 | +| **P1** | Failure-type routing in orchestrator | §3.4 | +| **P2** | Engineer file-scope + bash policy enforcement | §2.2 | +| **P2** | `acceptance_checks` for `complex` specs | §2.1 | +| **P3** | Cross-task `.eng_team/learnings.json` | §3.2 | +| **P3** | Golden-repo harness regression tests | §5.2.3 | + +--- + +## Bottom line + +eng-team is already a **code-as-harness** system for software engineering. 
The survey’s main push is to evolve from **prompt orchestration that usually works** to **harness engineering**: every approval carries proof, every phase carries versioned assumptions, and harness failures improve the system with regression discipline — without widening scope beyond the PRD → PR slice until trust is earned. + +--- + +## Reference + +- **Paper:** [Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems](https://arxiv.org/abs/2605.18747) +- **Related repo docs:** `PHILOSOPHY.md`, `README.md`, `.claude/commands/eng-team.md` diff --git a/quality-gates-multi-repo.md b/quality-gates-multi-repo.md new file mode 100644 index 0000000..27d3cf9 --- /dev/null +++ b/quality-gates-multi-repo.md @@ -0,0 +1,96 @@ +# Quality Gates in a Multi-Repo / Microservices Environment + +> Extending the single-repo quality philosophy to systems where logic is split across services, shared libraries, and utilities — and where one PRD may touch more than one repo. + +--- + +## Why Single-Repo Gates Are Not Enough + +The quality gates designed for a single repo assume one CLAUDE.md, one test suite, one diff to review. In a microservices environment, the failure modes multiply: + +- An agent has full context on Service A but no idea Service B even exists +- A schema change in a shared library silently breaks three consumers +- Two engineer agents modify overlapping contracts in parallel with no coordination +- The reviewer only sees one diff but the bug lives in the interaction between two services + +Each of these requires a gate that simply doesn't exist in the single-repo model. + +--- + +## New Layer Required: System-Level Context + +Each repo keeps its own `CLAUDE.md` for local conventions. But multi-repo changes require an additional layer — an org-level architecture document that every agent reads before planning. 
+ +This document covers: + +- **Service inventory** — what each service owns, its public API surface, who calls it +- **Dependency graph** — which services depend on which, where contracts live (OpenAPI specs, proto files, shared types) +- **Cross-cutting conventions** — auth patterns, error formats, event schemas — things that must be consistent across all services + +This is not a living wiki maintained by hand. It should be partially auto-generated from actual API specs, import graphs, and event bus subscriptions — so it reflects the real system, not someone's memory of it. + +--- + +## Gate 1: Contract-First Planning + +For any multi-repo PRD, the tech-lead's plan artifact must answer: *which service boundaries does this change cross?* + +Any change that touches an API contract, event schema, or shared type must be declared upfront — before any engineer agent starts writing code. The plan names the contract change explicitly. All downstream service changes are derived from it. + +This enforces the right order of operations: **contracts first, implementations second.** Agents cannot drift into incompatible assumptions if the contract is locked before they start. + +--- + +## Gate 2: Sequenced Parallel Execution + +Multiple engineer agents can work in parallel on separate services — but only after the contract is settled. + +The coordination rule: **no agent touches a service that depends on a contract change until that contract change is finalized.** + +This is a sequencing constraint, not a quality check. Violating it means two agents make independent assumptions about the same interface, and both may be wrong in ways that only surface at integration time. + +--- + +## Gate 3: Cross-Repo Adversarial Reviewer + +The single-repo reviewer reads one diff. In multi-repo, the reviewer must read all diffs together and specifically check: + +- Are all contract changes backward compatible — or is there a coordinated breaking change with a migration plan? 
+- Does Service B's usage of the new API actually match what Service A implemented? +- Is there a service that calls the changed interface that wasn't included in the plan? + +This reviewer requires the dependency graph from the system-level context layer to know which services to check. It cannot discover blast radius from the diffs alone. + +--- + +## Gate 4: Contract Tests as the Quality Floor + +In a single repo, the test suite is the quality floor. In multi-repo, the equivalent is **contract tests** — consumer-driven tests that run against the producer's implementation. + +Every service that publishes an API should have contract tests defined by its consumers. These are the only automated checks that can catch cross-service incompatibilities before integration. Unit tests and linting within each service will not surface interface mismatches. + +--- + +## Revised Implementation Priority (Multi-Repo) + +| Priority | Gate | What it addresses | +|---|---|---| +| 1 | System-level architecture document | Gives agents cross-service context | +| 2 | Per-repo `CLAUDE.md` | Gives agents local conventions | +| 3 | Contract-first plan artifact | Prevents incompatible parallel implementations | +| 4 | Sequenced parallel execution | Enforces contract-before-consumer ordering | +| 5 | Full test suite per repo + contract tests | Sets the automated quality floor | +| 6 | Cross-repo adversarial reviewer | Catches interface mismatches across diffs | +| 7 | Feedback capture loop | Compounds quality improvements over time | + +--- + +## The Unsolved Problem + +The system-level architecture document is only as good as its maintenance discipline. In a fast-moving microservices environment, the dependency graph goes stale quickly. + +The real answer is that this document needs to be auto-generated — derived from actual API specs, import graphs, and event bus subscriptions — not maintained by hand. Until that tooling exists, the document is a useful approximation, not a guarantee. 
Treat it as the best available context, and build agents that flag when they encounter service references not covered by it. + +--- + +*Document authored from eng-team architectural discussion — May 2026.* diff --git a/quality-gates.md b/quality-gates.md new file mode 100644 index 0000000..6ebf697 --- /dev/null +++ b/quality-gates.md @@ -0,0 +1,130 @@ +# Quality Gates for Autonomous Engineering Teams + +> How to ensure eng-team agents always produce high-quality code, stay on scope, and build human trust in AI-generated PRs. + +--- + +## Why AI Code Quality Degrades + +Before picking gates, it helps to name the failure modes: + +- **Too little context** — the agent doesn't know the repo's conventions, so it invents patterns. +- **Too little scope discipline** — the agent over-engineers because nothing stops it. +- **No verification loop** — the agent writes code and hands it off without checking if it actually works or matches intent. +- **No adversarial review** — the same agent that wrote the code also "reviewed" it. + +Each gate in this document targets one or more of these root causes. + +--- + +## Stage 1: Pre-flight (Before a Single Line Is Written) + +The highest-leverage point is *before* implementation starts. The agent must produce a **plan artifact** — a structured document that states: + +- Which files will change +- Rough line count estimate +- Implementation approach +- How the approach maps to each acceptance criterion in the PRD + +This costs almost nothing and surfaces the biggest risks before wasted compute. + +The plan is checked against: + +**Scope reasonableness** — If the plan touches more than a threshold number of files or LOC for a small feature, that's a flag to surface before implementation begins. + +**Repo structure alignment** — Does the plan follow existing module boundaries, naming conventions, and architectural patterns? 
A well-maintained `CLAUDE.md` is the primary mechanism here — treat it as a constitution that all agents must read and cite in their plan. + +**Test-first commitment** — The agent declares what tests it will write before writing any implementation. This forces real thinking about the contract, not just the code. + +--- + +## Stage 2: In-flight Controls (While Implementing) + +**Incremental, reviewable commits** — Rather than one giant diff at the end, each logical chunk (a new function, a schema change, a new component) should be a discrete commit. This makes the diff auditable incrementally and makes it far easier to spot drift. + +**Self-critique step** — After writing each logical unit, the agent reads its own diff and answers: +- Is this the minimum change needed? +- Does it follow the pattern used elsewhere in the codebase? +- Am I introducing anything that wasn't in the PRD? + +Catching drift mid-implementation is dramatically cheaper than catching it at review. + +--- + +## Stage 3: Post-implementation Gates (Before PR Is Opened) + +These are the mechanical, automated checks that form the quality floor. + +### Tests Must Pass +The full existing test suite must pass before a PR is opened. If the agent breaks tests, the PR does not open. This is enforced mechanically, not left to the agent's judgment. + +### Test Coverage on New Code +The agent is required to write tests for its own additions. Coverage thresholds apply to the **diff** — not just overall repo coverage — to catch cases where the agent ships logic with zero tests. + +### Static Analysis and Linting +TypeScript strict mode, ESLint, formatters, and any other repo-configured tools must pass at zero-tolerance. The agent runs and fixes these locally before the PR opens. + +### Diff Size Audit +Compare the size of the PR (files changed, LOC) against the stated complexity of the PRD. 
A one-sentence feature request that produces a 1200-line PR is a signal worth surfacing — it doesn't mean the PR is wrong, but it should trigger human scrutiny before merge. + +### File Blast Radius Check +Which files were modified? If the agent touched a shared utility, a config file, or anything outside the expected module scope, that must be explicitly flagged in the PR description. Unexpected file changes are one of the most common sources of subtle regressions. + +--- + +## Stage 4: The Adversarial Reviewer Agent + +This is the highest-trust gate and the most important one to get right. + +**The agent that writes the code must never be the sole reviewer.** + +A separate agent instance — with fresh context and no attachment to the implementation — reads the PRD and the diff, then answers a structured checklist: + +- Does every acceptance criterion have corresponding code and a test? +- Is there any code that wasn't required by the PRD? +- Are there patterns that diverge from the existing codebase? +- Are there obvious edge cases not handled? +- Is the PR description accurate and complete? + +The output is a **structured review report** attached to the PR. When the human reviewer opens the PR, they see the AI reviewer's assessment alongside the diff — surfacing disagreements, flags, and open questions. This reduces the cognitive load on the human reviewer and focuses their attention where it matters. + +--- + +## Stage 5: Building Trust Over Time + +The gates above catch bad output in the moment. Sustained trust requires a feedback loop. + +**Capture human corrections** — Every time a human reviewer modifies an AI-generated PR, that change should be captured — as an annotated example or a `CLAUDE.md` update. This creates a growing library of "this is what we do here and why," progressively calibrating future agents to the team's standards. 
+ +**Retrospective evals** — Periodically sample merged AI PRs, strip context, and ask a fresh agent: "How would you implement this PRD given this codebase?" If the approach diverges significantly from what was merged, the agents are drifting from what the team actually wants. Use those diffs to improve the `CLAUDE.md` and agent prompts. + +--- + +## Implementation Priority + +Implement in this order for the best return on investment: + +| Priority | Gate | What it addresses | +|---|---|---| +| 1 | `CLAUDE.md` with explicit conventions | Gives agents repo context | +| 2 | Plan artifact + scope check | Catches over-engineering before it happens | +| 3 | Full test suite enforcement | Sets a non-negotiable quality floor | +| 4 | Adversarial reviewer agent | Builds human trust most directly | +| 5 | Diff size + blast radius audit | Catches subtle scope creep | +| 6 | Feedback capture loop | Compounds quality improvements over time | + +--- + +## Summary + +No single gate is sufficient because the failure modes are different at each stage. The combination of: + +- A strong **pre-flight** (scope discipline + plan artifact) +- **Mechanical post-implementation gates** (tests, linting, diff audit) +- An **adversarial reviewer** (independent judgment on correctness and fit) + +...covers the three biggest failure modes. The rest is refinement and iteration as the team builds its feedback corpus. + +--- + +*Document authored from eng-team architectural discussion — May 2026.*