120 changes: 120 additions & 0 deletions agents-multi-repo.md
@@ -0,0 +1,120 @@
# Agents for Multi-Repo Changes

> Documents the new Contract Agent introduced to handle multi-repo PRDs, how each existing agent changes, and how they all integrate into the existing eng-team workflow.

---

## Updated Workflow

```
Orchestrator
└── Tech Lead (multi-repo plan + blast radius analysis)
└── Contract Agent (locks interface changes before anyone implements)
└── Engineer × N (parallel, one per repo, gated on contract finalization)
└── Reviewer (cross-repo, reads all diffs together)
```

The Orchestrator drives the full pipeline. The core sequence — Tech Lead → Engineer → Reviewer — stays the same. Two things change: the Contract Agent is inserted between planning and implementation, and the Engineer step becomes parallel across repos.

---

## Existing Agents — What Changes

### Orchestrator (extended, not replaced)

The Orchestrator already handles sequencing and agent coordination. For multi-repo changes it needs two behavioral additions:

- **Parallel engineer dispatch** — spins up one Engineer agent per repo rather than always one
- **Contract gate enforcement** — Engineer agents that depend on a contract change are blocked from starting until the Contract Agent has finalized and published the contract artifact

No new agent is needed here. This is a logic and configuration extension.
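
The two additions can be sketched with `asyncio`, as a minimal illustration rather than the Orchestrator's actual implementation. The function names (`run_engineer`, `run_contract_agent`, `dispatch`) and the repo names are hypothetical; the only behavior taken from the text is "one Engineer per repo" and "contract-dependent Engineers wait for the published artifact".

```python
import asyncio


async def run_engineer(repo: str, needs_contract: bool,
                       contract_ready: asyncio.Event, log: list[str]) -> None:
    """Hypothetical Engineer stand-in; real work would happen after the gate."""
    if needs_contract:
        # Contract gate enforcement: block until the artifact is published.
        await contract_ready.wait()
    log.append(f"engineer:{repo}")


async def run_contract_agent(contract_ready: asyncio.Event, log: list[str]) -> None:
    """Hypothetical Contract Agent stand-in: publish, then open the gate."""
    await asyncio.sleep(0.01)  # simulated work
    log.append("contract:published")
    contract_ready.set()


async def dispatch(repos: dict[str, bool]) -> list[str]:
    """`repos` maps repo name -> True if that repo depends on the contract change."""
    log: list[str] = []
    contract_ready = asyncio.Event()
    # Parallel engineer dispatch: one coroutine per repo.
    jobs = [run_engineer(name, dep, contract_ready, log) for name, dep in repos.items()]
    if any(repos.values()):
        jobs.append(run_contract_agent(contract_ready, log))
    await asyncio.gather(*jobs)
    return log


order = asyncio.run(dispatch({"billing": True, "docs-site": False}))
print(order)
```

In this sketch the independent `docs-site` Engineer finishes without waiting, while the `billing` Engineer only proceeds once the gate event is set.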

### Tech Lead (extended)

In addition to its existing responsibilities, the Tech Lead must:

- Read the **system-level architecture document** (service inventory, dependency graph, cross-cutting conventions) before writing the plan
- Identify which service boundaries the PRD crosses
- Produce a **cross-repo plan artifact** that maps each acceptance criterion to a specific repo and lists every file expected to change per repo
- Flag any public API surface, event schema, or shared type that will be modified — this is the signal that triggers the Contract Agent

The plan artifact is the Orchestrator's input for deciding whether to invoke the Contract Agent and how many Engineer agents to spin up.
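
One possible shape for the plan artifact is sketched below. The document does not define a schema, so every class and field name here (`RepoPlan`, `CrossRepoPlan`, `flagged_interfaces`, and so on) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RepoPlan:
    """Hypothetical per-repo section of the cross-repo plan artifact."""
    repo: str
    files_to_change: list[str]
    acceptance_criteria: list[str]  # criteria this repo is responsible for
    # Public API surfaces, event schemas, or shared types that will be modified.
    flagged_interfaces: list[str] = field(default_factory=list)


@dataclass
class CrossRepoPlan:
    """Hypothetical plan artifact consumed by the Orchestrator."""
    prd_id: str
    repos: list[RepoPlan]

    def needs_contract_agent(self) -> bool:
        # Any flagged interface is the signal that triggers the Contract Agent.
        return any(r.flagged_interfaces for r in self.repos)

    def engineer_count(self) -> int:
        # One Engineer agent per repo.
        return len(self.repos)

    def unmapped_criteria(self, all_criteria: list[str]) -> list[str]:
        """Acceptance criteria that no repo has been assigned."""
        mapped = {c for r in self.repos for c in r.acceptance_criteria}
        return [c for c in all_criteria if c not in mapped]


plan = CrossRepoPlan(
    prd_id="PRD-example",
    repos=[
        RepoPlan("billing", ["api/invoices.py"], ["AC1"], ["GET /invoices response"]),
        RepoPlan("checkout", ["client/billing.py"], ["AC2"]),
    ],
)
```

The `unmapped_criteria` helper is an extra check, not something the text requires; it simply makes a plan that skips an acceptance criterion easy to detect.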

### Engineer (extended)

Each Engineer agent operates on a single repo, same as before. The changes are:

- Multiple instances run in parallel, one per repo
- Instances that depend on a contract change receive the finalized contract artifact as additional context before starting
- Instances working on independent repos (no shared contract dependency) start immediately in parallel

### Reviewer (extended)

The Reviewer receives all diffs across all repos simultaneously and adds three checks to its existing checklist:

- Does Service B's usage of the new API match what Service A implemented?
- Is there any service that calls the changed interface that was not included in the plan?
- Are all contract changes backward compatible, or is there a coordinated breaking change with an explicit migration plan?

The Reviewer requires the dependency graph from the system-level architecture document to know which services to check — it cannot discover blast radius from diffs alone.
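
The second check (a consumer missing from the plan) reduces to a set difference once the dependency graph is available. The sketch below assumes the graph is a mapping from each service to the services it calls; that representation and the service names are hypothetical.

```python
def unplanned_consumers(dependency_graph: dict[str, set[str]],
                        changed_services: set[str],
                        planned_services: set[str]) -> set[str]:
    """Return services that call a changed service but are absent from the plan.

    `dependency_graph` maps a service to the set of services it calls
    (an assumed representation of the architecture document's graph).
    """
    consumers = {
        service
        for service, dependencies in dependency_graph.items()
        if dependencies & changed_services
    }
    return consumers - planned_services


graph = {
    "checkout": {"billing"},
    "reports": {"billing"},
    "billing": set(),
}
missing = unplanned_consumers(graph, changed_services={"billing"},
                              planned_services={"billing", "checkout"})
print(missing)
```

Here `reports` calls the changed `billing` service but is not in the plan, so it is the service the Reviewer would need to flag.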

---

## New Agent: Contract Agent

### When it is invoked

The Orchestrator invokes the Contract Agent when the Tech Lead's plan flags a change to a public API surface, event schema, or shared type or library interface. This applies in two cases:

- The PRD touches more than one repo and at least one of them depends on a shared contract change
- The PRD touches a single repo but changes an interface that other services consume

The Contract Agent is skipped entirely for purely internal single-repo changes (business logic, UI, infra config with no public interface change) and for multi-repo changes across independent services with no shared contract.

### What it does

The Contract Agent owns the interface boundary between services. Its job is to produce a **contract artifact** — a precise, versioned definition of what is changing at the service boundary — and lock it before any Engineer agent starts implementing.

Specifically it:

1. Reads the Tech Lead's plan artifact and the current interface definitions (OpenAPI specs, proto files, shared types, event schemas) for all affected services
2. Produces a diff of the contract change — what is being added, modified, or removed at the interface boundary
3. Checks backward compatibility — flags breaking changes and requires an explicit migration or versioning plan if any exist
4. Publishes the finalized contract artifact so downstream Engineer agents can use it as a source of truth
5. Blocks the Orchestrator from starting any dependent Engineer agent until the artifact is published
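
Steps 2 to 4 can be sketched under a deliberately simplified model in which an interface is a mapping from field name to a type string. Real OpenAPI or proto diffing is considerably richer; the names `ContractDiff`, `diff_contract`, and `finalize`, and the rule "removals and type changes are breaking, additions are not", are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContractDiff:
    added: dict[str, str]
    removed: dict[str, str]
    modified: dict[str, tuple[str, str]]  # field -> (old type, new type)

    @property
    def breaking(self) -> bool:
        # Simplifying assumption: additions are compatible, anything else is not.
        return bool(self.removed or self.modified)


def diff_contract(old: dict[str, str], new: dict[str, str]) -> ContractDiff:
    """Step 2: what is added, modified, or removed at the boundary."""
    added = {k: v for k, v in new.items() if k not in old}
    removed = {k: v for k, v in old.items() if k not in new}
    modified = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return ContractDiff(added, removed, modified)


def finalize(diff: ContractDiff, migration_plan: Optional[str] = None) -> dict:
    """Steps 3 and 4: refuse to publish a breaking change without a plan."""
    if diff.breaking and not migration_plan:
        raise ValueError("breaking change requires a migration or versioning plan")
    return {
        "diff": diff,
        "compatibility": "breaking" if diff.breaking else "compatible",
        "migration_plan": migration_plan,
    }


old_schema = {"invoice_id": "string", "amount": "number"}
new_schema = {"invoice_id": "string", "amount_cents": "integer"}
contract_diff = diff_contract(old_schema, new_schema)
```

Renaming `amount` to `amount_cents` shows up as one removal plus one addition, so `finalize(contract_diff)` raises until a migration plan is supplied.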

### What it does not do

- It does not write implementation code
- It does not modify business logic
- It does not review the final diffs — that is the Reviewer's job

### Output

A contract artifact containing:
- The precise interface change (structured diff of the API surface)
- Backward compatibility assessment (compatible / breaking + migration plan)
- A list of all services that consume the changed interface, derived from the dependency graph

---

## Sequencing Rules

| Condition | Contract Agent | Engineer agents |
|---|---|---|
| Single-repo, no public interface change | Skipped | One agent, starts immediately |
| Single-repo, public interface change | Invoked | Starts after contract is finalized |
| Multi-repo, independent services (no shared contract) | Skipped | All agents start in parallel immediately |
| Multi-repo, shared contract change | Invoked | Contract-dependent agents wait; independent agents start immediately |
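
The four rows collapse to one rule once each repo is tagged with whether it depends on a contract change. The sketch below is an illustration of the table, not the Orchestrator's code; `Decision` and `sequence` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    invoke_contract_agent: bool
    start_immediately: tuple[str, ...]
    wait_for_contract: tuple[str, ...]


def sequence(repo_needs_contract: dict[str, bool]) -> Decision:
    """Map repo -> 'depends on a contract change' onto the sequencing table."""
    waiting = tuple(r for r, dep in repo_needs_contract.items() if dep)
    immediate = tuple(r for r, dep in repo_needs_contract.items() if not dep)
    # The Contract Agent runs only if at least one repo is contract-dependent.
    return Decision(bool(waiting), immediate, waiting)


single_internal = sequence({"web": False})
single_public = sequence({"api": True})
multi_independent = sequence({"web": False, "docs": False})
multi_shared = sequence({"api": True, "client": True, "docs": False})
```

Each of the four example calls corresponds to one row of the table, in order.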

---

## Summary

One new agent is introduced: the **Contract Agent**. It fills a gap in the current workflow, which has no role responsible for locking interface changes before implementation begins. Without it, parallel Engineer agents make independent assumptions about the same interface, and mismatches only surface at review or integration.

Everything else — Orchestrator, Tech Lead, Engineer, Reviewer — retains its existing role and gains scoped extensions to handle multi-repo context and parallel execution.

---

*Document authored from eng-team architectural discussion — May 2026.*
141 changes: 141 additions & 0 deletions docs/code-as-agent-harness.md
@@ -0,0 +1,141 @@
# Code as Agent Harness — Implications for eng-team

Summary of how [Code as Agent Harness](https://arxiv.org/abs/2605.18747) (Ning et al., 2026) relates to **eng-team**, and a prioritized backlog for strengthening the harness.

---

## Article in one paragraph

The survey argues that in agentic systems, **code is not only output** — it is the **operational harness**: the executable substrate for reasoning, acting, environment modeling, and verification. A good harness makes behavior **executable, inspectable, stateful, and verifiable** over long horizons. Progress depends as much on harness engineering (tools, memory, oracles, control loops, multi-agent shared state) as on the base model.

---

## How eng-team already fits

eng-team is a **code-centric agent harness** for the slice PRD → spec → implementation → review → merge-ready PR:

| Paper layer | eng-team today |
|-------------|----------------|
| **Harness interface** | `CLAUDE.md`, `technical_spec`, Engineer edits, Reviewer `git diff` |
| **Harness mechanisms** | Orchestrator phases, bounded loops, `repo_context`, test/lint gates |
| **Multi-agent over code** | Tech Lead → Engineer → Reviewer via `.eng_team/task_*.json` (orchestrator-only; no peer chat) |
| **Verifiable closure** | Tests + linter + structured review checklist |

This aligns with **PHILOSOPHY.md**: bottom-up trust, narrow insertion point, diff-based review (output over intent).

The article does **not** suggest replacing this design. It names what to harden next: **oracle quality**, **shared state discipline**, **harness telemetry**, and **governed iteration**.

---

## Key upgrades (article → eng-team)

### 1. Scratchpad as program state

Extend `.eng_team/task_*.json` beyond narrative logging:

- `verification_evidence` (tests run, linter result, diff stats)
- `assumptions[]` with `verified_by` (test / diff hunk / reviewer item)
- Per-phase `read_set` / `write_set`
- Commit pins: `base_commit`, `spec_version`, `impl_commit`

*Paper: §2.3, §4.2, §5.2.4 — transactional shared program state.*
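
A hypothetical extended task file might look like the sketch below. Only the field names quoted in the list come from this document; the nesting, the `phases` key, and all values are placeholders invented for illustration.

```python
import json

# Hypothetical extended scratchpad. Values are placeholders, not real data.
task_state = {
    "base_commit": "<base-sha>",
    "spec_version": 2,
    "impl_commit": "<impl-sha>",
    "phases": {
        "tech_lead": {"read_set": ["PRD.md", "CLAUDE.md"], "write_set": ["technical_spec"]},
        "engineer": {"read_set": ["technical_spec"], "write_set": ["src/app.py"]},
    },
    "verification_evidence": {
        "tests_run": 42,
        "linter": "pass",
        "diff_stats": {"files": 1, "insertions": 30, "deletions": 4},
    },
    "assumptions": [
        {"text": "retry is idempotent",
         "verified_by": {"kind": "test", "ref": "tests/test_app.py::test_retry"}},
        {"text": "callers tolerate a 2s timeout", "verified_by": None},
    ],
}


def unverified_assumptions(state: dict) -> list[str]:
    """Assumptions that nothing (test, diff hunk, reviewer item) has confirmed."""
    return [a["text"] for a in state["assumptions"] if not a.get("verified_by")]


# The state stays plain JSON so it can live in .eng_team/task_*.json.
serialized = json.dumps(task_state, indent=2)
open_items = unverified_assumptions(json.loads(serialized))
```

Keeping `verified_by` as an explicit null rather than omitting it makes unverified assumptions queryable instead of silently absent.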

### 2. Verification stack (not only “tests passed”)

On approve, require an **evidence bundle** and explicit limits:

- What was checked (unit / integration / security hints / coverage on touched files)
- `untested_regions[]` — what the oracle does **not** prove
- For `complex` tasks: runnable `acceptance_checks` or test skeletons in the spec

*Paper: §5.2.1–5.2.2 — oracle adequacy and semantic verification beyond executable feedback.*
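
An approve-time gate for the evidence bundle could be as small as the sketch below. The key names `checks` and `complexity` and the function `approval_blockers` are assumptions; `untested_regions` and `acceptance_checks` are the fields named above.

```python
def approval_blockers(task: dict) -> list[str]:
    """Return reasons an approve must be rejected; an empty list means OK."""
    problems = []
    evidence = task.get("verification_evidence") or {}
    if not evidence.get("checks"):
        problems.append("no checks recorded in the evidence bundle")
    # The limits must be stated explicitly, even when the list is empty.
    if "untested_regions" not in evidence:
        problems.append("untested_regions is missing")
    if task.get("complexity") == "complex" and not task.get("acceptance_checks"):
        problems.append("complex task has no acceptance_checks")
    return problems


good_task = {
    "complexity": "complex",
    "acceptance_checks": ["pytest tests/test_checkout.py"],
    "verification_evidence": {
        "checks": ["unit", "integration"],
        "untested_regions": ["payment provider sandbox"],
    },
}
bad_task = {"complexity": "complex", "verification_evidence": {"checks": ["unit"]}}
```

Requiring the `untested_regions` key to be present, rather than treating absence as "nothing untested", is the design choice that makes the oracle's limits explicit.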

### 3. Harness-level evaluation

Log per-run **trajectory metrics** in the scratchpad:

- Phase durations, clarification/review cycles
- Recovery: each `critical_issue` linked to a fix commit
- `oracle_strength` (trivial vs full checklist, targeted re-review scope)

*Paper: §5.2.1 — evaluate the harness, not only final task success.*

### 4. Failure-type routing in the orchestrator

Route feedback by signal type:

| Signal | Action |
|--------|--------|
| `spec_gaps` | Tech Lead (max 1 cycle — existing) |
| Test failure | Engineer fix mode |
| Lint only | Engineer, narrow scope |
| Behavior vs spec | Tech Lead, not blind Engineer patch |
| Security/perf | Reviewer targeted re-review |

*Paper: §3.4 — plan → execute → verify with feedback-driven control.*
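
The routing table translates directly into a lookup. In this sketch the enum members, the action labels, and the escalation to a human once the single `spec_gaps` cycle is used up are assumptions; the document only states the targets and the one-cycle limit.

```python
from enum import Enum


class Signal(Enum):
    SPEC_GAPS = "spec_gaps"
    TEST_FAILURE = "test_failure"
    LINT_ONLY = "lint_only"
    BEHAVIOR_VS_SPEC = "behavior_vs_spec"
    SECURITY_PERF = "security_perf"


# (target agent, mode) per signal; the mode labels are hypothetical.
ROUTES = {
    Signal.SPEC_GAPS: ("tech_lead", "clarify_spec"),
    Signal.TEST_FAILURE: ("engineer", "fix"),
    Signal.LINT_ONLY: ("engineer", "fix_narrow"),
    Signal.BEHAVIOR_VS_SPEC: ("tech_lead", "revise_spec"),
    Signal.SECURITY_PERF: ("reviewer", "targeted_re_review"),
}

MAX_SPEC_GAP_CYCLES = 1


def route(signal: Signal, spec_gap_cycles_used: int = 0) -> tuple[str, str]:
    """Pick the next agent for a feedback signal."""
    if signal is Signal.SPEC_GAPS and spec_gap_cycles_used >= MAX_SPEC_GAP_CYCLES:
        # Assumption: once the bounded loop is exhausted, a human decides.
        return ("human", "escalate")
    return ROUTES[signal]
```

For example, `route(Signal.BEHAVIOR_VS_SPEC)` goes to the Tech Lead rather than triggering another Engineer patch.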

### 5. Action validation (lightweight harness boundary)

Pre-flight before Engineer acts:

- Edits only under `files_to_modify` / `files_to_create`
- No edits on `base_branch`
- Bash allowlist from `CLAUDE.md` (no destructive or secret-leaking commands)

*Paper: §2.2 — code mediates intent; filter invalid actions before execution.*
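
A pre-flight check along these lines might look as follows. The action shape (`kind`, `path`, `command`), the function name `preflight`, and the metacharacter filter are assumptions; `files_to_modify`, `files_to_create`, and `base_branch` are the fields named above, treated here as exact file lists.

```python
import shlex
from pathlib import PurePosixPath

SHELL_METACHARACTERS = set(";|&$`<>")


def preflight(action: dict, spec: dict, current_branch: str,
              bash_allowlist: set[str]) -> list[str]:
    """Return violations for a proposed Engineer action; empty means allowed."""
    violations = []
    if action["kind"] == "edit":
        if current_branch == spec["base_branch"]:
            violations.append("edits on base_branch are not allowed")
        allowed = set(spec["files_to_modify"]) | set(spec["files_to_create"])
        path = PurePosixPath(action["path"])
        # Reject parent-directory escapes and anything outside the declared scope.
        if ".." in path.parts or path.as_posix() not in allowed:
            violations.append(f"path outside declared scope: {action['path']}")
    elif action["kind"] == "bash":
        command = action["command"]
        if SHELL_METACHARACTERS & set(command):
            violations.append("shell metacharacters are not allowed")
        else:
            argv = shlex.split(command)
            if not argv or argv[0] not in bash_allowlist:
                violations.append(f"command not in allowlist: {command}")
    return violations


spec = {
    "base_branch": "main",
    "files_to_modify": ["src/app.py"],
    "files_to_create": ["tests/test_app.py"],
}
allowlist = {"pytest", "git"}
```

Checking only the first token of a command is a weak policy (an allowlisted `git` still permits destructive subcommands), so a real allowlist would likely need per-command argument rules.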

### 6. Human gates as durable state

Scratchpad fields: `human_gates` (`prd_approved`, `spec_approved`, `merge_approved`), `human_resolution` on escalation so later runs do not repeat the same failure.

*Paper: §5.2.5; **PHILOSOPHY.md** — the gate that stays human.*

### 7. Cross-task memory (optional, later)

`.eng_team/learnings.json` for recurring reviewer findings, flaky areas, repo-specific patterns — opt-in, governed.

*Paper: §3.2 — memory and context engineering.*

### 8. Harness evolution with regression discipline

Golden fixture repos + expected scratchpad phases; prompt/checklist changes only with held-out regression tasks and explicit change contracts.

*Paper: §5.2.3 — self-evolving harnesses without regression.*

---

## What to keep (already strong)

- Bottom-up, verifiable slice (code → tests → diff review)
- Orchestrator-owned control flow; bounded loops; targeted re-review
- Role/tool separation (Tech Lead no Edit; Reviewer judges diff not intent)
- `/eng-team-context` as environment bootstrapping
- Scratchpad as audit trail

---

## Prioritized backlog

| Priority | Change | Paper reference |
|----------|--------|-----------------|
| **P0** | Evidence bundle + `untested_regions` on approve | §5.2.2 |
| **P0** | Commit pins + `spec_version` on scratchpad | §4.2, §5.2.4 |
| **P1** | Trajectory / harness metrics in every task JSON | §5.2.1 |
| **P1** | Failure-type routing in orchestrator | §3.4 |
| **P2** | Engineer file-scope + bash policy enforcement | §2.2 |
| **P2** | `acceptance_checks` for `complex` specs | §2.1 |
| **P3** | Cross-task `.eng_team/learnings.json` | §3.2 |
| **P3** | Golden-repo harness regression tests | §5.2.3 |

---

## Bottom line

eng-team is already a **code-as-harness** system for software engineering. The survey’s main push is to evolve from **prompt orchestration that usually works** to **harness engineering**: every approval carries proof, every phase carries versioned assumptions, and harness failures improve the system with regression discipline — without widening scope beyond the PRD → PR slice until trust is earned.

---

## Reference

- **Paper:** [Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems](https://arxiv.org/abs/2605.18747)
- **Related repo docs:** `PHILOSOPHY.md`, `README.md`, `.claude/commands/eng-team.md`
96 changes: 96 additions & 0 deletions quality-gates-multi-repo.md
@@ -0,0 +1,96 @@
# Quality Gates in a Multi-Repo / Microservices Environment

> Extending the single-repo quality philosophy to systems where logic is split across services, shared libraries, and utilities — and where one PRD may touch more than one repo.

---

## Why Single-Repo Gates Are Not Enough

The quality gates designed for a single repo assume one `CLAUDE.md`, one test suite, one diff to review. In a microservices environment, the failure modes multiply:

- An agent has full context on Service A but no idea Service B even exists
- A schema change in a shared library silently breaks three consumers
- Two engineer agents modify overlapping contracts in parallel with no coordination
- The reviewer only sees one diff but the bug lives in the interaction between two services

Each of these requires a gate that simply doesn't exist in the single-repo model.

---

## New Layer Required: System-Level Context

Each repo keeps its own `CLAUDE.md` for local conventions. But multi-repo changes require an additional layer — an org-level architecture document that every agent reads before planning.

This document covers:

- **Service inventory** — what each service owns, its public API surface, who calls it
- **Dependency graph** — which services depend on which, where contracts live (OpenAPI specs, proto files, shared types)
- **Cross-cutting conventions** — auth patterns, error formats, event schemas — things that must be consistent across all services

This is not a living wiki maintained by hand. It should be partially auto-generated from actual API specs, import graphs, and event bus subscriptions — so it reflects the real system, not someone's memory of it.

---

## Gate 1: Contract-First Planning

For any multi-repo PRD, the tech-lead's plan artifact must answer: *which service boundaries does this change cross?*

Any change that touches an API contract, event schema, or shared type must be declared upfront — before any engineer agent starts writing code. The plan names the contract change explicitly. All downstream service changes are derived from it.

This enforces the right order of operations: **contracts first, implementations second.** Agents cannot drift into incompatible assumptions if the contract is locked before they start.

---

## Gate 2: Sequenced Parallel Execution

Multiple engineer agents can work in parallel on separate services — but only after the contract is settled.

The coordination rule: **no agent touches a service that depends on a contract change until that contract change is finalized.**

This is a sequencing constraint, not a quality check. Violating it means two agents make independent assumptions about the same interface, and both may be wrong in ways that only surface at integration time.

---

## Gate 3: Cross-Repo Adversarial Reviewer

The single-repo reviewer reads one diff. In multi-repo, the reviewer must read all diffs together and specifically check:

- Are all contract changes backward compatible — or is there a coordinated breaking change with a migration plan?
- Does Service B's usage of the new API actually match what Service A implemented?
- Is there a service that calls the changed interface that wasn't included in the plan?

This reviewer requires the dependency graph from the system-level context layer to know which services to check. It cannot discover blast radius from the diffs alone.

---

## Gate 4: Contract Tests as the Quality Floor

In a single repo, the test suite is the quality floor. In multi-repo, the equivalent is **contract tests** — consumer-driven tests that run against the producer's implementation.

Every service that publishes an API should have contract tests defined by its consumers. These are the only automated checks that can catch cross-service incompatibilities before integration. Unit tests and linting within each service will not surface interface mismatches.
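
The idea can be shown without any framework: the consumer declares what it needs, and the check runs against the producer's handler. Everything in this sketch (the `checkout` and `billing` services, the expectation format, `verify_contract`) is hypothetical; dedicated contract-testing tools cover far more than field presence and type.

```python
from typing import Callable

# Hypothetical consumer-declared expectation: what checkout needs from billing.
CHECKOUT_EXPECTS = {
    "consumer": "checkout",
    "request": {"invoice_id": "inv_1"},
    "response_fields": {"invoice_id": str, "amount_cents": int, "currency": str},
}


def billing_get_invoice(request: dict) -> dict:
    """Stand-in for the producer's real handler."""
    return {"invoice_id": request["invoice_id"], "amount_cents": 1250, "currency": "EUR"}


def billing_get_invoice_drifted(request: dict) -> dict:
    """A producer that changed its response shape without telling consumers."""
    return {"invoice_id": request["invoice_id"], "amount": 12.5}


def verify_contract(handler: Callable[[dict], dict], expectation: dict) -> list[str]:
    """Run the producer against one consumer expectation; return mismatches."""
    response = handler(expectation["request"])
    failures = []
    for name, expected_type in expectation["response_fields"].items():
        if name not in response:
            failures.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            failures.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(response[name]).__name__}"
            )
    return failures
```

The drifted producer would pass its own unit tests, yet `verify_contract` reports the two fields the consumer can no longer find, which is the class of failure this gate exists to catch.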

---

## Revised Implementation Priority (Multi-Repo)

| Priority | Gate | What it addresses |
|---|---|---|
| 1 | System-level architecture document | Gives agents cross-service context |
| 2 | Per-repo `CLAUDE.md` | Gives agents local conventions |
| 3 | Contract-first plan artifact | Prevents incompatible parallel implementations |
| 4 | Sequenced parallel execution | Enforces contract-before-consumer ordering |
| 5 | Full test suite per repo + contract tests | Sets the automated quality floor |
| 6 | Cross-repo adversarial reviewer | Catches interface mismatches across diffs |
| 7 | Feedback capture loop | Compounds quality improvements over time |

---

## The Unsolved Problem

The system-level architecture document is only as good as its maintenance discipline. In a fast-moving microservices environment, the dependency graph goes stale quickly.

The real answer is that this document needs to be auto-generated — derived from actual API specs, import graphs, and event bus subscriptions — not maintained by hand. Until that tooling exists, the document is a useful approximation, not a guarantee. Treat it as the best available context, and build agents that flag when they encounter service references not covered by it.
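
The flagging behavior could start as a simple scan. This sketch assumes service references appear as `http(s)://<service>.internal` hostnames, which is a made-up convention for illustration; the function name and the inventory contents are likewise hypothetical.

```python
import re

# Assumed convention: internal services are addressed as <name>.internal.
SERVICE_URL = re.compile(r"https?://([a-z0-9-]+)\.internal\b")


def uncovered_services(source_text: str, known_services: set[str]) -> set[str]:
    """Service names referenced in code but absent from the architecture document."""
    referenced = set(SERVICE_URL.findall(source_text))
    return referenced - known_services


inventory = {"billing", "checkout"}
snippet = '''
invoice = get("http://billing.internal/v1/invoices/1")
score = get("https://fraud-score.internal/v2/check")
'''
flagged = uncovered_services(snippet, inventory)
print(flagged)
```

A non-empty result does not prove the document is wrong, only that the agent has seen a reference it cannot place, which is the signal worth surfacing.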

---

*Document authored from eng-team architectural discussion — May 2026.*