From 64835c26241d15f687010a413a4c8fca14c044ba Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Mon, 18 May 2026 16:03:08 +0530 Subject: [PATCH 1/8] perf: reduce redundant file reads in eng-team workflow - tech-lead extracts CLAUDE.md into repo_context scratchpad key (test/lint commands, branch convention, style notes) so downstream agents skip re-reading it - tech-lead extracts style_conventions from touched files into the spec so engineer skips adjacent-file reads during implementation - tech-lead issues all Grep/Glob lookups in one parallel batch before reading files Co-Authored-By: Claude Sonnet 4.6 --- .claude/agents/engineer.md | 16 +++++++++++----- .claude/agents/tech-lead.md | 35 ++++++++++++++++++++++++++++++----- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/.claude/agents/engineer.md b/.claude/agents/engineer.md index 66c5701..231fed6 100644 --- a/.claude/agents/engineer.md +++ b/.claude/agents/engineer.md @@ -27,8 +27,14 @@ Read the scratchpad. Study `technical_spec` carefully: - `approach` — implementation strategy to follow - `test_approach` — what to test and how - `acceptance_criteria` — your definition of done +- `style_conventions` — style rules to follow when writing or modifying code -Also read `CLAUDE.md` for code style conventions and how to run tests. +Also read `repo_context` from the scratchpad top level (written by the Tech Lead): +- `test_command` — use this to run the test suite in Steps 4 and 5 +- `lint_command` — use this to run the linter in Step 5 +- `style_notes` — supplementary style rules for the whole repo + +Do **not** re-read `CLAUDE.md` — everything you need is already in the scratchpad. ### Step 2 — Set up the branch ```bash @@ -44,10 +50,10 @@ git checkout -b 2>/dev/null || git checkout Work through `files_to_create` and `files_to_modify` in order. After each logical unit of work, run the relevant tests — don't save all testing for the end. **For each file you create or modify:** -1. 
Read any adjacent files first to match style exactly (indentation, import order, naming) +1. Read the file you are editing (or the closest existing file in the same module if creating a new one) to confirm local style. Use `technical_spec.style_conventions` and `repo_context.style_notes` for guidance — do **not** read surrounding files just for style reference. 2. Implement the change 3. Write or extend the corresponding test file immediately -4. Run the tests for that file: check the test command in CLAUDE.md +4. Run the tests for that file using `repo_context.test_command` **Test writing rules:** - Follow the existing test file structure and naming convention exactly @@ -56,10 +62,10 @@ Work through `files_to_create` and `files_to_modify` in order. After each logica - Do not write tests that always pass — they must actually assert the behaviour ### Step 4 — Run the full test suite -Once all changes are done, run the full test suite (command in CLAUDE.md). Fix any failures before committing. If a pre-existing test breaks, investigate — do not delete it. +Once all changes are done, run the full test suite using `repo_context.test_command` from the scratchpad. Fix any failures before committing. If a pre-existing test breaks, investigate — do not delete it. ### Step 5 — Run the linter -Run the project linter (command in CLAUDE.md or check `package.json` scripts / `pyproject.toml`). Fix all errors. Warnings are acceptable if they pre-existed. +Run the linter using `repo_context.lint_command` from the scratchpad. Fix all errors. Warnings are acceptable if they pre-existed. 
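As a rough illustration of Steps 4 and 5 after this change, the sketch below reads `repo_context` from the scratchpad and runs the two commands it names. The scratchpad path and the helper names (`load_repo_context`, `run_command`, `run_checks`) are assumptions made for this example; the patch does not say how the scratchpad is stored or invoked.

```python
import json
import shlex
import subprocess
from pathlib import Path


def load_repo_context(scratchpad_path):
    """Return the top-level `repo_context` object from the scratchpad JSON."""
    scratchpad = json.loads(Path(scratchpad_path).read_text(encoding="utf-8"))
    if "repo_context" not in scratchpad:
        # The Tech Lead is expected to have written this key before the Engineer runs.
        raise KeyError("scratchpad has no top-level 'repo_context'")
    return scratchpad["repo_context"]


def run_command(command):
    """Run one command string without a shell and return its exit code."""
    return subprocess.run(shlex.split(command), check=False).returncode


def run_checks(scratchpad_path):
    """Run the test suite, then the linter; return exit codes keyed by field name."""
    ctx = load_repo_context(scratchpad_path)
    return {key: run_command(ctx[key]) for key in ("test_command", "lint_command")}
```

A non-zero exit code for either key would correspond to the "fix before committing" cases described in Steps 4 and 5.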
### Step 6 — Commit Stage only the files you intentionally changed: diff --git a/.claude/agents/tech-lead.md b/.claude/agents/tech-lead.md index b3e6d7b..5c27574 100644 --- a/.claude/agents/tech-lead.md +++ b/.claude/agents/tech-lead.md @@ -16,15 +16,27 @@ If your instructions contain a `Spec gaps to clarify:` section, you are in **cla ## Process -### Step 1 — Read CLAUDE.md +### Step 1 — Read CLAUDE.md and extract repo context Read `CLAUDE.md` in the repo root first. It contains the architecture overview, key directories, conventions, and test commands. This is your primary knowledge source — do not re-explore things it already documents. +Extract the following into a `repo_context` object (you will write this to the scratchpad — downstream agents read it from there instead of re-reading CLAUDE.md): +- `test_command` — the exact command to run the full test suite +- `lint_command` — the exact command to run the linter +- `branch_convention` — the branch naming pattern (e.g. `feat/`, `fix/`) +- `key_directories` — list of important directories from the architecture overview +- `style_notes` — any explicit style rules (indentation, import order, naming, framework conventions) + ### Step 2 — Targeted exploration only -Based on the PRD and CLAUDE.md, identify what you still need to understand. Do NOT scan the whole repo — only look at what is directly relevant: +Based on the PRD and CLAUDE.md, identify what you still need to understand. Do NOT scan the whole repo — only look at what is directly relevant. + +**Issue all Grep and Glob calls in a single parallel batch before reading any files.** This means: form the complete list of symbols, routes, or modules you need to locate, then call Grep/Glob for all of them at once. Only after all results come back, issue Read calls for the files you actually need. 
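The batching rule above can be pictured as two phases: locate everything first, then read only the files that matched. The sketch below imitates that ordering in plain Python. `find_files_mentioning` and the thread pool are hypothetical stand-ins for the agent's Grep/Glob tool calls, not the tools themselves.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_files_mentioning(root, symbol):
    """Stand-in for one Grep call: paths under `root` whose text contains `symbol`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if symbol in text:
                hits.append(path)
    return hits


def locate_then_read(root, symbols):
    """Phase 1: run every lookup concurrently. Phase 2: read each matched file once."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: find_files_mentioning(root, s), symbols))
    # Deduplicate so a file matched by several symbols is read a single time.
    to_read = sorted({path for hits in results for path in hits})
    return {path: path.read_text(encoding="utf-8") for path in to_read}
```

The point of the ordering is that the read list is only built once all lookups have returned, so no file is opened speculatively or twice.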
+ - Use `Grep` to find the specific routes, handlers, or modules the PRD touches - Use `Read` to read those files (not the whole directory) - Use `Glob` only to confirm file locations if CLAUDE.md is unclear +While reading the files the PRD touches, extract style conventions for the relevant module: indentation style, import ordering, naming patterns, any framework-specific idioms. You will write these as `style_conventions` in the spec so the Engineer can match style without re-reading adjacent files. + ### Step 3 — Decide the branch name Choose a branch name following the repo's convention (check `git branch -a` output or CLAUDE.md). Format: `type/short-description` e.g. `feat/cart-rate-limiter`, `fix/order-null-pointer`. @@ -39,7 +51,19 @@ Before writing the spec, classify the change: Write a focused Technical Spec into the scratchpad. Keep it concrete — file paths, function names, config keys. The Engineer must be able to implement without guessing. ## Output -Update the scratchpad JSON with this object under the key `"technical_spec"`: +Update the scratchpad JSON with the `repo_context` object at the **top level** (alongside `technical_spec`, not nested inside it): + +```json +"repo_context": { + "test_command": "npm test", + "lint_command": "npm run lint", + "branch_convention": "feat/ or fix/ prefix", + "key_directories": ["src/routes", "src/services", "tests/"], + "style_notes": "2-space indent, single quotes, named exports, Jest for tests" +} +``` + +Then update the scratchpad JSON with this object under the key `"technical_spec"`: ```json { @@ -71,7 +95,8 @@ Update the scratchpad JSON with this object under the key `"technical_spec"`: "Describe the observable outcome that confirms the feature works", "Describe the failure/edge case that must also be handled" ], - "out_of_scope": ["List anything the PRD implies but this spec intentionally defers"] + "out_of_scope": ["List anything the PRD implies but this spec intentionally defers"], + "style_conventions": 
"Concise prose: indentation, import ordering, naming conventions, and any framework idioms observed in the files this spec touches. The Engineer uses this to match style without reading adjacent files." } ``` @@ -85,7 +110,7 @@ When invoked with `Spec gaps to clarify:`, read the current spec from the scratc 2. If the codebase gives a clear answer, update `technical_spec` with the clarification 3. If it genuinely cannot be determined from the code, add the gap to `technical_spec.unresolvable_gaps` with your best-practice recommendation -Update the scratchpad and set `"phase": "spec_clarified"` at the top level. +Update the scratchpad and set `"phase": "spec_clarified"` at the top level. Do not overwrite `repo_context` — it was written in the initial run and is still valid. ## Rules - Be specific. Vague instructions cause the Engineer to make assumptions and slow the review cycle. From 02fe5ca155890b3cbfcc7daebb4643afb0e23511 Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Tue, 19 May 2026 13:57:51 +0530 Subject: [PATCH 2/8] Philosophy docs added --- EXECUTIVE_SUMMARY.md | 29 ++++++++++++++++++ PHILOSOPHY.human.md | 73 ++++++++++++++++++++++++++++++++++++++++++++ PHILOSOPHY.md | 73 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 175 insertions(+) create mode 100644 EXECUTIVE_SUMMARY.md create mode 100644 PHILOSOPHY.human.md create mode 100644 PHILOSOPHY.md diff --git a/EXECUTIVE_SUMMARY.md b/EXECUTIVE_SUMMARY.md new file mode 100644 index 0000000..a309c01 --- /dev/null +++ b/EXECUTIVE_SUMMARY.md @@ -0,0 +1,29 @@ +# Executive Summary + +## Most AI in engineering is still just faster autocomplete + +Teams using AI today are mostly accelerating individual developers — a prompt here, a code suggestion there, a human accepting or rejecting each output. That's useful, but the ceiling is low. A human is still in the loop for every decision. The real opportunity is autonomous delivery: hand the system a requirement, get back working, tested, reviewed code. 
+
+## We start where AI can actually be verified
+
+Engineering work sits on a spectrum. Writing code, running tests, reviewing a diff for correctness — these are close to computer systems. Either the tests pass or they don't. Engineers already have the tools to catch when something is wrong. Product decisions, design judgment, business tradeoffs — these require context and experience no system can verify yet. Starting there is the wrong bet.
+
+`eng-team` starts at the verifiable end: the path from a written requirement to a pull request. A tech lead agent turns the requirement into a technical spec. An engineer agent implements and tests it. A reviewer agent checks the output the way a senior engineer would. The team doesn't change how it works. Product still owns the roadmap. Engineers still decide what merges. The only thing that changes is the coding and review cycle runs autonomously.
+
+## Trust is earned, not declared
+
+You don't ask a team to trust a system they've never seen work. A team that watches AI reliably ship correct, reviewed, tested code for a few months has a fundamentally different relationship with the technology than one that's only read about it. That track record is what earns the right to automate the next layer.
+
+## The system grows upward as that trust accumulates
+
+Right now the system takes a human-written requirement as its input. The next layer is an agent that converts a rough business goal into a structured, unambiguous spec before any engineering starts — eliminating most of the rework that comes from building against the wrong interpretation of a requirement.
+
+Above that: QA, deployment, production monitoring, eventually agents that assist product and design decisions. Each new layer generates cleaner inputs for the layer below, pushing ambiguity further upstream — until it reaches the one place it belongs, the human who knows what the business actually needs.
+
+## One gate always stays human
+
+Before engineering work begins, a person confirms the system understood the goal correctly. Not to rewrite anything — just to verify intent was captured right. That's a business checkpoint, not a technical one, and it belongs to a person regardless of how far the system grows.
+
+## This is not about replacing engineers
+
+It's about teams that build AI trust incrementally — starting with what's verifiable, moving upward — pulling ahead of teams that either ignore AI or try to automate everything at once. The workflow is designed to be narrow enough to adopt today and extensible enough to grow into whatever the team is ready for next.
diff --git a/PHILOSOPHY.human.md b/PHILOSOPHY.human.md
new file mode 100644
index 0000000..2402805
--- /dev/null
+++ b/PHILOSOPHY.human.md
@@ -0,0 +1,73 @@
+# Philosophy
+
+Read this before you extend the workflow, not before you use it. The agents work without any of this. But if you're making decisions about what to change, what to add, or whether this approach fits your team at all — this is the reasoning behind the choices. Some of those choices look like constraints. They're not. They're opinions.
+
+---
+
+## The conviction
+
+### Start at the bottom, earn the top
+
+Every engineering team runs on a spectrum of judgment. Writing code, running tests, reviewing a diff — these sit close to the machine. There are ground truths. Tests pass or they fail. Logic is sound or it has a bug. A security issue either exists or it doesn't. Now go to the other end: deciding what to build, understanding what a user actually needs, making a design call that balances business constraints with technical reality. No ground truth there. Just experience, context, and taste accumulated over years.
+
+AI should enter that spectrum from the end where it can be verified, not from the end where it can't.
+
+This isn't a limitation on ambition. It's sequencing.
When an AI ships code, an engineer can catch the mistakes with tools they already have. When an AI makes a product call, there's no equivalent check. Asking a team to trust the subjective before they trust the objective isn't bold — it's setting the system up to fail publicly on the hardest possible surface. You don't build trust that way. You destroy it. + +The bottom-up approach works because every layer of trust you establish creates the foundation for the next one. A team that has watched AI reliably deliver correct, tested, reviewed code for a few months thinks differently about what the system could do next. A team that's never trusted the bottom has nothing to build on. + +### Replace one role at a time + +There's a temptation when building agentic systems to think big — automate product, design, engineering, QA, deployment, all at once. The problem is that when a system that complex breaks, and it will, nobody knows where to look. Teams don't iterate on systems they can't debug. They abandon them. + +`eng-team` takes one slice: from a written requirement to a merged pull request. That's it. Everything above that — who decides what to build, how requirements get written — stays with humans. Everything below — how code gets deployed, how production gets monitored — stays with existing tooling. The team doesn't have to change anything else about how it works to get value from this. + +The narrow insertion point is a feature. It means the team can observe the system before trusting it further, and extend it at a pace that matches their comfort rather than the system's ambition. + +### Surface uncertainty, never absorb it silently + +The worst thing an agent can do is proceed on a bad spec. Not because it'll get stuck, but because it'll succeed — it'll write code, pass tests, generate a PR, and everything will look fine until a human tries to use the feature and realizes the agent answered the wrong question. + +Silent assumptions compound in exactly this way. 
The agent fills in a gap, builds confidently against its own interpretation, the reviewer sees clean code and approves, and the mistake doesn't surface until it's expensive to fix. Compare that to an agent that names the gap, stops, and asks. One of those conversations is awkward for five minutes. The other costs days. + +The agents in this workflow are built to surface what they don't know rather than paper over it. That principle also points to something bigger: the quality of what this system produces is bounded by the quality of what goes into it. The interface where humans hand off requirements to agents is the most consequential design decision in the entire workflow. A tight, unambiguous handoff makes everything downstream better. A loose one — free-text, implicit, underspecified — forces agents to guess, and they will guess wrong often enough to matter. + +--- + +## The workflow + +Three agents, strict separation of concerns. A Tech Lead that reads the requirement and writes a technical spec. An Engineer that implements against the spec, writes tests, and commits. A Reviewer that reads the diff and either approves or comes back with specific things to fix. No agent makes decisions outside its lane. The Tech Lead doesn't implement. The Engineer doesn't review. The Reviewer doesn't rewrite — it points. + +They communicate through a shared scratchpad rather than talking to each other directly. This matters for reliability: if anything breaks in the cycle, the scratchpad has the full reasoning trail — the spec, the gaps the engineer flagged, the reviewer's findings. It's not just state management, it's the audit log. + +The Reviewer's job is worth saying clearly: it reads the diff, not the intent. It doesn't care what the agent was trying to do. Code has to stand on its own merits, because that's what ships. A review that approves well-intentioned code with a real security hole is not a good review. 
The Reviewer is built to treat the diff as the only thing that matters. + +--- + +## The vision + +### The interface moves, but doesn't disappear + +Right now a human writes a requirement and hands it to the orchestrator. That's the interface. It works, and it requires nothing from the team — whoever writes requirements today keeps writing them exactly as they do. + +But prose is imprecise by nature. Requirements written by humans carry implicit context, assumptions the author didn't know they were making, and gaps that only become visible when an agent hits them mid-implementation. The Tech Lead surfaces what it can, but some ambiguity always gets through. + +The next step is an agent that sits above the Tech Lead — one that takes a rough requirement, asks the right clarifying questions, and produces a structured, complete spec before engineering starts. When that agent exists, the format of the spec is fully controlled. Every decision the Tech Lead would otherwise have to invent is already explicit. The ambiguity problem doesn't get managed better — it gets resolved before it ever enters the engineering pipeline. + +This compounds as the system grows. Each layer added above the current top generates precise inputs for the layer below. Ambiguity gets pushed further upstream with each addition, until it lands where it belongs: with the person who actually knows what the business needs. That person doesn't disappear. Their job just shifts from writing specs to verifying that the system understood them correctly. + +### The gate that stays human + +No matter how many layers get added, one gate should stay manual: a human confirming the requirement is right before engineering begins. Not editing, not rewriting — just confirming the system captured the intent. That's a business judgment, not a technical one. It's the moment where organizational context and strategy get applied, and those don't live in any agent. 
+
+This gate probably outlasts every other human step in the workflow. Maybe it goes eventually. But it should be the last thing to go.
+
+### Earning the roles above
+
+As the lower layers prove themselves, the system earns the right to grow upward — into QA, deployment, production monitoring, and eventually into roles that assist with design and product decisions. The key word is assist. The higher up you go, the less the agent's job is to execute autonomously and the more it is to prepare the ground for a human decision: surface the options, make the tradeoffs explicit, flag what it doesn't know. A code reviewer can be fully autonomous because right and wrong are checkable. A product agent that makes calls on user needs without oversight is a different risk category entirely.
+
+The conviction isn't that AI eventually owns every role. It's that AI should earn each role in turn — starting at the verifiable end, demonstrating reliability, and moving up only when the track record justifies it. That's how trust works between people. There's no reason it should work differently here.
+
+---
+
+This workflow reflects one team's best current thinking on how to build that trust without betting everything on it at once. Take what fits. Push back on what doesn't. The gaps in the system will surface on their own — they always do.
diff --git a/PHILOSOPHY.md b/PHILOSOPHY.md
new file mode 100644
index 0000000..9a396ed
--- /dev/null
+++ b/PHILOSOPHY.md
@@ -0,0 +1,73 @@
+# Philosophy
+
+This document explains the beliefs `eng-team` is built on, the approach it takes to integrating AI into an engineering team, and where it is going. It is not required reading to *use* the workflow — the agents work without it. It is required reading to *extend* it, to decide whether it fits your team, or to understand why a design decision that looks like a limitation is actually a deliberate choice.
+
+---
+
+## The conviction
+
+### Start at the bottom, earn the top
+
+An engineering team is a hierarchy of judgment. At the bottom sit roles that are closest to computer systems — writing code, running tests, checking security, reviewing for correctness. At the top sit roles that require human intelligence in its most irreducible form — understanding business context, making product decisions, exercising aesthetic judgment, navigating organizational constraints.
+
+The conviction behind `eng-team` is that AI should enter this hierarchy from the bottom, not the top and not all at once.
+
+This is not a statement about AI capability. It is a statement about *trust* and *verifiability*. When an AI writes code, engineers can verify it — tests pass or they don't, the logic holds or it doesn't, the diff makes sense or it doesn't. Engineers already have the tools, the expertise, and the instincts to evaluate code output. When an AI makes a product decision or generates a design direction, there is no equivalent verification apparatus. "Correct" is contextual, subjective, and depends on years of domain knowledge that the team has and the AI doesn't. Asking a team to trust AI with the subjective before they trust it with the objective is asking them to jump to the hardest problem first.
+
+The bottom-up approach earns trust incrementally. A team that has watched AI reliably write correct, reviewed, and tested code is a team ready to think about what the agent above the engineer might look like. A team that has never trusted the bottom has no foundation for trusting anything higher.
+
+### Replace one role at a time
+
+The other dimension of this is disruption. Replacing the entire team workflow at once — product, design, engineering, QA, deployment — requires every part of the organization to change simultaneously, and creates a system so complex that when something goes wrong it is impossible to know where. Teams abandon workflows they can't debug.
+
+`eng-team` replaces one slice: the path from a written requirement to a merged pull request. Everything above that (how the requirement gets written, who decides what to build) and everything below it (how the PR gets deployed, how production is monitored) remains exactly as the team already does it. The insertion point is narrow, the value is immediate, and the team retains full control over everything they already own.
+
+This is not a limitation. It is the design.
+
+### Surface uncertainty, never absorb it silently
+
+An agent that proceeds on a bad spec is more dangerous than one that stops and asks. Silent assumptions — where the agent invents an interpretation of an ambiguous requirement and implements against it — compound quietly. The code looks correct, the tests pass against the agent's interpretation, the reviewer approves the diff, and the PR merges before anyone realizes the feature doesn't match what was intended.
+
+Every agent in the workflow is built to surface uncertainty before it becomes code. When inputs are unclear, the right behavior is to name the gap and wait — not to fill it in and proceed. This is slower in the moment and faster over the full cycle, because the cost of fixing a wrong interpretation after a commit is always higher than the cost of a clarification before one.
+
+The corollary: output quality is bounded by input quality. An agent cannot manufacture intent it wasn't given. The human-AI interface — the point where human requirements become agent inputs — is the most important design surface in the entire system. How that interface is structured determines how much ambiguity the downstream agents have to absorb.
+
+---
+
+## The workflow
+
+`eng-team` implements a three-agent pipeline: a Tech Lead that reads the requirement and produces a technical spec, an Engineer that implements against the spec and ships tested code, and a Reviewer that evaluates the diff and either approves or returns specific, actionable fix instructions. No agent talks directly to another. Each reads its inputs from a shared scratchpad, does its work, writes its outputs back, and signals the orchestrator. The scratchpad is also the audit trail — if something goes wrong anywhere in the cycle, the full reasoning chain is recoverable.
+
+The principles behind each agent are consistent with the conviction above. The Tech Lead does not implement; it specifies, and it asks rather than assumes when the requirement is underspecified. The Engineer does not review; it builds, tests, and commits, and flags what it cannot resolve rather than guessing. The Reviewer reads the diff — not the agent's intent, not the PRD — because code must stand on its own regardless of what was planned. A review that approves intent rather than output is not a review.
+
+Together, the three agents convert a human-authored requirement into a committed branch with a full PR description — a unit of work that a human engineer can inspect, verify, and merge with confidence.
+
+---
+
+## The vision
+
+### The interface moves, but doesn't disappear
+
+Today the interface between human judgment and AI execution is the PRD — a free-text requirement handed to the orchestrator. This is the right starting point: it requires nothing new from the team, and whoever writes requirements today can keep writing them exactly as they do.
+
+The limitation is inherent to prose. Natural language carries implicit context, unstated assumptions, and ambiguities the author doesn't know they're making. The Tech Lead absorbs as much of this as it can and surfaces the rest, but some always slips through.
+ +The next evolution is a PRD Agent that sits above the Tech Lead. Its job is to take a requirement — stated informally, as a business goal or a user complaint or a product idea — and produce a structured, schema-enforced output in exactly the format the Tech Lead expects. When the PRD is AI-generated, the format is fully controlled. Every required decision is explicit. Ambiguity that currently enters at the human-written boundary gets resolved by the PRD Agent before the Tech Lead ever sees it. + +This is the compounding property of building upward. Each new agent added above the current top generates precisely-formatted outputs for the agent below. Ambiguity gets pushed further upstream with every layer, until it reaches the raw human input — the business goal, the user need, the idea — which is the one place ambiguity is irreducible. That is the correct final state: humans own the messy, judgment-heavy inputs at the top, and the system handles everything downstream. + +### The gate that stays human + +Even as new agents are added above, one gate should remain human: the approval of the requirement before it enters the pipeline. Not to rewrite it — to confirm the system captured the intent correctly. This is where the business's intent gets verified, and that is inherently a human responsibility. The interface evolves from "human writes requirement" to "human approves AI-structured requirement" — substantially lower friction, but the checkpoint remains. + +It is probably the last gate to ever be removed, if it is removed at all. + +### Earning the roles above + +As the lower layers stabilize, the system grows upward into roles that carry more human judgment — QA, deployment, production monitoring, eventually design and product assistance. The word "assistance" matters here. The further up the hierarchy you go, the more an agent's role shifts from *executing* to *supporting human decisions*. 
A code reviewer can be fully autonomous because correctness is verifiable. A product agent should surface options and tradeoffs and make its assumptions explicit — but the human makes the call, until the system has earned the trust to do otherwise. + +The conviction is not that AI will eventually replace every role. It is that AI should earn each role from the bottom up, one at a time, demonstrating reliability before taking on more responsibility. That is how trust works between people. There is no reason it should work differently between people and systems. + +--- + +The workflow is one team's attempt to build that trust systematically. Take what's useful. The gaps will make themselves known. From b57384f51072e1181f66920dce0f47dbacdde34e Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Tue, 19 May 2026 14:08:36 +0530 Subject: [PATCH 3/8] Modified Philosophy doc --- PHILOSOPHY.human.md | 73 --------------------------------------------- PHILOSOPHY.md | 8 ----- 2 files changed, 81 deletions(-) delete mode 100644 PHILOSOPHY.human.md diff --git a/PHILOSOPHY.human.md b/PHILOSOPHY.human.md deleted file mode 100644 index 2402805..0000000 --- a/PHILOSOPHY.human.md +++ /dev/null @@ -1,73 +0,0 @@ -# Philosophy - -Read this before you extend the workflow, not before you use it. The agents work without any of this. But if you're making decisions about what to change, what to add, or whether this approach fits your team at all — this is the reasoning behind the choices. Some of those choices look like constraints. They're not. They're opinions. - ---- - -## The conviction - -### Start at the bottom, earn the top - -Every engineering team runs on a spectrum of judgment. Writing code, running tests, reviewing a diff — these sit close to the machine. There are ground truths. Tests pass or they fail. Logic is sound or it has a bug. A security issue either exists or it doesn't. 
Now go to the other end: deciding what to build, understanding what a user actually needs, making a design call that balances business constraints with technical reality. No ground truth there. Just experience, context, and taste accumulated over years. - -AI should enter that spectrum from the end where it can be verified, not from the end where it can't. - -This isn't a limitation on ambition. It's sequencing. When an AI ships code, an engineer can catch the mistakes with tools they already have. When an AI makes a product call, there's no equivalent check. Asking a team to trust the subjective before they trust the objective isn't bold — it's setting the system up to fail publicly on the hardest possible surface. You don't build trust that way. You destroy it. - -The bottom-up approach works because every layer of trust you establish creates the foundation for the next one. A team that has watched AI reliably deliver correct, tested, reviewed code for a few months thinks differently about what the system could do next. A team that's never trusted the bottom has nothing to build on. - -### Replace one role at a time - -There's a temptation when building agentic systems to think big — automate product, design, engineering, QA, deployment, all at once. The problem is that when a system that complex breaks, and it will, nobody knows where to look. Teams don't iterate on systems they can't debug. They abandon them. - -`eng-team` takes one slice: from a written requirement to a merged pull request. That's it. Everything above that — who decides what to build, how requirements get written — stays with humans. Everything below — how code gets deployed, how production gets monitored — stays with existing tooling. The team doesn't have to change anything else about how it works to get value from this. - -The narrow insertion point is a feature. 
It means the team can observe the system before trusting it further, and extend it at a pace that matches their comfort rather than the system's ambition. - -### Surface uncertainty, never absorb it silently - -The worst thing an agent can do is proceed on a bad spec. Not because it'll get stuck, but because it'll succeed — it'll write code, pass tests, generate a PR, and everything will look fine until a human tries to use the feature and realizes the agent answered the wrong question. - -Silent assumptions compound in exactly this way. The agent fills in a gap, builds confidently against its own interpretation, the reviewer sees clean code and approves, and the mistake doesn't surface until it's expensive to fix. Compare that to an agent that names the gap, stops, and asks. One of those conversations is awkward for five minutes. The other costs days. - -The agents in this workflow are built to surface what they don't know rather than paper over it. That principle also points to something bigger: the quality of what this system produces is bounded by the quality of what goes into it. The interface where humans hand off requirements to agents is the most consequential design decision in the entire workflow. A tight, unambiguous handoff makes everything downstream better. A loose one — free-text, implicit, underspecified — forces agents to guess, and they will guess wrong often enough to matter. - ---- - -## The workflow - -Three agents, strict separation of concerns. A Tech Lead that reads the requirement and writes a technical spec. An Engineer that implements against the spec, writes tests, and commits. A Reviewer that reads the diff and either approves or comes back with specific things to fix. No agent makes decisions outside its lane. The Tech Lead doesn't implement. The Engineer doesn't review. The Reviewer doesn't rewrite — it points. - -They communicate through a shared scratchpad rather than talking to each other directly. 
This matters for reliability: if anything breaks in the cycle, the scratchpad has the full reasoning trail — the spec, the gaps the engineer flagged, the reviewer's findings. It's not just state management, it's the audit log. - -The Reviewer's job is worth saying clearly: it reads the diff, not the intent. It doesn't care what the agent was trying to do. Code has to stand on its own merits, because that's what ships. A review that approves well-intentioned code with a real security hole is not a good review. The Reviewer is built to treat the diff as the only thing that matters. - ---- - -## The vision - -### The interface moves, but doesn't disappear - -Right now a human writes a requirement and hands it to the orchestrator. That's the interface. It works, and it requires nothing from the team — whoever writes requirements today keeps writing them exactly as they do. - -But prose is imprecise by nature. Requirements written by humans carry implicit context, assumptions the author didn't know they were making, and gaps that only become visible when an agent hits them mid-implementation. The Tech Lead surfaces what it can, but some ambiguity always gets through. - -The next step is an agent that sits above the Tech Lead — one that takes a rough requirement, asks the right clarifying questions, and produces a structured, complete spec before engineering starts. When that agent exists, the format of the spec is fully controlled. Every decision the Tech Lead would otherwise have to invent is already explicit. The ambiguity problem doesn't get managed better — it gets resolved before it ever enters the engineering pipeline. - -This compounds as the system grows. Each layer added above the current top generates precise inputs for the layer below. Ambiguity gets pushed further upstream with each addition, until it lands where it belongs: with the person who actually knows what the business needs. That person doesn't disappear. 
Their job just shifts from writing specs to verifying that the system understood them correctly. - -### The gate that stays human - -No matter how many layers get added, one gate should stay manual: a human confirming the requirement is right before engineering begins. Not editing, not rewriting — just confirming the system captured the intent. That's a business judgment, not a technical one. It's the moment where organizational context and strategy get applied, and those don't live in any agent. - -This gate probably outlasts every other human step in the workflow. Maybe it goes eventually. But it should be the last thing to go. - -### Earning the roles above - -As the lower layers prove themselves, the system earns the right to grow upward — into QA, deployment, production monitoring, and eventually into roles that assist with design and product decisions. The key word is assist. The higher up you go, the less the agent's job is to execute autonomously and the more it is to prepare the ground for a human decision: surface the options, make the tradeoffs explicit, flag what it doesn't know. A code reviewer can be fully autonomous because right and wrong are checkable. A product agent that makes calls on user needs without oversight is a different risk category entirely. - -The conviction isn't that AI eventually owns every role. It's that AI should earn each role in turn — starting at the verifiable end, demonstrating reliability, and moving up only when the track record justifies it. That's how trust works between people. There's no reason it should work differently here. - ---- - -This workflow reflects one team's best current thinking on how to build that trust without betting everything on it at once. Take what fits. Push back on what doesn't. The gaps in the system will surface on their own — they always do. 
diff --git a/PHILOSOPHY.md b/PHILOSOPHY.md index 9a396ed..16c7e20 100644 --- a/PHILOSOPHY.md +++ b/PHILOSOPHY.md @@ -24,14 +24,6 @@ The other dimension of this is disruption. Replacing the entire team workflow at This is not a limitation. It is the design. -### Surface uncertainty, never absorb it silently - -An agent that proceeds on a bad spec is more dangerous than one that stops and asks. Silent assumptions — where the agent invents an interpretation of an ambiguous requirement and implements against it — compound quietly. The code looks correct, the tests pass against the agent's interpretation, the reviewer approves the diff, and the PR merges before anyone realizes the feature doesn't match what was intended. - -Every agent in the workflow is built to surface uncertainty before it becomes code. When inputs are unclear, the right behavior is to name the gap and wait — not to fill it in and proceed. This is slower in the moment and faster over the full cycle, because the cost of fixing a wrong interpretation after a commit is always higher than the cost of a clarification before one. - -The corollary: output quality is bounded by input quality. An agent cannot manufacture intent it wasn't given. The human-AI interface — the point where human requirements become agent inputs — is the most important design surface in the entire system. How that interface is structured determines how much ambiguity the downstream agents have to absorb. 
- --- ## The workflow From e77548fedaf357cd6a534b0d5402c66a6fdcd5e5 Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Tue, 19 May 2026 14:11:09 +0530 Subject: [PATCH 4/8] Removed unnecessary file changes --- .claude/agents/engineer.md | 16 +++++----------- .claude/agents/tech-lead.md | 35 +++++------------------------------ 2 files changed, 10 insertions(+), 41 deletions(-) diff --git a/.claude/agents/engineer.md b/.claude/agents/engineer.md index 231fed6..66c5701 100644 --- a/.claude/agents/engineer.md +++ b/.claude/agents/engineer.md @@ -27,14 +27,8 @@ Read the scratchpad. Study `technical_spec` carefully: - `approach` — implementation strategy to follow - `test_approach` — what to test and how - `acceptance_criteria` — your definition of done -- `style_conventions` — style rules to follow when writing or modifying code -Also read `repo_context` from the scratchpad top level (written by the Tech Lead): -- `test_command` — use this to run the test suite in Steps 4 and 5 -- `lint_command` — use this to run the linter in Step 5 -- `style_notes` — supplementary style rules for the whole repo - -Do **not** re-read `CLAUDE.md` — everything you need is already in the scratchpad. +Also read `CLAUDE.md` for code style conventions and how to run tests. ### Step 2 — Set up the branch ```bash @@ -50,10 +44,10 @@ git checkout -b 2>/dev/null || git checkout Work through `files_to_create` and `files_to_modify` in order. After each logical unit of work, run the relevant tests — don't save all testing for the end. **For each file you create or modify:** -1. Read the file you are editing (or the closest existing file in the same module if creating a new one) to confirm local style. Use `technical_spec.style_conventions` and `repo_context.style_notes` for guidance — do **not** read surrounding files just for style reference. +1. Read any adjacent files first to match style exactly (indentation, import order, naming) 2. Implement the change 3. 
Write or extend the corresponding test file immediately -4. Run the tests for that file using `repo_context.test_command` +4. Run the tests for that file: check the test command in CLAUDE.md **Test writing rules:** - Follow the existing test file structure and naming convention exactly @@ -62,10 +56,10 @@ Work through `files_to_create` and `files_to_modify` in order. After each logica - Do not write tests that always pass — they must actually assert the behaviour ### Step 4 — Run the full test suite -Once all changes are done, run the full test suite using `repo_context.test_command` from the scratchpad. Fix any failures before committing. If a pre-existing test breaks, investigate — do not delete it. +Once all changes are done, run the full test suite (command in CLAUDE.md). Fix any failures before committing. If a pre-existing test breaks, investigate — do not delete it. ### Step 5 — Run the linter -Run the linter using `repo_context.lint_command` from the scratchpad. Fix all errors. Warnings are acceptable if they pre-existed. +Run the project linter (command in CLAUDE.md or check `package.json` scripts / `pyproject.toml`). Fix all errors. Warnings are acceptable if they pre-existed. ### Step 6 — Commit Stage only the files you intentionally changed: diff --git a/.claude/agents/tech-lead.md b/.claude/agents/tech-lead.md index 5c27574..b3e6d7b 100644 --- a/.claude/agents/tech-lead.md +++ b/.claude/agents/tech-lead.md @@ -16,27 +16,15 @@ If your instructions contain a `Spec gaps to clarify:` section, you are in **cla ## Process -### Step 1 — Read CLAUDE.md and extract repo context +### Step 1 — Read CLAUDE.md Read `CLAUDE.md` in the repo root first. It contains the architecture overview, key directories, conventions, and test commands. This is your primary knowledge source — do not re-explore things it already documents. 
-Extract the following into a `repo_context` object (you will write this to the scratchpad — downstream agents read it from there instead of re-reading CLAUDE.md): -- `test_command` — the exact command to run the full test suite -- `lint_command` — the exact command to run the linter -- `branch_convention` — the branch naming pattern (e.g. `feat/`, `fix/`) -- `key_directories` — list of important directories from the architecture overview -- `style_notes` — any explicit style rules (indentation, import order, naming, framework conventions) - ### Step 2 — Targeted exploration only -Based on the PRD and CLAUDE.md, identify what you still need to understand. Do NOT scan the whole repo — only look at what is directly relevant. - -**Issue all Grep and Glob calls in a single parallel batch before reading any files.** This means: form the complete list of symbols, routes, or modules you need to locate, then call Grep/Glob for all of them at once. Only after all results come back, issue Read calls for the files you actually need. - +Based on the PRD and CLAUDE.md, identify what you still need to understand. Do NOT scan the whole repo — only look at what is directly relevant: - Use `Grep` to find the specific routes, handlers, or modules the PRD touches - Use `Read` to read those files (not the whole directory) - Use `Glob` only to confirm file locations if CLAUDE.md is unclear -While reading the files the PRD touches, extract style conventions for the relevant module: indentation style, import ordering, naming patterns, any framework-specific idioms. You will write these as `style_conventions` in the spec so the Engineer can match style without re-reading adjacent files. - ### Step 3 — Decide the branch name Choose a branch name following the repo's convention (check `git branch -a` output or CLAUDE.md). Format: `type/short-description` e.g. `feat/cart-rate-limiter`, `fix/order-null-pointer`. 
@@ -51,19 +39,7 @@ Before writing the spec, classify the change: Write a focused Technical Spec into the scratchpad. Keep it concrete — file paths, function names, config keys. The Engineer must be able to implement without guessing. ## Output -Update the scratchpad JSON with the `repo_context` object at the **top level** (alongside `technical_spec`, not nested inside it): - -```json -"repo_context": { - "test_command": "npm test", - "lint_command": "npm run lint", - "branch_convention": "feat/ or fix/ prefix", - "key_directories": ["src/routes", "src/services", "tests/"], - "style_notes": "2-space indent, single quotes, named exports, Jest for tests" -} -``` - -Then update the scratchpad JSON with this object under the key `"technical_spec"`: +Update the scratchpad JSON with this object under the key `"technical_spec"`: ```json { @@ -95,8 +71,7 @@ Then update the scratchpad JSON with this object under the key `"technical_spec" "Describe the observable outcome that confirms the feature works", "Describe the failure/edge case that must also be handled" ], - "out_of_scope": ["List anything the PRD implies but this spec intentionally defers"], - "style_conventions": "Concise prose: indentation, import ordering, naming conventions, and any framework idioms observed in the files this spec touches. The Engineer uses this to match style without reading adjacent files." + "out_of_scope": ["List anything the PRD implies but this spec intentionally defers"] } ``` @@ -110,7 +85,7 @@ When invoked with `Spec gaps to clarify:`, read the current spec from the scratc 2. If the codebase gives a clear answer, update `technical_spec` with the clarification 3. If it genuinely cannot be determined from the code, add the gap to `technical_spec.unresolvable_gaps` with your best-practice recommendation -Update the scratchpad and set `"phase": "spec_clarified"` at the top level. Do not overwrite `repo_context` — it was written in the initial run and is still valid. 
+Update the scratchpad and set `"phase": "spec_clarified"` at the top level. ## Rules - Be specific. Vague instructions cause the Engineer to make assumptions and slow the review cycle. From 775ce63d9f51cb2e105b72ee755198249d8eac28 Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Wed, 20 May 2026 16:32:45 +0530 Subject: [PATCH 5/8] Quality gates TODO added --- quality-gates.md | 130 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 130 insertions(+) create mode 100644 quality-gates.md diff --git a/quality-gates.md b/quality-gates.md new file mode 100644 index 0000000..6ebf697 --- /dev/null +++ b/quality-gates.md @@ -0,0 +1,130 @@ +# Quality Gates for Autonomous Engineering Teams + +> How to ensure eng-team agents always produce high-quality code, stay on scope, and build human trust in AI-generated PRs. + +--- + +## Why AI Code Quality Degrades + +Before picking gates, it helps to name the failure modes: + +- **Too little context** — the agent doesn't know the repo's conventions, so it invents patterns. +- **Too little scope discipline** — the agent over-engineers because nothing stops it. +- **No verification loop** — the agent writes code and hands it off without checking if it actually works or matches intent. +- **No adversarial review** — the same agent that wrote the code also "reviewed" it. + +Each gate in this document targets one or more of these root causes. + +--- + +## Stage 1: Pre-flight (Before a Single Line Is Written) + +The highest-leverage point is *before* implementation starts. The agent must produce a **plan artifact** — a structured document that states: + +- Which files will change +- Rough line count estimate +- Implementation approach +- How the approach maps to each acceptance criterion in the PRD + +This costs almost nothing and surfaces the biggest risks before wasted compute. 
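
A minimal sketch of how a plan artifact like this could be represented and checked for unmapped acceptance criteria, assuming Python. The `PlanArtifact` class, its field names, and the `unmapped_criteria` helper are hypothetical illustrations, not part of eng-team; the sample paths and criteria are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PlanArtifact:
    """Hypothetical shape for the pre-flight plan; field names are illustrative."""
    files_to_change: list[str]
    estimated_loc: int
    approach: str
    # Maps each PRD acceptance criterion to the part of the approach covering it.
    criteria_coverage: dict[str, str] = field(default_factory=dict)


def unmapped_criteria(plan: PlanArtifact, prd_criteria: list[str]) -> list[str]:
    """Return the acceptance criteria the plan does not address, in PRD order."""
    return [
        criterion
        for criterion in prd_criteria
        if not plan.criteria_coverage.get(criterion, "").strip()
    ]


# Invented example data: one criterion is covered, one is not.
plan = PlanArtifact(
    files_to_change=["src/routes/cart.js", "tests/cart.test.js"],
    estimated_loc=80,
    approach="Add a rate limiter middleware to the cart route.",
    criteria_coverage={
        "Requests over the limit get HTTP 429": "middleware returns 429",
    },
)
missing = unmapped_criteria(
    plan,
    ["Requests over the limit get HTTP 429", "Limit is configurable"],
)
print(missing)
```

A non-empty result is a reason to send the plan back before any implementation starts, which is the cheap failure this stage is meant to produce.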
+ +The plan is checked against: + +**Scope reasonableness** — If the plan touches more than a threshold number of files or LOC for a small feature, that's a flag to surface before implementation begins. + +**Repo structure alignment** — Does the plan follow existing module boundaries, naming conventions, and architectural patterns? A well-maintained `CLAUDE.md` is the primary mechanism here — treat it as a constitution that all agents must read and cite in their plan. + +**Test-first commitment** — The agent declares what tests it will write before writing any implementation. This forces real thinking about the contract, not just the code. + +--- + +## Stage 2: In-flight Controls (While Implementing) + +**Incremental, reviewable commits** — Rather than one giant diff at the end, each logical chunk (a new function, a schema change, a new component) should be a discrete commit. This makes the diff auditable incrementally and makes it far easier to spot drift. + +**Self-critique step** — After writing each logical unit, the agent reads its own diff and answers: +- Is this the minimum change needed? +- Does it follow the pattern used elsewhere in the codebase? +- Am I introducing anything that wasn't in the PRD? + +Catching drift mid-implementation is dramatically cheaper than catching it at review. + +--- + +## Stage 3: Post-implementation Gates (Before PR Is Opened) + +These are the mechanical, automated checks that form the quality floor. + +### Tests Must Pass +The full existing test suite must pass before a PR is opened. If the agent breaks tests, the PR does not open. This is enforced mechanically, not left to the agent's judgment. + +### Test Coverage on New Code +The agent is required to write tests for its own additions. Coverage thresholds apply to the **diff** — not just overall repo coverage — to catch cases where the agent ships logic with zero tests. 
+ +### Static Analysis and Linting +TypeScript strict mode, ESLint, formatters, and any other repo-configured tools must pass at zero-tolerance. The agent runs and fixes these locally before the PR opens. + +### Diff Size Audit +Compare the size of the PR (files changed, LOC) against the stated complexity of the PRD. A one-sentence feature request that produces a 1200-line PR is a signal worth surfacing — it doesn't mean the PR is wrong, but it should trigger human scrutiny before merge. + +### File Blast Radius Check +Which files were modified? If the agent touched a shared utility, a config file, or anything outside the expected module scope, that must be explicitly flagged in the PR description. Unexpected file changes are one of the most common sources of subtle regressions. + +--- + +## Stage 4: The Adversarial Reviewer Agent + +This is the highest-trust gate and the most important one to get right. + +**The agent that writes the code must never be the sole reviewer.** + +A separate agent instance — with fresh context and no attachment to the implementation — reads the PRD and the diff, then answers a structured checklist: + +- Does every acceptance criterion have corresponding code and a test? +- Is there any code that wasn't required by the PRD? +- Are there patterns that diverge from the existing codebase? +- Are there obvious edge cases not handled? +- Is the PR description accurate and complete? + +The output is a **structured review report** attached to the PR. When the human reviewer opens the PR, they see the AI reviewer's assessment alongside the diff — surfacing disagreements, flags, and open questions. This reduces the cognitive load on the human reviewer and focuses their attention where it matters. + +--- + +## Stage 5: Building Trust Over Time + +The gates above catch bad output in the moment. Sustained trust requires a feedback loop. 
+ +**Capture human corrections** — Every time a human reviewer modifies an AI-generated PR, that change should be captured — as an annotated example or a `CLAUDE.md` update. This creates a growing library of "this is what we do here and why," progressively calibrating future agents to the team's standards. + +**Retrospective evals** — Periodically sample merged AI PRs, strip context, and ask a fresh agent: "How would you implement this PRD given this codebase?" If the approach diverges significantly from what was merged, the agents are drifting from what the team actually wants. Use those diffs to improve the `CLAUDE.md` and agent prompts. + +--- + +## Implementation Priority + +Implement in this order for the best return on investment: + +| Priority | Gate | What it addresses | +|---|---|---| +| 1 | `CLAUDE.md` with explicit conventions | Gives agents repo context | +| 2 | Plan artifact + scope check | Catches over-engineering before it happens | +| 3 | Full test suite enforcement | Sets a non-negotiable quality floor | +| 4 | Adversarial reviewer agent | Builds human trust most directly | +| 5 | Diff size + blast radius audit | Catches subtle scope creep | +| 6 | Feedback capture loop | Compounds quality improvements over time | + +--- + +## Summary + +No single gate is sufficient because the failure modes are different at each stage. The combination of: + +- A strong **pre-flight** (scope discipline + plan artifact) +- **Mechanical post-implementation gates** (tests, linting, diff audit) +- An **adversarial reviewer** (independent judgment on correctness and fit) + +...covers the three biggest failure modes. The rest is refinement and iteration as the team builds its feedback corpus. 
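
As an illustration of the mechanical gates in Stage 3, the diff size audit and file blast radius check might be sketched as below. This is a sketch under stated assumptions: it parses the tab-separated output of `git diff --numstat` passed in as a string, and the `audit_diff` helper, the `expected_prefixes` list, and the `max_loc` threshold are hypothetical names chosen for the example. Renamed files, which `--numstat` can report in an `old => new` form, are not handled.

```python
def parse_numstat(numstat: str) -> list[tuple[str, int]]:
    """Parse `git diff --numstat` output into (path, changed_lines) pairs."""
    results = []
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported as "-"; count them as zero changed lines.
        changed = sum(int(n) for n in (added, deleted) if n != "-")
        results.append((path, changed))
    return results


def audit_diff(numstat: str, expected_prefixes: list[str], max_loc: int) -> list[str]:
    """Return human-readable flags; an empty list means nothing to surface."""
    files = parse_numstat(numstat)
    flags = []
    total = sum(changed for _, changed in files)
    if total > max_loc:
        flags.append(f"diff size {total} LOC exceeds threshold {max_loc}")
    for path, _ in files:
        # Anything outside the expected module scope is flagged, not rejected.
        if not path.startswith(tuple(expected_prefixes)):
            flags.append(f"unexpected file outside module scope: {path}")
    return flags


# Invented sample: two in-scope files and one shared config file.
sample = "40\t2\tsrc/routes/cart.js\n10\t0\ttests/cart.test.js\n3\t1\tpackage.json\n"
flags = audit_diff(sample, expected_prefixes=["src/routes/", "tests/"], max_loc=50)
for flag in flags:
    print(flag)
```

The helper only produces flags for the PR description. Consistent with the text above, a large diff or an unexpected file is a signal for human scrutiny, not an automatic rejection.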
+ +--- + +*Document authored from eng-team architectural discussion — May 2026.* From 1b8c145e65fa348eb817d423ff084355c5dfccdb Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Wed, 20 May 2026 16:50:14 +0530 Subject: [PATCH 6/8] Add multi-repo quality gates philosophy doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extends the single-repo quality gate model to microservices environments — covering system-level context, contract-first planning, sequenced parallel execution, and cross-repo adversarial review. Co-Authored-By: Claude Sonnet 4.6 --- quality-gates-multi-repo.md | 96 +++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) create mode 100644 quality-gates-multi-repo.md diff --git a/quality-gates-multi-repo.md b/quality-gates-multi-repo.md new file mode 100644 index 0000000..27d3cf9 --- /dev/null +++ b/quality-gates-multi-repo.md @@ -0,0 +1,96 @@ +# Quality Gates in a Multi-Repo / Microservices Environment + +> Extending the single-repo quality philosophy to systems where logic is split across services, shared libraries, and utilities — and where one PRD may touch more than one repo. + +--- + +## Why Single-Repo Gates Are Not Enough + +The quality gates designed for a single repo assume one CLAUDE.md, one test suite, one diff to review. In a microservices environment, the failure modes multiply: + +- An agent has full context on Service A but no idea Service B even exists +- A schema change in a shared library silently breaks three consumers +- Two engineer agents modify overlapping contracts in parallel with no coordination +- The reviewer only sees one diff but the bug lives in the interaction between two services + +Each of these requires a gate that simply doesn't exist in the single-repo model. + +--- + +## New Layer Required: System-Level Context + +Each repo keeps its own `CLAUDE.md` for local conventions. 
But multi-repo changes require an additional layer — an org-level architecture document that every agent reads before planning. + +This document covers: + +- **Service inventory** — what each service owns, its public API surface, who calls it +- **Dependency graph** — which services depend on which, where contracts live (OpenAPI specs, proto files, shared types) +- **Cross-cutting conventions** — auth patterns, error formats, event schemas — things that must be consistent across all services + +This is not a living wiki maintained by hand. It should be partially auto-generated from actual API specs, import graphs, and event bus subscriptions — so it reflects the real system, not someone's memory of it. + +--- + +## Gate 1: Contract-First Planning + +For any multi-repo PRD, the tech-lead's plan artifact must answer: *which service boundaries does this change cross?* + +Any change that touches an API contract, event schema, or shared type must be declared upfront — before any engineer agent starts writing code. The plan names the contract change explicitly. All downstream service changes are derived from it. + +This enforces the right order of operations: **contracts first, implementations second.** Agents cannot drift into incompatible assumptions if the contract is locked before they start. + +--- + +## Gate 2: Sequenced Parallel Execution + +Multiple engineer agents can work in parallel on separate services — but only after the contract is settled. + +The coordination rule: **no agent touches a service that depends on a contract change until that contract change is finalized.** + +This is a sequencing constraint, not a quality check. Violating it means two agents make independent assumptions about the same interface, and both may be wrong in ways that only surface at integration time. + +--- + +## Gate 3: Cross-Repo Adversarial Reviewer + +The single-repo reviewer reads one diff. 
In multi-repo, the reviewer must read all diffs together and specifically check: + +- Are all contract changes backward compatible — or is there a coordinated breaking change with a migration plan? +- Does Service B's usage of the new API actually match what Service A implemented? +- Is there a service that calls the changed interface that wasn't included in the plan? + +This reviewer requires the dependency graph from the system-level context layer to know which services to check. It cannot discover blast radius from the diffs alone. + +--- + +## Gate 4: Contract Tests as the Quality Floor + +In a single repo, the test suite is the quality floor. In multi-repo, the equivalent is **contract tests** — consumer-driven tests that run against the producer's implementation. + +Every service that publishes an API should have contract tests defined by its consumers. These are the only automated checks that can catch cross-service incompatibilities before integration. Unit tests and linting within each service will not surface interface mismatches. + +--- + +## Revised Implementation Priority (Multi-Repo) + +| Priority | Gate | What it addresses | +|---|---|---| +| 1 | System-level architecture document | Gives agents cross-service context | +| 2 | Per-repo `CLAUDE.md` | Gives agents local conventions | +| 3 | Contract-first plan artifact | Prevents incompatible parallel implementations | +| 4 | Sequenced parallel execution | Enforces contract-before-consumer ordering | +| 5 | Full test suite per repo + contract tests | Sets the automated quality floor | +| 6 | Cross-repo adversarial reviewer | Catches interface mismatches across diffs | +| 7 | Feedback capture loop | Compounds quality improvements over time | + +--- + +## The Unsolved Problem + +The system-level architecture document is only as good as its maintenance discipline. In a fast-moving microservices environment, the dependency graph goes stale quickly. 
+ +The real answer is that this document needs to be auto-generated — derived from actual API specs, import graphs, and event bus subscriptions — not maintained by hand. Until that tooling exists, the document is a useful approximation, not a guarantee. Treat it as the best available context, and build agents that flag when they encounter service references not covered by it. + +--- + +*Document authored from eng-team architectural discussion — May 2026.* From 97a1e4ca5a0b90036e5b65b2aa799fa7959972fe Mon Sep 17 00:00:00 2001 From: Prashant Mishra Date: Wed, 20 May 2026 16:57:13 +0530 Subject: [PATCH 7/8] Add multi-repo agents doc Documents the Contract Agent (new) and extensions to existing agents (Orchestrator, Tech Lead, Engineer, Reviewer) needed to support multi-repo PRDs with parallel execution and contract-first sequencing. Co-Authored-By: Claude Sonnet 4.6 --- agents-multi-repo.md | 120 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 agents-multi-repo.md diff --git a/agents-multi-repo.md b/agents-multi-repo.md new file mode 100644 index 0000000..cb268da --- /dev/null +++ b/agents-multi-repo.md @@ -0,0 +1,120 @@ +# Agents for Multi-Repo Changes + +> Documents the new agents introduced to handle multi-repo PRDs, what each one does, and how they integrate into the existing eng-team workflow. + +--- + +## Updated Workflow + +``` +Orchestrator + └── Tech Lead (multi-repo plan + blast radius analysis) + └── Contract Agent (locks interface changes before anyone implements) + └── Engineer × N (parallel, one per repo, gated on contract finalization) + └── Reviewer (cross-repo, reads all diffs together) +``` + +The Orchestrator drives the full pipeline. The core sequence — Tech Lead → Engineer → Reviewer — stays the same. Two things change: the Contract Agent is inserted between planning and implementation, and the Engineer step becomes parallel across repos. 
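
The gating in the workflow above can be sketched with `asyncio`: engineers that depend on the contract wait on a shared event, while independent engineers start right away. This is an illustrative sketch only. The function names, the `needs_contract` flag, and the service names are assumptions made for the example, and real agents would do work where this code appends to a log.

```python
import asyncio


async def engineer(repo: str, needs_contract: bool,
                   contract_ready: asyncio.Event, log: list[str]) -> None:
    """Stand-in for one Engineer agent working on a single repo."""
    if needs_contract:
        # Contract-dependent engineers block until the contract is finalized.
        await contract_ready.wait()
    log.append(f"engineer:{repo}")


async def contract_agent(contract_ready: asyncio.Event, log: list[str]) -> None:
    """Stand-in for the Contract Agent publishing the contract artifact."""
    log.append("contract:finalized")
    contract_ready.set()


async def run_pipeline(repos: dict[str, bool]) -> list[str]:
    """Run all engineers in parallel, gated on the contract where required."""
    contract_ready = asyncio.Event()
    log: list[str] = []
    engineers = [
        engineer(repo, needs_contract, contract_ready, log)
        for repo, needs_contract in repos.items()
    ]
    await asyncio.gather(*engineers, contract_agent(contract_ready, log))
    return log


# Hypothetical repos: service-a consumes the changed contract, service-b does not.
log = asyncio.run(run_pipeline({"service-a": True, "service-b": False}))
print(log)
```

Using a single event keeps the rule easy to audit: the only way a dependent engineer can proceed is for the contract step to have completed first.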
+ +--- + +## Existing Agents — What Changes + +### Orchestrator (extended, not replaced) + +The Orchestrator already handles sequencing and agent coordination. For multi-repo changes it needs two behavioral additions: + +- **Parallel engineer dispatch** — spins up one Engineer agent per repo rather than always one +- **Contract gate enforcement** — Engineer agents that depend on a contract change are blocked from starting until the Contract Agent has finalized and published the contract artifact + +No new agent is needed here. This is a logic and configuration extension. + +### Tech Lead (extended) + +In addition to its existing responsibilities, the Tech Lead must: + +- Read the **system-level architecture document** (service inventory, dependency graph, cross-cutting conventions) before writing the plan +- Identify which service boundaries the PRD crosses +- Produce a **cross-repo plan artifact** that maps each acceptance criterion to a specific repo and lists every file expected to change per repo +- Flag any public API surface, event schema, or shared type that will be modified — this is the signal that triggers the Contract Agent + +The plan artifact is the Orchestrator's input for deciding whether to invoke the Contract Agent and how many Engineer agents to spin up. + +### Engineer (extended) + +Each Engineer agent operates on a single repo, same as before. The changes are: + +- Multiple instances run in parallel, one per repo +- Instances that depend on a contract change receive the finalized contract artifact as additional context before starting +- Instances working on independent repos (no shared contract dependency) start immediately in parallel + +### Reviewer (extended) + +The Reviewer receives all diffs across all repos simultaneously and adds three additional checks to its existing checklist: + +- Does Service B's usage of the new API match what Service A implemented?
+- Is there any service that calls the changed interface that was not included in the plan?
+- Are all contract changes backward compatible, or is there a coordinated breaking change with an explicit migration plan?
+
+The Reviewer requires the dependency graph from the system-level architecture document to know which services to check — it cannot discover blast radius from diffs alone.
+
+---
+
+## New Agent: Contract Agent
+
+### When it is invoked
+
+The Orchestrator invokes the Contract Agent when either of the following is true:
+
+- The PRD touches more than one repo and the Tech Lead's plan flags a contract shared between them
+- The PRD touches a single repo but the Tech Lead's plan flags a change to a public API surface, event schema, or shared library interface
+
+For purely internal single-repo changes (business logic, UI, infra config with no public interface change) and for multi-repo changes across independent services, the Contract Agent is skipped entirely.
+
+### What it does
+
+The Contract Agent owns the interface boundary between services. Its job is to produce a **contract artifact** — a precise, versioned definition of what is changing at the service boundary — and lock it before any Engineer agent starts implementing.
+
+Specifically, it:
+
+1. Reads the Tech Lead's plan artifact and the current interface definitions (OpenAPI specs, proto files, shared types, event schemas) for all affected services
+2. Produces a diff of the contract change — what is being added, modified, or removed at the interface boundary
+3. Checks backward compatibility — flags breaking changes and requires an explicit migration or versioning plan if any exist
+4. Publishes the finalized contract artifact so downstream Engineer agents can use it as a source of truth
+5. Blocks the Orchestrator from starting any dependent Engineer agent until the artifact is published
+
+### What it does not do
+
+- It does not write implementation code
+- It does not modify business logic
+- It does not review the final diffs — that is the Reviewer's job
+
+### Output
+
+A contract artifact containing:
+- The precise interface change (structured diff of the API surface)
+- Backward compatibility assessment (compatible / breaking + migration plan)
+- A list of all services that consume the changed interface, derived from the dependency graph
+
+---
+
+## Sequencing Rules
+
+| Condition | Contract Agent | Engineer agents |
+|---|---|---|
+| Single-repo, no public interface change | Skipped | One agent, starts immediately |
+| Single-repo, public interface change | Invoked | Starts after contract is finalized |
+| Multi-repo, independent services (no shared contract) | Skipped | All agents start in parallel immediately |
+| Multi-repo, shared contract change | Invoked | Contract-dependent agents wait; independent agents start immediately |
+
+---
+
+## Summary
+
+One new agent is introduced: the **Contract Agent**. It fills a gap in the current workflow: no role is responsible for locking interface changes before implementation begins. Without it, parallel Engineer agents make independent assumptions about the same interface, and mismatches only surface at review or integration.
+
+Everything else — Orchestrator, Tech Lead, Engineer, Reviewer — retains its existing role and gains scoped extensions to handle multi-repo context and parallel execution.
+
+---
+
+*Document authored from eng-team architectural discussion — May 2026.*
From 4da2717c0a8af5e8da6cda30eaa5e3fc66b3b564 Mon Sep 17 00:00:00 2001
From: Prashant Mishra
Date: Thu, 21 May 2026 11:10:04 +0530
Subject: [PATCH 8/8] Add harness improvement summary from Code as Agent Harness survey.

Documents how the arXiv survey maps to eng-team and a prioritized quality-gates backlog for scratchpad state, verification, and metrics.

Co-authored-by: Cursor
---
 docs/code-as-agent-harness.md | 141 ++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 docs/code-as-agent-harness.md

diff --git a/docs/code-as-agent-harness.md b/docs/code-as-agent-harness.md
new file mode 100644
index 0000000..504924d
--- /dev/null
+++ b/docs/code-as-agent-harness.md
@@ -0,0 +1,141 @@
+# Code as Agent Harness — Implications for eng-team
+
+Summary of how [Code as Agent Harness](https://arxiv.org/abs/2605.18747) (Ning et al., 2026) relates to **eng-team**, and a prioritized backlog for strengthening the harness.
+
+---
+
+## Article in one paragraph
+
+The survey argues that in agentic systems, **code is not only output** — it is the **operational harness**: the executable substrate for reasoning, acting, environment modeling, and verification. A good harness makes behavior **executable, inspectable, stateful, and verifiable** over long horizons. Progress depends as much on harness engineering (tools, memory, oracles, control loops, multi-agent shared state) as on the base model.
+
+---
+
+## How eng-team already fits
+
+eng-team is a **code-centric agent harness** for the slice PRD → spec → implementation → review → merge-ready PR:
+
+| Paper layer | eng-team today |
+|-------------|----------------|
+| **Harness interface** | `CLAUDE.md`, `technical_spec`, Engineer edits, Reviewer `git diff` |
+| **Harness mechanisms** | Orchestrator phases, bounded loops, `repo_context`, test/lint gates |
+| **Multi-agent over code** | Tech Lead → Engineer → Reviewer via `.eng_team/task_*.json` (orchestrator-only; no peer chat) |
+| **Verifiable closure** | Tests + linter + structured review checklist |
+
+This aligns with **PHILOSOPHY.md**: bottom-up trust, narrow insertion point, diff-based review (output over intent).
+
+The article does **not** suggest replacing this design. It names what to harden next: **oracle quality**, **shared state discipline**, **harness telemetry**, and **governed iteration**.
+
+---
+
+## Key upgrades (article → eng-team)
+
+### 1. Scratchpad as program state
+
+Extend `.eng_team/task_*.json` beyond narrative logging:
+
+- `verification_evidence` (tests run, linter result, diff stats)
+- `assumptions[]` with `verified_by` (test / diff hunk / reviewer item)
+- Per-phase `read_set` / `write_set`
+- Commit pins: `base_commit`, `spec_version`, `impl_commit`
+
+*Paper: §2.3, §4.2, §5.2.4 — transactional shared program state.*
+
+### 2. Verification stack (not only “tests passed”)
+
+On approval, require an **evidence bundle** and explicit limits:
+
+- What was checked (unit / integration / security hints / coverage on touched files)
+- `untested_regions[]` — what the oracle does **not** prove
+- For `complex` tasks: runnable `acceptance_checks` or test skeletons in the spec
+
+*Paper: §5.2.1–5.2.2 — oracle adequacy and semantic verification beyond executable feedback.*
+
+### 3. Harness-level evaluation
+
+Log per-run **trajectory metrics** in the scratchpad:
+
+- Phase durations, clarification/review cycles
+- Recovery: each `critical_issue` linked to a fix commit
+- `oracle_strength` (trivial vs full checklist, targeted re-review scope)
+
+*Paper: §5.2.1 — evaluate the harness, not only final task success.*
+
+### 4. Failure-type routing in the orchestrator
+
+Route feedback by signal type:
+
+| Signal | Action |
+|--------|--------|
+| `spec_gaps` | Tech Lead (max 1 cycle — existing) |
+| Test failure | Engineer fix mode |
+| Lint only | Engineer, narrow scope |
+| Behavior vs spec | Tech Lead, not blind Engineer patch |
+| Security/perf | Reviewer targeted re-review |
+
+*Paper: §3.4 — plan → execute → verify with feedback-driven control.*
+
+### 5. Action validation (lightweight harness boundary)
+
+Pre-flight before Engineer acts:
+
+- Edits only under `files_to_modify` / `files_to_create`
+- No edits on `base_branch`
+- Bash allowlist from `CLAUDE.md` (no destructive or secret-leaking commands)
+
+*Paper: §2.2 — code mediates intent; filter invalid actions before execution.*
+
+### 6. Human gates as durable state
+
+Scratchpad fields: `human_gates` (`prd_approved`, `spec_approved`, `merge_approved`) and `human_resolution`, recorded on escalation so later runs do not repeat the same failure.
+
+*Paper: §5.2.5; **PHILOSOPHY.md** — the gate that stays human.*
+
+### 7. Cross-task memory (optional, later)
+
+`.eng_team/learnings.json` for recurring reviewer findings, flaky areas, repo-specific patterns — opt-in, governed.
+
+*Paper: §3.2 — memory and context engineering.*
+
+### 8. Harness evolution with regression discipline
+
+Golden fixture repos + expected scratchpad phases; prompt/checklist changes only with held-out regression tasks and explicit change contracts.
+
+*Paper: §5.2.3 — self-evolving harnesses without regression.*
+
+---
+
+## What to keep (already strong)
+
+- Bottom-up, verifiable slice (code → tests → diff review)
+- Orchestrator-owned control flow; bounded loops; targeted re-review
+- Role/tool separation (Tech Lead no Edit; Reviewer judges diff not intent)
+- `/eng-team-context` as environment bootstrapping
+- Scratchpad as audit trail
+
+---
+
+## Prioritized backlog
+
+| Priority | Change | Paper reference |
+|----------|--------|-----------------|
+| **P0** | Evidence bundle + `untested_regions` on approval | §5.2.2 |
+| **P0** | Commit pins + `spec_version` on scratchpad | §4.2, §5.2.4 |
+| **P1** | Trajectory / harness metrics in every task JSON | §5.2.1 |
+| **P1** | Failure-type routing in orchestrator | §3.4 |
+| **P2** | Engineer file-scope + bash policy enforcement | §2.2 |
+| **P2** | `acceptance_checks` for `complex` specs | §2.1 |
+| **P3** | Cross-task `.eng_team/learnings.json` | §3.2 |
+| **P3** | Golden-repo harness regression tests | §5.2.3 |
+
+---
+
+## Bottom line
+
+eng-team is already a **code-as-harness** system for software engineering. The survey’s main push is to evolve from **prompt orchestration that usually works** to **harness engineering**: every approval carries proof, every phase carries versioned assumptions, and harness failures improve the system with regression discipline — without widening scope beyond the PRD → PR slice until trust is earned.
+
+---
+
+## Reference
+
+- **Paper:** [Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems](https://arxiv.org/abs/2605.18747)
+- **Related repo docs:** `PHILOSOPHY.md`, `README.md`, `.claude/commands/eng-team.md`
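
Reviewer note, not part of the patch: the failure-type routing table in upgrade 4 could be sketched roughly as below. This is a minimal illustration in Python under stated assumptions, not eng-team's implementation. The `Signal` members mirror the table rows, but the agent and mode strings, the `route` helper, and the escalation to a human once the single `spec_gaps` cycle has been used are all hypothetical.

```python
from enum import Enum


class Signal(Enum):
    """Feedback signal types, one per row of the routing table in upgrade 4."""
    SPEC_GAPS = "spec_gaps"
    TEST_FAILURE = "test_failure"
    LINT_ONLY = "lint_only"
    BEHAVIOR_VS_SPEC = "behavior_vs_spec"
    SECURITY_PERF = "security_perf"


# Hypothetical (agent, mode) labels; these strings are illustrative only
# and are not identifiers taken from eng-team.
ROUTES = {
    Signal.SPEC_GAPS: ("tech-lead", "clarify_spec"),
    Signal.TEST_FAILURE: ("engineer", "fix"),
    Signal.LINT_ONLY: ("engineer", "fix_narrow"),
    Signal.BEHAVIOR_VS_SPEC: ("tech-lead", "revise_spec"),
    Signal.SECURITY_PERF: ("reviewer", "targeted_rereview"),
}

MAX_SPEC_GAP_CYCLES = 1  # "max 1 cycle" from the table


def route(signal: Signal, spec_gap_cycles_used: int = 0) -> tuple[str, str]:
    """Return the (agent, mode) pair that should handle a feedback signal.

    Assumption: once the spec-gap budget is spent, the task is escalated
    to a human instead of looping back to the Tech Lead again.
    """
    if signal is Signal.SPEC_GAPS and spec_gap_cycles_used >= MAX_SPEC_GAP_CYCLES:
        return ("human", "escalate")
    return ROUTES[signal]
```

Keeping the mapping in a plain dictionary means a new signal type is a one-line change, and the bounded `spec_gaps` loop stays explicit in code rather than implied by prompt wording.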