129 changes: 129 additions & 0 deletions .claude/skills/ai-feature/SKILL.md
@@ -0,0 +1,129 @@
---
name: ai-feature
description: Use when the user adds, changes, or refactors an LLM agent under src/agents/ — or anything that goes through Foundation Model API or an MLflow-traced LLM call. Mandatory under CNS §3.5 and .cursor/12-ai-feature-lifecycle.mdc. Walks the SPEC → dataset → eval-harness → impl → re-eval sequence.
---

# OntoBricks ai-feature

Triggered automatically for any change under `src/agents/**`, `src/back/core/agents/**`, or an MCP tool that wraps an agent. The companion gate is `.cursor/12-ai-feature-lifecycle.mdc`. **This skill is the path of least resistance to passing that gate.** Skip this skill, and the CI gate (G2 — `.github/workflows/eval-gate.yml`) will reject the PR.

## Why this exists

OntoBricks ships **5 production agents** (`agent_owl_generator`, `agent_ontology_assistant`, `agent_auto_assignment`, `agent_auto_icon_assign`, `agent_dtwin_chat`) with **zero eval coverage today**. A prompt regression or a tool-handler bug ships green because the existing tests only mock the LLM. This skill enforces a SPEC-first + eval-gated lifecycle so quality is observable.

## Procedure

7 steps. The first 4 happen **before** code changes (the gate is on the artifact, not the runtime). Steps 5–7 are the implementation loop.

### 1. Brainstorm (≤ 10 min)

Invoke `superpowers:brainstorming`. Surface:

- **Purpose** in one sentence.
- **Target users**: who calls this agent? (LLM client via MCP? Internal UI? Another agent?)
- **In/out shape**: input format, output format.
- **Success criteria**: what does "right" look like in three examples?
- **Failure modes**: what's the worst output? How would a user notice?

Capture decisions at the top of `.planning/<slug>/PLAN.md`.

### 2. Fill SPEC.md

Copy `.claude/skills/ai-feature/SPEC.template.md` to `.planning/<slug>/SPEC.md`. Fill every section. **Do not skip the eval-dimensions table** — that's what the CI gate parses.

Required fields:

- `agent_name`, `module_path` (e.g., `src/agents/agent_fact_checker/`).
- `model_endpoint` — the Databricks Foundation Model API endpoint name.
- `tools[]` — tool name, JSON schema, intended LLM use.
- `eval_dimensions[]` — name, metric, threshold, weight.
- `failure_modes[]` — symptom, detection (which judge / trace tag), mitigation.

### 3. Build the eval dataset

Output: `.planning/<slug>/eval/dataset.jsonl` AND a mirror at `tests/eval/datasets/<agent_name>/baseline.jsonl`.

Minimum sizes:

| Change type | Min examples |
|---|---|
| New agent | 20 (15 hand-curated + 5 synthetic) |
| Material change (prompt, tool surface) | 10 |
| Hotfix / regression test | 3 (the failing cases) → `regression.jsonl` |

Row shape (JSON-Lines, one example per line):

```json
{"input": {...}, "expected": {"contains": [...], "schema": {...}, "constraints": [...]}, "tags": ["happy" | "ambiguous" | "adversarial"]}
```

Sources:

- **Hand-curated**: from product team intuition, real user requests, existing screenshots.
- **Synthetic**: use the `databricks-synthetic-data-generation` skill to bootstrap. Tag as `tags: ["synthetic"]` so reviewers know.
- **Regression**: every production-trace failure that you fix gets added here. Never deleted.
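
The row shape and minimum sizes above can be checked mechanically before the dataset is committed. The following is a minimal sketch, not part of the repo: `validate_rows`, `ALLOWED_TAGS`, and `MIN_EXAMPLES` are hypothetical names, and the tag set assumes the three tags from the row shape plus `synthetic`.

```python
import json

# Assumption: the allowed tags are the three from the row shape plus "synthetic".
ALLOWED_TAGS = {"happy", "ambiguous", "adversarial", "synthetic"}
# Minimum sizes taken from the table above; the keys are hypothetical labels.
MIN_EXAMPLES = {"new_agent": 20, "material_change": 10, "hotfix": 3}


def validate_rows(lines, change_type):
    """Return a list of problems found in JSON-Lines dataset rows."""
    problems = []
    rows = [line for line in lines if line.strip()]  # ignore blank lines
    for number, line in enumerate(rows, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"row {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"row {number}: expected a JSON object")
            continue
        for key in ("input", "expected", "tags"):
            if key not in row:
                problems.append(f"row {number}: missing '{key}'")
        tags = row.get("tags", [])
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            problems.append(f"row {number}: 'tags' must be a list of strings")
        else:
            unknown = set(tags) - ALLOWED_TAGS
            if unknown:
                problems.append(f"row {number}: unknown tags {sorted(unknown)}")
    minimum = MIN_EXAMPLES[change_type]
    if len(rows) < minimum:
        problems.append(
            f"only {len(rows)} rows; {change_type} needs at least {minimum}"
        )
    return problems
```

A caller would typically pass `open(path, encoding="utf-8")` as `lines` and fail fast if the returned list is non-empty.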

### 4. Wire the eval harness

Output: `tests/eval/run_<agent_name>.py`.

Pattern:

```python
import mlflow
from databricks_mlflow_evaluation import evaluate
# ...
result = evaluate(
    data="tests/eval/datasets/<agent_name>/baseline.jsonl",
    model=<your agent invocation callable>,
    judges=[...]  # see tests/eval/judges/
)
mlflow.log_metric("judge_score", result.aggregate_score)
```

Run a **baseline** before any code change. Record the MLflow run URI in the PR body (template `.github/PULL_REQUEST_TEMPLATE.md` has a slot).

### 5. Plan and implement

Invoke `superpowers:writing-plans` referencing SPEC.md. The plan lives at `.planning/<slug>/PLAN.md` and lists:

- Files to add / change under `src/agents/<agent_name>/` (follow `src/agents/engine_base.py` loop pattern).
- Tool definitions (`TOOL_DEFINITIONS`) and handlers (`TOOL_HANDLERS`).
- Tracing wiring: `@trace_agent`, `@trace_llm`, `@trace_tool` on every code path (`src/agents/tracing.py`).
- Unit tests with `httpx.MockTransport` (see `tests/fixtures/http.py`).

Run `superpowers:test-driven-development`. Red → Green → Refactor.

### 6. Re-run the eval

After the implementation lands locally, re-run `tests/eval/run_<agent_name>.py` against the same dataset. **Two outcomes:**

- **Judge score ≥ baseline + delta:** great, push the PR with the new MLflow run URI.
- **Judge score < threshold:** iterate. Look at failed examples (the judge writes per-example pass/fail to MLflow). If the failure is the eval being wrong, edit the dataset and document why in `.planning/<slug>/PLAN.md`.

For borderline cases, invoke `superpowers:requesting-code-review`; the reviewer then records their judgment as a waiver comment in the PR.
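
The decision above can be sketched as a small comparison function. This is an illustration under stated assumptions, not the gate's actual logic: `classify_eval_outcome` is a hypothetical name, the pass target is assumed to be the higher of the SPEC threshold and `baseline + delta`, and the `margin` that defines "borderline" is invented for the example.

```python
def classify_eval_outcome(baseline, candidate, threshold, delta=0.0, margin=0.02):
    """Classify a re-run as 'pass', 'borderline', or 'iterate'.

    Assumptions (not from the source): the target is the stricter of the
    SPEC threshold and baseline + delta, and a score within `margin`
    below the target counts as borderline.
    """
    target = max(threshold, baseline + delta)
    if candidate >= target:
        return "pass"
    if candidate >= target - margin:
        return "borderline"
    return "iterate"
```

For example, `classify_eval_outcome(0.80, 0.86, threshold=0.85)` returns `"pass"`, while a candidate score of `0.70` against the same inputs returns `"iterate"`.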

### 7. Ship

- `superpowers:verification-before-completion` — tests + eval green.
- `code-review` (project skill).
- `changelog` (project skill).
- Conventional Commit: `feat(agents): add agent_<name>` or `fix(agents): tune <agent> threshold`.
- PR; CI G1 + G2 both pass.

## What this skill **does not** do

- It doesn't deploy the agent. Use `deploy` once the PR merges.
- It doesn't run the LLM in CI. CI runs only the unit tests + the eval comparison against the **committed** dataset; the LLM call happens locally or against a configured serving endpoint, with the result captured in MLflow.
- It doesn't replace `superpowers:brainstorming` or `superpowers:writing-plans`. It **sequences** them.

## Cross-references

- `.cursor/12-ai-feature-lifecycle.mdc` — the rule that gates this work.
- §3.5 of `/Users/dermot.smyth/.claude/plans/ultrathink-perform-a-detailed-whimsical-token.md` — methodology context.
- `src/agents/engine_base.py` — the runtime pattern to follow.
- `tests/fixtures/http.py` — `agent_mock_transport` for unit tests.
- `tests/fixtures/mlflow.py` — `captured_traces` for span-shape assertions.
- `databricks-mlflow-evaluation` (plugin skill) — the harness.
- `databricks-synthetic-data-generation` (plugin skill) — dataset cold-start.
- `agent-evaluation` (plugin skill) — umbrella for the eval flow.
125 changes: 125 additions & 0 deletions .claude/skills/ai-feature/SPEC.template.md
@@ -0,0 +1,125 @@
# SPEC: <agent-name>

> Copy this template to `.planning/<slug>/SPEC.md` and fill every section.
> Required by `.cursor/12-ai-feature-lifecycle.mdc` (CI gate `.github/workflows/eval-gate.yml` checks for presence + non-empty `eval_dimensions`).
> Use one SPEC per agent. Material changes to an existing agent update the existing SPEC in-place; CI inspects the diff for an eval-dimensions update.

---

## 1. Purpose

<!-- One paragraph: what does this agent do, for whom, and why now? Avoid jargon. -->

## 2. Identity

| Field | Value |
|---|---|
| `agent_name` | `agent_<snake_case_name>` |
| `module_path` | `src/agents/<agent_name>/` |
| `model_endpoint` | `<Databricks Foundation Model API endpoint, e.g., databricks-meta-llama-3-3-70b>` |
| `temperature` | `0.0` (eval-deterministic; production may use higher) |
| `mlflow_experiment` | `/Shared/ontobricks/agents/<agent_name>` |

## 3. Tool surface

| Tool name | Input schema | Output type | Purpose |
|---|---|---|---|
| `<tool_name_1>` | `{"x": "string", "y": "int"}` | `dict` | <one-liner> |
| `<tool_name_2>` | … | … | … |

For each tool, paste the JSON schema below.

<details>
<summary><code>&lt;tool_name_1&gt;</code> schema</summary>

```json
{
  "type": "object",
  "properties": {
    "x": {"type": "string"},
    "y": {"type": "integer"}
  },
  "required": ["x"]
}
```
</details>

## 4. Success criteria

Three concrete examples (input → expected output shape) that an LLM consumer should see succeed:

1. **<scenario name>**
   - input: `…`
   - expected: `…`
2. **<scenario name>**
   - input: `…`
   - expected: `…`
3. **<scenario name>**
   - input: `…`
   - expected: `…`

## 5. Eval dimensions

The CI gate parses this table — keep it well-formed.

| Dimension | Metric | Threshold | Weight | Judge |
|---|---|---|---|---|
| `correctness` | <metric, e.g., schema_valid, exact_match, top-K accuracy> | `0.90` | `0.40` | <judge name from `tests/eval/judges/`> |
| `faithfulness` | <metric> | `0.85` | `0.25` | <judge> |
| `latency_p95` | seconds | `<= 5.0` | `0.15` | wall-clock |
| `cost_per_call` | USD | `<= 0.01` | `0.10` | MLflow usage record |
| `tool_selection` | exact-match on first tool called | `0.95` | `0.10` | rule-based |

**Aggregate threshold:** weighted sum ≥ <e.g., 0.85> to pass G2.
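
As a worked illustration of the weighted sum, the sketch below uses the example weights from the table. It assumes every dimension is first normalized to a score in `[0, 1]` (for instance, `latency_p95` and `cost_per_call` scoring `1.0` when within their limit and `0.0` otherwise); that normalization and the `aggregate_score` helper are assumptions for the example, not the gate's documented behavior.

```python
# Example weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "faithfulness": 0.25,
    "latency_p95": 0.15,
    "cost_per_call": 0.10,
    "tool_selection": 0.10,
}


def aggregate_score(scores, weights=WEIGHTS):
    """Return the weighted sum of per-dimension scores in [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


example = {
    "correctness": 0.92,
    "faithfulness": 0.88,
    "latency_p95": 1.0,     # assumed: within the 5.0 s limit
    "cost_per_call": 1.0,   # assumed: within the 0.01 USD limit
    "tool_selection": 0.96,
}
passes_g2 = aggregate_score(example) >= 0.85  # 0.85 is the example threshold
```

With these inputs the sum is 0.368 + 0.22 + 0.15 + 0.10 + 0.096 = 0.934, which clears an aggregate threshold of 0.85.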

## 6. Failure modes

For each known failure mode, declare how it's detected and how it's mitigated.

| Symptom | Detection | Mitigation |
|---|---|---|
| <e.g., hallucinated URI> | Judge `faithfulness` < 0.6 on any example | Stricter system prompt; retry with smaller context |
| <e.g., wrong tool chosen> | `tool_selection` < 0.9 over 10-call window | Add example to `regression.jsonl`; tune tool descriptions |
| <e.g., latency spike> | P95 > 8s | Cache common queries; lower max_tokens |
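
A windowed detection rule such as "`tool_selection` < 0.9 over 10-call window" can be sketched with a bounded queue. `ToolSelectionMonitor` is a hypothetical helper, and the choice to stay silent until the window is full is an assumption made for the example.

```python
from collections import deque


class ToolSelectionMonitor:
    """Track first-tool accuracy over the most recent `window` calls."""

    def __init__(self, window=10, min_rate=0.9):
        self.window = window
        self.min_rate = min_rate
        self._hits = deque(maxlen=window)  # oldest result drops off automatically

    def record(self, expected_tool, called_tool):
        """Record one call; return True when the failure mode is detected."""
        self._hits.append(expected_tool == called_tool)
        if len(self._hits) < self.window:
            return False  # assumption: do not judge a partially filled window
        rate = sum(self._hits) / self.window
        return rate < self.min_rate
```

Each detection would then follow the mitigation column: add the offending example to `regression.jsonl` and revisit the tool descriptions.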

## 7. Eval dataset

- **Baseline file:** `tests/eval/datasets/<agent_name>/baseline.jsonl` — ≥ 20 examples for new agents, ≥ 10 for changes.
- **Synthetic file:** `tests/eval/datasets/<agent_name>/synthetic.jsonl` — generated via `databricks-synthetic-data-generation`. Tag examples with `tags: ["synthetic"]`.
- **Regression file:** `tests/eval/datasets/<agent_name>/regression.jsonl` — every production failure we fix lands here. Never retired.

Dataset row shape:

```json
{"input": {...}, "expected": {"contains": [...], "schema": {...}, "constraints": [...]}, "tags": ["happy" | "ambiguous" | "adversarial" | "synthetic"]}
```

## 8. MLflow tracing

Every code path that calls Foundation Model API must be decorated:

```python
from agents.tracing import trace_agent, trace_llm, trace_tool

@trace_agent
def run(...): ...

@trace_llm
def _call_model(...): ...

@trace_tool
def my_tool_handler(...): ...
```

Trace assertions in unit tests use the `captured_traces` fixture (`tests/fixtures/mlflow.py`).

## 9. Plan reference

Implementation plan: `.planning/<slug>/PLAN.md` (produced by `superpowers:writing-plans`).

## 10. Sign-off

- [ ] Author has filled every section.
- [ ] Baseline eval run URI pasted into PR body.
- [ ] Aggregate score ≥ the threshold declared in §5.
- [ ] Reviewer waiver (if applicable): _____
98 changes: 98 additions & 0 deletions .claude/worktrees/README.md
@@ -0,0 +1,98 @@
# Worktree convention — multi-agent coordination

Under CNS (§3.7), worktrees are the escape hatch when Cursor can't span what you need. This README is the protocol so two agents (human or LLM) working the same day don't collide.

## Why a worktree?

Open one only when:

1. Mid-T3/T4/T5 and a P0 production bug lands (T6). Stash is risky because your IDE chat context is tied to current state.
2. Running a long eval / DAB deploy / Hypothesis sweep and you want to keep coding.
3. `superpowers:dispatching-parallel-agents` is going to operate on a clean state.

If the change is <30 min and you can stash safely, **don't open a worktree.** Tax > benefit.

## Naming

```
.claude/worktrees/<slug>-<8-char-hash>/
```

- `slug` matches the `.planning/<slug>/` directory name (the issue title, lowercased and hyphenated).
- `hash` is the 8-character abbreviated hash of the tip of main: `git log -1 --abbrev=8 --format=%h` (plain `%h` usually yields only 7 characters).

Examples:
- `.claude/worktrees/digitaltwin-split-p1-1db8647c/`
- `.claude/worktrees/icon-bug-3f9c2a1e/`

The branch inside the worktree has the same name. So `git worktree list` reads naturally.
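
The naming rule can be sketched as a small helper. `slugify` and `worktree_path` are hypothetical names, and treating every non-alphanumeric run as a hyphen is an assumption about what "lowercased and hyphenated" means.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def worktree_path(title, commit_hash):
    """Build the worktree directory from an issue title and a commit hash."""
    if not re.fullmatch(r"[0-9a-f]{8,40}", commit_hash):
        raise ValueError("expected a hex commit hash of at least 8 characters")
    return f".claude/worktrees/{slugify(title)}-{commit_hash[:8]}/"
```

For instance, `worktree_path("Icon bug", "3f9c2a1e")` produces the second example path listed above.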

## Mechanics

```bash
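# Create from a clean main
MAIN_CHECKOUT="<path-to-main-checkout>"  # placeholder: path to your main checkout
HASH=$(git -C "$MAIN_CHECKOUT" log -1 --abbrev=8 --format=%h main)
SLUG="my-feature"
# Branch from main explicitly so the worktree does not inherit the current HEAD.
git worktree add -b "${SLUG}-${HASH}" ".claude/worktrees/${SLUG}-${HASH}" main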

# Open in Cursor: open a second window at the worktree path.
# OR open Claude Code in the worktree terminal — preferred for agent-driven parallel work.

# Cleanup after merge
git worktree remove ".claude/worktrees/${SLUG}-${HASH}"
git branch -d "${SLUG}-${HASH}"
```

## `.planning/` is per-worktree

Each worktree has its own `.planning/<slug>/PLAN.md` (and SPEC.md for agent features). PLAN.md is the resumption substrate — re-open the worktree, `cat .planning/<slug>/PLAN.md`, pick up where the checkboxes left off.

Don't share PLAN.md across worktrees. If two slugs need to share decisions, lift them into `.planning/ROADMAP.md` instead.

## `changelogs/` is **shared**

Both worktrees write into the same `changelogs/<YYYY-MM-DD>.log`. To avoid stomping each other on the same day:

**Convention:** every section header includes a worktree suffix when there's parallel work.

```markdown
## SHACL severity filter (worktree shacl-severity-a1b2c3d4)
...

## Icon bug fix (worktree icon-bug-3f9c2a1e)
...
```

Two agents writing the same day → two `##` headings, both unique. The merge is naturally a string-append, no conflict.

If two headers do collide, CI dedupes them: the M3.P2 changelog-presence gate compares the diff against `changelogs/` rather than requiring full equality.
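
The header convention can be applied by a small append-only helper. This is a sketch, not an existing project tool: `append_changelog_section` is a hypothetical name, and refusing to write a duplicate header is an assumption layered on top of the convention.

```python
from datetime import date
from pathlib import Path


def append_changelog_section(changelog_dir, title, worktree, body, today=None):
    """Append a '## <title> (worktree <name>)' section to today's changelog."""
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    path = Path(changelog_dir) / f"{day}.log"
    header = f"## {title} (worktree {worktree})"
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if header in existing.splitlines():
        # Assumption: a repeated header signals a mistake, so fail loudly.
        raise ValueError(f"section already present: {header}")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:  # append, never rewrite
        handle.write(f"{header}\n{body.rstrip()}\n\n")
    return path
```

Because each worktree only ever appends a uniquely suffixed section, two agents calling this on the same day leave two distinct `##` headings in the file.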

## Two harnesses in two worktrees

| Setup | When | Notes |
|---|---|---|
| Cursor in main + Cursor in worktree | Two humans, one repo, one day | OK. Two windows, two contexts. |
| Cursor in main + Claude Code in worktree | One human, parallel automated work | **Best for T5 refactors** — Claude Code runs the parallel-agent sweep while you keep coding in main. |
| Claude Code in main + Cursor in worktree | One human, agent doing long-running work | OK but unusual. Usually flip the pairing above. |
| Cursor + Cursor in the same checkout | Never | **Don't.** Each Cursor window has its own context, so they'll race on writes. |
| Claude Code + Claude Code in the same checkout | Never | **Don't.** Same reason. Spawn parallel **subagents** instead via `superpowers:dispatching-parallel-agents`. |

## Disallowed

- **Never edit `src/` outside the active worktree** during a phase execution. Use the `freeze` skill (gstack) to make it audible.
- **Never invoke `gsd-*` skills.** CNS dropped them in v2 (§3.0); pre-commit hook blocks references.
- **Never push `.claude/worktrees/<slug>-<hash>/` to remote.** Worktrees are local. The branch is what gets pushed.

## Cleanup checklist

When a worktree's branch merges to main:

1. `git worktree remove .claude/worktrees/<slug>-<hash>` (releases the working tree).
2. `git branch -d <slug>-<hash>` (cleans up the local branch).
3. Update `.planning/ROADMAP.md`: mark the Issue `[x]` with a landing date.
4. Optionally archive `.planning/<slug>/` to `.planning/done/<slug>/` if it had useful research notes; otherwise delete.

## See also

- §3.7 of the methodology plan (`/Users/dermot.smyth/.claude/plans/ultrathink-perform-a-detailed-whimsical-token.md`).
- `superpowers:using-git-worktrees` — the canonical skill.