diff --git a/docs/abstractions.md b/docs/abstractions.md
new file mode 100644
index 0000000..58a7a45
--- /dev/null
+++ b/docs/abstractions.md
@@ -0,0 +1,181 @@
+# Core Abstractions
+
+This document describes what the Factory *is*: the two primitives everything else derives from. For how they're implemented, see [Architecture](architecture.md).
+
+## What It Is (One Sentence)
+
+An autonomous loop that proposes code changes, measures them, and keeps or reverts, then uses its own outcome history to get better at proposing changes.
+
+## The Two Primitives
+
+### 1. The Loop
+
+```
+Hypothesize → Implement → Measure → Keep/Revert → Learn
+```
+
+Every code change is an **experiment** with a hypothesis, before/after measurement, and a binary verdict. The experiment (not a task, not a prompt) is the atomic unit. Reverts aren't failures; they're data.
+
+The loop operates at two apertures:
+
+- **Inner loop**: Constrain what can change (prompts, parameters, config). Iterate within a fixed architecture.
+- **Outer loop**: When inner loop plateaus, widen the aperture. Restructure the architecture itself.
+
+These aren't two separate systems. They're the same loop with a wider or narrower scope of what's allowed to change. The `mutable_surfaces` / `fixed_surfaces` config is the aperture dial. Plateau detection auto-widens it.
+
+### 2. The Guardrails
+
+The loop needs something to prevent it from drifting, regressing, or cheating:
+
+- **Eval**: a multi-dimensional fitness function that scores every change on a continuous 0–1 scale (not binary pass/fail).
+- **Precheck**: non-overridable gates: no score regression, no scope violation, no repeating reverted experiments, no leaking test answers.
+- **Auto-generated**: the factory discovers what to measure from the project itself. You don't write the eval; the factory introspects your project and generates one.
+
+Without guardrails, the loop is an AI coding agent that occasionally checks if tests pass. With them, the loop is self-correcting: it literally cannot merge a regression.
+
+## Where Everything Fits
+
+Every module in the repo serves one of these roles:
+
+| Role | What it does | Modules |
+|------|-------------|---------|
+| **Loop: State** | Track experiments, history, project state | `store.py`, `models.py`, `state.py`, `events.py`, `checkpoint.py` |
+| **Loop: Hypothesize** | Decide what to try next | `strategy.py` (FEEC ranking, anti-patterns, stuck detection) |
+| **Loop: Implement** | Execute the hypothesis | `agents/runner.py`, `runners/*`, `worktree.py` |
+| **Loop: Learn** | Improve from outcomes | `ace/reflector.py`, `ace/curator.py`, `ace/injector.py` |
+| **Guardrails: Measure** | Score every change | `eval/runner.py`, `eval/hygiene.py`, `eval/growth.py`, `eval/scorer.py` |
+| **Guardrails: Gate** | Block regressions | `precheck.py`, `eval/guards.py`, `research/leakage.py` |
+| **Guardrails: Discover** | Auto-generate evals | `discovery/introspect.py`, `discovery/profile.py`, `discovery/generate.py` |
+| **Orchestration** | Wire the loop together | `cli.py`, `ceo_completion.py`, `user_config.py`, `agents/prompts/*.md` |
+| **Cross-project** | Transfer learning across projects | `registry.py`, `insights.py` |
+| **Visualization** | Render loop state (optional) | `dashboard/*`, `visualizer/*`, `obsidian/*`, `miro/*` |
+| **Integration** | Connect to external systems (optional) | `notify/*`, `issue.py`, `review.py`, `mcp_server.py` |
+
+## Module Map
+
+### Core: The Loop
+
+| Module | Lines | What it does | Key implementation choice |
+|--------|-------|-------------|--------------------------|
+| `store.py` | 633 | Experiment lifecycle: `begin()` → `save_eval()` → `finalize()`. Append-only history. | Flat files (TSV + JSON per experiment), not a database. File-lock for concurrency. |
+| `models.py` | 546 | Every data type: `ExperimentRecord`, `FactoryConfig`, `CompositeScore`. | Strict Pydantic v2 with `extra="forbid"`. Types are the contract between stages. |
+| `strategy.py` | 405 | FEEC ranking, stuck detection, anti-pattern similarity, plateau detection. | Keyword-based categorization. Jaccard similarity for anti-patterns. Could be learned. |
+| `state.py` | 88 | Detect project state: 5-state FSM driving which mode the loop enters. | Simple git + file existence checks. |
+| `events.py` | 92 | Append-only JSONL event log. The loop's heartbeat. | One-liner writes. No schema enforcement. |
+| `worktree.py` | 167 | Git worktree per experiment for branch isolation. | Each experiment gets its own working copy. Prevents cross-experiment interference. |
+| `checkpoint.py` | 88 | Save/restore loop state for crash recovery. | JSON dump of completed/pending phases. |
+
+### Core: The Guardrails
+
+| Module | Lines | What it does | Key implementation choice |
+|--------|-------|-------------|--------------------------|
+| `eval/runner.py` | 328 | Orchestrate 3-tier eval: hygiene + growth + project. Normalize weights. | Subprocess execution. This is the fitness function. |
+| `eval/hygiene.py` | 538 | 6 mandatory dimensions: tests, lint, types, coverage, guards, config. | Auto-detects tooling (pytest/jest, ruff/eslint, mypy/tsc). Runs as subprocesses. |
+| `eval/growth.py` | 511 | 5 mandatory dimensions: capability surface, diversity, observability, research grounding, effectiveness. | AST-based code analysis. Reads experiment history. No LLM needed. |
+| `eval/scorer.py` | 46 | Weighted sum → composite score. | `total = sum(r.score * r.weight)`. Trivial math; the power is in what feeds it. |
+| `eval/guards.py` | 215 | Scope guards: did the change touch files outside declared scope? | Git diff + fnmatch. Hard fail on violation. |
+| `precheck.py` | 412 | Non-overridable gate: 7 independent checks, ANY failure = mandatory revert. | CEO cannot override. This is the integrity guarantee. |
+| `discovery/introspect.py` | 305 | Auto-detect project language, framework, tooling. | File-based heuristics (pyproject.toml → Python). No LLM. |
+| `discovery/profile.py` | 149 | Introspection → EvalProfile (list of eval dimensions). | Rule-based mapping. |
+| `discovery/generate.py` | 208 | EvalProfile → `eval/score.py` script. | Template-based code generation. |
+| `research/leakage.py` | 408 | Ground truth leakage detection (research mode). | Token fingerprinting. Scans for direct values AND negation hints. |
+
+### Core: Learning (ACE)
+
+| Module | Lines | What it does | Key implementation choice |
+|--------|-------|-------------|--------------------------|
+| `ace/reflector.py` | 877 | Parse experiment outcomes → candidate playbook bullets. | Deterministic pattern extraction. No LLM. Correlates categories with keep/revert rates. |
+| `ace/curator.py` | 149 | Merge bullets into playbooks, update helpful/harmful counters. | Fuzzy match existing bullets. Prune high-harmful rules. |
+| `ace/injector.py` | 45 | Prepend playbook to agent prompts at invocation. | String concatenation. Simple but critical: how learning reaches agents. |
+| `ace/models.py` | 126 | `Playbook`, `PlaybookItem` with counters. | YAML frontmatter + markdown. Human-readable. |
+| `ace/paths.py` | 54 | Playbook file locations. | `~/.factory/playbooks/.md`: user-local, not per-project. Cross-project learning. |
+
+### Core: Execution Engine
+
+| Module | Lines | What it does | Key implementation choice |
+|--------|-------|-------------|--------------------------|
+| `agents/runner.py` | 313 | Resolve prompt → spawn subprocess → capture output → save review. | Two-tier prompt lookup (project override → factory default) + ACE playbook injection. |
+| `agents/prompts/ceo.md` | 2868 | The loop's full protocol: state machine, review gates, sacred rules. | The loop IS this prompt. 2900 lines of structured protocol in markdown. |
+| `agents/prompts/*.md` | ~1600 | 10 specialist agent instructions. | Each agent has a focused role. Surprisingly small individual prompts. |
+| `runners/protocol.py` | 71 | Runner interface: `headless()` method. | One method: prompt+task in, stdout+exitcode out. |
+| `runners/claude.py` | 125 | Claude Code backend (default). | Spawns `claude` CLI as subprocess. |
+| `runners/bob.py` | 382 | Bob Shell backend. | Alternative runner. Has token ceiling guardrails. |
+| `runners/codex.py` | 183 | OpenAI Codex backend. | `codex exec` with workspace-write sandbox. |
+| `cli.py` | 4006 | All CLI commands, wizard, CEO task building. | Monolithic. `_build_ceo_task()` is where config becomes a CEO prompt. |
+| `user_config.py` | 276 | `~/.factory/config.toml` with 5-tier precedence. | CLI > env > profile > config.toml > default. |
+
+### Core: Cross-Project Intelligence
+
+| Module | Lines | What it does | Key implementation choice |
+|--------|-------|-------------|--------------------------|
+| `registry.py` | 150 | Global project registry. | Self-registration on first `begin()`. Lives at `~/.factory/registry.json`. |
+| `insights.py` | 336 | Aggregate outcomes across all projects. | Reads all registries. Computes category stats, winning/losing patterns. |
+
+### Optional: Observation & Analysis
+
+| Module | Lines | What it does |
+|--------|-------|-------------|
+| `study.py` | 1097 | Codebase analysis: observability coverage, module stats, backlog synthesis. |
+| `analysis.py` | 244 | Experiment comparison (`factory diff`) and FEEC analysis (`factory explain`). |
+| `summary.py` | 227 | End-of-cycle session summary. |
+| `report.py` | 186 | Performance report generation for ACE consumption. |
+| `research/runner.py` | 390 | Research run infrastructure: execute commands, parse results, manage artifacts. |
+| `ceo_completion.py` | 520 | Auto-resume when CEO exits prematurely. |
+| `profile.py` | 232 | User profiling from experiment history (opt-in). |
+| `discovery/eval_spec.py` | 217 | Auto-generate eval_spec from project profile. |
+
+### Optional: Visualization & Output
+
+| Module | Lines | What it does |
+|--------|-------|-------------|
+| `dashboard/app.py` | 836 | FastAPI web dashboard with SSE event streaming. |
+| `visualizer/state.py` | 359 | Infer live pipeline state from event log. |
+| `obsidian/notes.py` | 570 | Write experiment notes to Obsidian vault. |
+| `obsidian/templates.py` | 73 | Obsidian note templates. |
+| `digest.py` | 244 | Vault activity summarizer. |
+
+### Optional: Integrations
+
+| Module | Lines | What it does |
+|--------|-------|-------------|
+| `review.py` | 144 | Post structured reviews on GitHub PRs. |
+| `issue.py` | 188 | Fetch GitHub/GitLab issues as build specs. |
+| `notify/telegram.py` | 90 | Telegram digest notifications. |
+| `mcp_server.py` | 188 | Expose factory as MCP tools for other Claude Code sessions. |
+| `agents/plugin.py` | 111 | Generate Claude Code subagent files. |
+| `messages.py` | 129 | User → CEO message queue. |
+| `refine_state.py` | 130 | Post-cycle refinement tracking. |
+| `backfill_archive.py` | 164 | Backfill missing archive notes. |
+| `clean_pr.py` | 213 | Strip non-essential artifacts from PRs. |
+| `runners/usage.py` | 166 | Bob Shell token ceiling tracking. |
+| `runners/_stream.py` | 99 | Output stream processing. |
+
+## By the Numbers
+
+```
+Total:    57 modules, ~19,300 lines Python, ~4,500 lines prompts
+
+Core:     ~10,500 lines (54%) - the loop, guardrails, learning, execution
+Optional: ~8,800 lines (46%) - visualization, integrations, observation, utilities
+```
+
+The core is surprisingly small. The loop, the eval, the precheck, and the agent execution engine (the things that make the system work) are about half the codebase. The rest is reach.
+
+## Key Implementation Choices (Debatable)
+
+These are decisions that could go differently. The abstractions (loop + guardrails) are stable; these are where experimentation makes sense.
+
+| Choice | Current implementation | Alternatives worth exploring |
+|--------|----------------------|------------------------------|
+| **How the loop hypothesizes** | LLM agent (Strategist) + keyword FEEC ranking | Bandit algorithms, learned priority from outcome data, retrieval-augmented |
+| **How guardrails are generated** | File-based heuristics (detect pyproject.toml → pytest) | LLM-generated evals, user-guided, learned dimension importance |
+| **The fitness function** | 11 fixed dimensions, configurable weights | Fully user-defined, adaptive weights, learned dimension importance |
+| **Agent decomposition** | 8 named specialists with static roles | Fewer (3: plan/build/verify), more (dynamic), or none (single agent) |
+| **Learning mechanism** | Text-rule playbooks (DO/DON'T bullets) | Embedding retrieval over past experiments, fine-tuning, in-context examples |
+| **Runner abstraction** | Subprocess spawning of CLI tools | Direct API calls, SDK integration, self-hosted models |
+| **The CEO protocol** | 2900-line markdown prompt with embedded state machine | Shorter prompt + code-side orchestration, compiled state machine |
+| **Aperture control** | Manual surfaces + auto-plateau-detection | Fully adaptive (start narrow, auto-widen as confidence grows) |
+| **Loop memory** | Flat files (TSV + JSON in `.factory/`) | SQLite, experiment tracker (MLflow/W&B), vector store |
+| **Anti-pattern detection** | Jaccard similarity on hypothesis text | Semantic similarity (embeddings), structural diff comparison |
+
+The biggest architectural bet: **the CEO prompt IS the loop.** The entire state machine, review protocol, and decision logic lives in a 2900-line markdown file fed to an LLM. The Python layer is pure tools; it doesn't make decisions. This is either the system's greatest strength (easy to evolve, runner-independent, self-documenting) or its greatest liability (fragile, context-window-dependent, hard to test deterministically).
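
To make the "pure tools" side concrete, here is a minimal sketch of three mechanisms the document names: a weighted composite score with normalized weights, Jaccard similarity on hypothesis text, and a keep/revert verdict. It is an illustration under assumptions, not the repo's code. `DimensionResult`, `composite_score`, `jaccard`, `is_repeat`, `verdict`, the whitespace tokenization, and the `0.6` threshold are all hypothetical. Treating an unchanged score as "keep" is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionResult:
    """One eval dimension (hypothetical shape, not the repo's model)."""
    name: str
    score: float   # continuous, 0.0 to 1.0
    weight: float  # relative importance, non-negative


def composite_score(results: list[DimensionResult]) -> float:
    """Normalize weights so they sum to 1, then take the weighted sum."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    return sum(r.score * (r.weight / total_weight) for r in results)


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumption: two empty strings count as dissimilar
    return len(tokens_a & tokens_b) / len(union)


def is_repeat(hypothesis: str, reverted: list[str], threshold: float = 0.6) -> bool:
    """True if the hypothesis closely matches any previously reverted one."""
    return any(jaccard(hypothesis, old) >= threshold for old in reverted)


def verdict(before: float, after: float, hypothesis: str, reverted: list[str]) -> str:
    """Binary verdict: revert on a repeated hypothesis or a score regression."""
    if is_repeat(hypothesis, reverted):
        return "revert"
    return "keep" if after >= before else "revert"


if __name__ == "__main__":
    results = [
        DimensionResult("tests", 1.0, 2.0),
        DimensionResult("lint", 0.5, 2.0),
    ]
    print(composite_score(results))
    print(verdict(0.70, 0.75, "tune retry backoff", ["add cache to parser module"]))
```

Every function here is deterministic, so this layer can be unit-tested even when the prompt-driven decision logic cannot.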