Status: 🟢 Implemented · Owner: Core team · Depends on: 04, 05, 06, 07, 08, 24
The bootstrap runner is wrapped behind the EnvelopeProducer
trait (spec 24 / ADR-010) by
[crates/cortex-cli/src/bootstrap/producer.rs]. One
producer_checkpoints row lands per produce invocation with
scope repo_id (lowercase) and cursor token = the final
last_file path. BootstrapProducer::resume_from reads the
latest row from SQLite; future bootstrap callers can pass the
returned cursor straight into run_repo_with_dedup's
last_file parameter so multi-repo state accumulates across
invocations without the legacy .cortex-bootstrap.state.json
singleton being load-bearing.
Ship a one-shot + incremental CLI that walks an existing Hive repo (or a list of repos), synthesizes events for its source code, docs, specs, git history, and rule/law files, and drives them through the live processing pipeline (cortex.events.bootstrap stream → classifier → embedder → graph writer → full-text indexer). This is what gives the very first pre_change_context query something to retrieve.
In:
cortex-bootstrapcrate: Rust binary, Clap CLI.- Per-repo configuration (
cortex.toml) with sensible defaults. - Source discovery: git metadata, file walker, doc parser, ADR recognizer.
- Synthetic event emitters (envelope matches spec 01; kind is one of the new
artifact.*,turn.historical,decision.imported,law.imported,memory.imported). - Publication to
cortex.events.bootstrapSynap stream. - Progress reporting (ETA, per-repo counters).
- Resume support via a checkpoint file.
--dry-run --estimatemode (no writes; prints sizing).--only <repo[,repo,...]>/--skip <repo[,...]>filters.
Out:
- The processing pipeline itself (specs 05–08).
- Filesystem watcher / git-hook tailing (separate Phase-2 spec).
- Webhook receiver (Phase 2).
- CI-agent integration (Phase 3).
cortex-bootstrap [OPTIONS] <REPO_ROOT>...
OPTIONS:
--config <file> Global config override (default: ./cortex-bootstrap.toml)
--only <name[,name]> Include only listed repos
--skip <name[,name]> Exclude listed repos
--since <git-ref> Only re-index changes since this ref (incremental mode)
--dry-run No writes; print plan
--estimate Implies --dry-run; print sizing (files, chunks, bytes, est. cost)
--resume Resume from the last checkpoint
--workspace <file> TOML file enumerating multiple repos (phase4b)
--force Re-run repos whose checkpoint reports `done` (phase4b)
--parallelism <N> Number of concurrent repo walkers (default: 4)
--synap-endpoint <url> Override Synap connection
--stream <name> Destination Synap stream (default: cortex.events.bootstrap)
--checkpoint <file> Checkpoint file path (default: .cortex-bootstrap.state.json)
--log-format <json|text> Structured logs
--verbose Debug logging
[cortex]
id = "Vectorizer" # override repo name (default: basename)
[cortex.exclude]
paths = ["target/", "node_modules/", "dist/", "tmp/"]
extensions = ["lock", "log", "pack", "png", "jpg", "pdf"]
[cortex.chunking]
code_strategy = "symbol" # symbol | window | auto (default: auto)
doc_strategy = "section" # section | window (default: section)
[cortex.redaction]
extra_patterns = [
{ name = "internal_token", regex = "HIVE_TOKEN_[A-Z0-9]{24}" }
]
[cortex.git]
include_commits = true
include_prs = true # requires `gh` CLI and auth
since = "2019-01-01" # ignore ancient commits (default: all)
[cortex.decisions]
promote_patterns = [
"docs/decisions/*.md",
"openspec/**/proposal.md",
"ADR-*.md"
]
[cortex.laws]
promote_patterns = [
"rulebook/laws/*.yaml",
".cursor/rules/*.md"
]
[cortex.analyses]
promote_patterns = [
"docs/analysis/**/*.md",
"docs/analyses/**/*.md"
]
[cortex.memories]
import_files = ["CLAUDE.md", "AGENTS.md", ".rulebook/memory/**/*.md"]Files matching [cortex.analyses].promote_patterns are emitted as
analysis.imported events with payload { title, status, body, source_path }. They route to a dedicated per-repo
cortex-{repo}-analyses Meili index + Vectorizer collection, and the
graph writer materialises them as (:Analysis)-[:ANALYZES]->(:Repo).
This is the path audit / deep-analysis (spec 15) reports take to reach
the dashboard's Analysis surface.
If cortex.toml is missing, defaults apply:
- exclude common junk (
target/,node_modules/,.git/,dist/, binary extensions) - symbol chunking for code, section chunking for docs
- all git commits included, no PR enrichment
- no extra redaction patterns
Each source produces envelope-compliant events (spec 01). Examples:
Code artifact (one event per top-level declaration after Tree-sitter split):
Historical turn (one event per git commit):
{
"event_id": "01HXZ...", "ts": 1710000000000,
"kind": "turn.historical",
"adapter": "bootstrap",
"source": { "repo": "Vectorizer", "git_ref": "abc123", "author": "a@b.c" },
"redacted_payload": {
"role": "developer",
"message": "fix: raise ef_search default for 1M-vector benchmarks",
"evidence": { "files_changed": ["src/index/hnsw/mod.rs"], "diff_summary": "+40 / -12" }
},
"content_hash": "sha256:..."
}Decision (promoted from ADR / OpenSpec proposal):
{
"event_id": "01HY0...", "ts": 1712000000000,
"kind": "decision.imported",
"adapter": "bootstrap",
"source": { "repo": "Vectorizer", "path": "docs/decisions/0042-hnsw-ef-default.md" },
"redacted_payload": {
"title": "Raise HNSW ef_search default to 128",
"status": "accepted", "supersedes": null,
"body": "...markdown content..."
},
"content_hash": "sha256:..."
}Law (imported from existing rule files):
{
"event_id": "01HY1...", "ts": 1713000000000,
"kind": "law.imported",
"adapter": "bootstrap",
"source": { "repo": "Rulebook", "path": "rulebook/laws/LAW-007.yaml" },
"redacted_payload": {
"law_id": "LAW-007", "title": "...", "severity": "critical",
"detector": "hook:pre_commit_no_skip", "body": "...markdown..."
},
"content_hash": "sha256:..."
}{
"version": 1,
"started_at": "2026-04-17T20:30:00Z",
"repos": {
"Vectorizer": {
"files_walked": 4821, "files_total": 4821,
"commits_walked": 3124, "commits_total": 3124,
"events_emitted": 9573,
"status": "done",
"last_git_ref": "abc123"
},
"Nexus": {
"files_walked": 1204, "files_total": 5200,
"status": "in_progress",
"last_file": "src/query/parser.rs"
}
}
}Written atomically (write-rename) every 5 s while running. On --resume, the CLI picks up from last_file / last_git_ref.
┌─────────────────┐
│ Repo loader │ reads cortex.toml, applies defaults
└────────┬────────┘
▼
┌─────────────────┐ ┌──────────────────┐ ┌────────────────────────┐
│ File walker │ ──▶ │ Chunker prefilter│ ──▶ │ Synthetic event emit │
│ (ignore list) │ │ (quick size/ext) │ │ (envelope assembler) │
└─────────────────┘ └──────────────────┘ └───────────┬────────────┘
│
┌─────────────────┐ │
│ Git log walker │ ──▶ turn.historical events ────────────▶
└─────────────────┘ │
▼
┌─────────────────┐ ┌────────────────────────┐
│ ADR / OpenSpec │ ──▶ decision.imported ────▶ │ Redactor (cortex-core) │
│ recognizer │ │ (static patterns) │
└─────────────────┘ └───────────┬────────────┘
▼
┌─────────────────┐ ┌────────────────────────┐
│ Law / memory │ ──▶ law.imported + ─▶│ Publisher → Synap │
│ recognizer │ memory.imported │ cortex.events.bootstrap │
└─────────────────┘ └────────────────────────┘
Everything downstream (classifier, embedder, graph, full-text) is the same as live ingestion — the CLI does not touch those systems directly.
- Uses
ignorecrate (respects.gitignoreby default, pluscortex.tomlexcludes). - Emits
artifact.codefor source files (language detected by extension or Tree-sitter, including.vueSFCs — issue #3),artifact.docfor*.md. - Files > 10 MB are skipped with a warning (same rule as spec 08 truncation, but here we skip entirely because they likely aren't source).
git log --all --diff-filter=AMD --name-only --format='%H|%at|%ae|%s|%b'.- One event per commit. PR body (via
gh api) is appended ifcortex.git.include_prs = trueand a PR number is derivable from the commit message. - Merge commits (multiple parents) are emitted once; their diffs are the squashed diffs from the nearest non-merge parent.
Every synthetic event passes through cortex-core's redactor (spec 04) before publication. Repo-specific extra patterns merge into the global catalog for the duration of that repo's walk.
- Each event carries
content_hash = sha256(canonical_json(redacted_payload)). - Downstream writers (spec 06 embedder, spec 08 indexer) are keyed on
content_hash— re-running bootstrap is a no-op at the storage layer. - The checkpoint file prevents re-publishing unchanged events to Synap, saving classifier cost.
The 2026-04-29 audit caught the walker re-emitting every file
under fresh ULIDs on every run (26 decisions for 2 ADRs on disk,
37 laws for 12 rule files, etc.). The pre-phase10c walker keyed
events on (repo, path, run_id) so each invocation produced new
ULIDs for unchanged content; downstream writers' content-hash
dedup absorbed the bytes but the lane still piled up rows.
Phase10c adds a (repo, path, content_hash) ledger keyed on the
redacted body hash (the input to the emitter's redact pipeline,
deterministic across runs):
- Walker computes
body_hash = sha256(body)per accepted file. - Before publishing the file's events, the runner reads
bootstrap_seen(repo, path)from the metadata DB. - Match (
stored.content_hash == body_hash): suppress every event for that file and refreshlast_run_idonly. TheRepoRunReport.files_suppressedcounter surfaces the suppression rate. - Mismatch / absent: publish as usual, then upsert
(repo, path, body_hash, run_id, now)into the ledger.
Wired through run_repo_with_dedup;
existing run_repo callers
that don't pass a ledger continue to publish on every run.
The metadata table — bootstrap_seen(repo, path, content_hash, last_run_id, last_emitted_at) — is documented in
spec 02 §Metadata store.
When the ledger is empty AND the live lane carries
> 2 × disk_count rows for any of the
:Decision / :Law / :Analysis classes, the runner surfaces a
DuplicateLanePreflight finding. The bin caller (CLI / cron job)
logs a warning telling the operator to run
cortex-ops bootstrap-dedup --dry-run --repo <name> and
investigate before re-walking. The check is pure (no I/O); the
caller passes the disk + lane counts on hand.
cortex-ops bootstrap-dedup [--repo NAME] [--dry-run] [--apply]
[--metadata-db PATH] [--json]
--dry-run(read-only) walks thebootstrap_seenledger and reports(content_hash, paths[])groups whose size ≥ 2 — the same redacted body emitted under different file paths (a less common but real corruption mode).--applyis reserved for the future live-backend cleanup path (Vectorizer / Meili / Nexus row deletion). Today it exits with code3and a documentation pointer; the dry-run output remains the actionable surface.--metadata-dbdefaults to$CORTEX_METADATA_DBthen<home>/.cortex/metadata.sqlite.
cortex-bootstrap --workspace <ws.toml> drives multiple repos in one invocation. The TOML file lists every repo by id + path:
[[repo]]
id = "Cortex"
path = "E:/HiveLLM/Cortex"
[[repo]]
id = "Vectorizer"
path = "E:/HiveLLM/Vectorizer"
config = "cortex.toml" # optional override; defaults to <path>/cortex.tomlThe orchestrator runs in three phases:
- Pre-flight —
cortex_bootstrap::workspace::preflightwalks the entry list and aggregates every problem into a singleWorkspaceError::Preflight(Vec<String>). Verifies: non-empty + uniqueid, path exists,<path>/.gitexists,cortex.tomlis readable. The CLI aborts before walking any repo when preflight fails — no partial runs against a broken workspace. - Per-repo iteration — repos run in declaration order. The runner reuses the existing
run_repofor each entry; events flow through the same publisher / redactor / checkpoint pipeline as a single-repo invocation. Positional<REPO_ROOT>...args still work and merge with the workspace (deduped on path). - Skip-when-done — on each iteration the orchestrator checks the checkpoint: when
repos[id].status == "done"ANDlast_git_ref == git rev-parse HEAD, the repo is bypassed with aninfolog line.--forcere-runs done repos regardless. After every successful run the orchestrator stampslast_git_ref = current_head_sha(root)so the next invocation has a value to compare against.
The final output is a single summary table on stderr listing every repo with events, dropped, duration, and status (done / bypassed / failed: <reason>). The process exits with code 1 when any repo failed.
The actual replay of the user's 17 Hive checkouts against a live cluster is an operations step. The runbook lives at docs/operations/bootstrap-workspace.md; the source-controlled template at bootstrap.workspace.toml.example lists every Hive repo as a literal ${HIVE_ROOT}/<RepoName> placeholder so a single search-and-replace sufficies. The CI guard at crates/cortex-bootstrap/tests/workspace.rs::bootstrap_workspace_example_loads parses the template through cortex_bootstrap::workspace::load_workspace so a typo fails CI before reaching the operator.
Progress is surfaced three ways:
- Stderr progress bar (one line per repo) when
--log-format=textand TTY. - Structured logs (JSON lines) when
--log-format=json. - Metrics emitted to
cortex.metrics:
cortex.bootstrap.files.walked counter, labels: repo
cortex.bootstrap.files.skipped counter, labels: repo, reason
cortex.bootstrap.events.emitted counter, labels: repo, kind
cortex.bootstrap.bytes.processed counter, labels: repo
cortex.bootstrap.commits.walked counter, labels: repo
cortex.bootstrap.repo.duration_s histogram, labels: repo
cortex.bootstrap.errors counter, labels: repo, stage
Walks the repo without emitting events; prints:
Repo: Vectorizer
Files (after excludes): 4 821
Code chunks (est): 25 400 [Tree-sitter symbol-level]
Doc chunks (est): 3 100 [Markdown section-level]
Commits: 3 124
Est. events: 31 624
Est. redacted bytes: 48.2 MB
Est. Haiku classifier cost (input tokens): 12.1 M → ~$1.06
Est. Haiku classifier cost (output tokens): 4.3 M → ~$0.22
Est. embedding storage: 71 MB
Est. graph nodes: 28 900 (+ 52 000 edges)
Est. full-text index: 94 MB
Est. one-time runtime: 32 min (at 500 eps)
Numbers are back-of-envelope; real cost depends on classifier cache hit rate.
| Failure | Handling |
|---|---|
| Synap unreachable | Fail fast |
| Repo path does not exist | Fail with a clear error listing the provided paths |
.git missing |
Walk source files only; skip git log; warn |
| Tree-sitter grammar missing | Fall back to fixed-window chunking (same rule as spec 06); warn |
| Oversize file (>10 MB) | Skip; counter bootstrap.files.skipped{reason=oversize} |
| Checkpoint write failure | Fail fast (can't guarantee resume correctness) |
gh CLI missing |
PR enrichment disabled; warn once per run |
| Partial repo run interrupted (Ctrl-C) | Checkpoint still valid; --resume picks up |
The walker promotes the canonical .rulebook/ layout into typed
events on every repo, regardless of whether the repo ships a
per-repo cortex.toml. This closes the 2026-04-27 audit's
decisions = 0 / laws_active = 0 finding for spec-driven
projects: the institutional knowledge already exists on disk,
the indexer just needed to know where to look.
| Repo-rooted path | Walker class | Emitter event | Notes |
|---|---|---|---|
.rulebook/decisions/**/*.md |
FileClass::Decision |
decision.imported (one per file) |
ADR-style; first H1 → title, Status: → status. |
.rulebook/specs/**/*.md |
FileClass::Law |
law.imported (fan-out — one per ## heading) |
Spec doc → addressable laws. See "Spec splitter" below. |
.rulebook/knowledge/patterns/**/*.md |
FileClass::Memory |
memory.imported |
Reusable patterns; surfaced in /v1/dashboard/memory. |
.rulebook/knowledge/anti-patterns/**/*.md |
FileClass::Memory |
memory.imported |
Anti-patterns; same surface as patterns. |
.rulebook/learnings/**/*.md |
FileClass::Memory |
memory.imported |
Implementation insights — clipped excerpts in the Memory view. |
.rulebook/handoff/**/*.md |
FileClass::Memory |
memory.imported |
Cross-session hand-off snapshots; per-project Handoffs view. |
.rulebook/{PLANS,STATE,COMPACT_CONTEXT}.md |
FileClass::Memory |
memory.imported |
Loose top-level memory files. |
The patterns are hardcoded in walker::RULEBOOK_DECISION_GLOBS,
RULEBOOK_LAW_GLOBS, and RULEBOOK_MEMORY_GLOBS so a sibling
repo (Nexus / Vectorizer / Synap / etc.) without its own
cortex.toml still gets full discovery coverage. Per-repo
cortex.toml [cortex.{decisions,laws,analyses,memories}]
patterns continue to win when set — explicit promotion always
overrides the canonical defaults.
A spec doc like .rulebook/specs/RULEBOOK.md typically carries
many distinct rules under ## headings. Treating the whole doc
as one law collapses every rule under one un-addressable id.
The splitter solves this by:
- Walking lines, treating each
##at column 0 as a section boundary. - Synthesising
law_id = LAW-{filename-stem}-{section-index}-{section-slug}for every section so operators can grep for the prefix to find every law from a given spec. - Stamping the section heading text as
title. - Carrying the section body (heading + content up to the next
##or EOF) asbody. - Emitting one
law.importedevent per section.
Preamble text (before the first ## ) is intentionally dropped
— the dashboard already surfaces the full doc as a memory
entry; this path only materialises the addressable laws.
When a spec doc has zero ## boundaries, the splitter falls
back to emit_law_imported's single-law-per-file shape so
flat rule files keep working.
- Classifier worker (
cortex-classifier-worker::kinds):law.importednormalises toKind::LawViolation(the canonical Kind enum has no separateLawvariant today; "imported law" and "law violation" share storage but never collide because imported laws never carry aviolation_id). - Fulltext worker: routes the event to
cortex-{repo-slug}-governance. Each spec section becomes its own Meili doc, searchable bylaw_id/title/body. /v1/dashboard/laws+laws_activeoverlay: read every governance hit, populateLawRefentries whenextras.law_idis present. The orchestrator'sderive_lawsalready reads this field — no changes needed downstream.
After re-bootstrapping a repo:
cortex-bootstrap <repo>walks the.rulebook/specs/**tree and emits Nlaw.importedevents per spec doc.cortex-fulltext-workerindexes each intocortex-{slug}-governance.cortex-apiboot'smeili_loaderprojects them onto the keyword lane withextras.law_idstamped.curl http://127.0.0.1:17000/v1/dashboard/lawsshows the per-section laws underLAW-{stem}-NN-{slug}ids.
-
cortex-bootstrap Vectorizer/ --estimatecompletes on a cold cache, prints the sizing block, writes no events. -
cortex-bootstrap Vectorizer/end-to-end produces events across all four downstream writers (Vectorizer, Nexus, Meilisearch); verified viacortex query(spec 11 stub) or direct queries to each service. - Re-running the same
cortex-bootstrap Vectorizer/produces near-zero new chunks/nodes/docs (cache hits + content-hash dedup). -
--only Vectorizer,Nexusrestricts to two repos and ignores others passed on CLI. -
--since HEAD~200emits events only for commits within the last 200. - Checkpoint + resume: SIGINT mid-run;
--resumepicks up without duplicate events. - ADR recognition: a file matching
cortex.decisions.promote_patternsproducesdecision.importedevents and ends up as aDecisionnode in Nexus. - Law recognition: a file matching
cortex.laws.promote_patternsproduceslaw.importedevents and ends up as aLawnode in Nexus (spec 13 seeding). - Analysis recognition: a file matching
cortex.analyses.promote_patternsproducesanalysis.importedevents, lands incortex-{repo}-analyses(Meili + Vectorizer), and surfaces as(:Analysis)-[:ANALYZES]->(:Repo)in Nexus. - Redaction: a committed
.envfile with synthetic secrets is detected + patterns are stripped before publication; unit test asserts no secret bytes reach Synap. - Parallel walk:
--parallelism 4against 4 small repos finishes in ~1/4 the wall-clock of sequential; no event loss. - Tree-sitter-missing language: Elixir files land with
source=fallback_windowchunks downstream. -
--dry-runwrites no events, prints the plan. - Telemetry counters non-zero at end of a real run; bootstrap run for the 17-repo corpus finishes within the 4–8 h envelope (architecture §6.5).
- Synthetic events flow through the live pipeline. No "bootstrap mode" shortcuts. If a bug shows up in the classifier, it shows up for live events too — one system of record.
- Separate Synap stream.
cortex.events.bootstrapis distinct fromcortex.events.rawso we can throttle or pause bootstrap without starving live capture. - File-level checkpointing. Per-commit / per-file granularity; resume is trivially correct. Sacrifices some resume speed in exchange for no-duplicate guarantee.
- Respect
.gitignoreby default. Surprising developers by indexingtarget/bytes is worse than missing something. Explicit opt-in viacortex.toml. - PR enrichment is optional.
ghauth is not always available. Degrade gracefully. - No reindex triggers from the CLI. The CLI writes events; it does not reach into Vectorizer/Nexus/Meilisearch. Everything downstream is idempotent, so a re-run is the refresh mechanism.
- Cross-repo symbol resolution. When the same function name appears in multiple repos, do we emit
SIMILAR_TOhints here or defer to query-time derivation? Deferring to spec 11 for now. - Incremental bootstrap scheduling. Should
--sincedefault to the checkpoint'slast_git_refon subsequent runs, making continuous bootstrap trivial? Yes; tracked as a follow-up once the Phase-2 tailing spec lands.
- Architecture §6 (bootstrap), §6.3 (repo priority list), §6.5 (sizing).
- Spec 01 — Event schema.
- Spec 04 — Cortex Core (redactor, ingestion router).
- Spec 05 — Classifier (downstream consumer).
- Spec 06 — Embedder (symbol-level chunking).
- Spec 07 — Graph writer (consumes decision/law events).
- Spec 08 — Full-text indexer.
- Spec 10 — Claude Code adapter (not this spec, but a sibling producer of events).
ignorecrate: https://docs.rs/ignore (gitignore-aware walker).ghCLI: https://cli.github.com (optional PR enrichment).
{ "event_id": "01HXY...", "ts": 1713369600000, "kind": "artifact.code", "adapter": "bootstrap", "source": { "repo": "Vectorizer", "path": "src/index/hnsw/mod.rs", "symbol": "hnsw_search", "byte_range": [120, 840], "git_ref": "abc123" }, "redacted_payload": { "text": "pub fn hnsw_search(...) { ... }", "language": "rust" }, "content_hash": "sha256:..." }