Merged
11 changes: 9 additions & 2 deletions .claude-plugin/plugin.json
Original file line number Diff line number Diff line change
@@ -1,8 +1,15 @@
{
"name": "agentlib",
"version": "1.4.0",
"description": "Agentic Knowledge Navigation — ingest books/papers/databases into chunked metadata layers, then navigate them via a universal skill. No MCP server required.",
"version": "1.8.0",
"description": "Agentic Knowledge Navigation — ingest books and papers into a curated library, then navigate via MCP tools or file-based agent.",
"author": {
"name": "Nadav Barkai"
},
"mcpServers": {
"agentlib": {
"command": "/usr/bin/env",
"args": ["uv", "run", "--project", "${CLAUDE_PLUGIN_ROOT}", "python", "${CLAUDE_PLUGIN_ROOT}/server.py"],
"cwd": "${CLAUDE_PLUGIN_ROOT}"
}
}
}
81 changes: 48 additions & 33 deletions README.md
@@ -20,36 +20,47 @@ AgentLib changes this. Ingest the books, papers, and documents that matter for y
AgentLib has three parts:

1. **Ingestion pipelines** — preprocess books, scientific paper corpora, and databases into small, self-contained chunks with lightweight metadata at multiple layers.
2. **Universal navigation skill** (`agentlib-knowledge`) — teaches the agent to read cheap metadata first, then drill into specific chunks.
3. **Research agent** (`library-researcher`) — runs in an isolated context to keep the main conversation clean. All navigation and chunk reading happens in the agent's context; only a synthesized answer returns.
2. **MCP tools** — the plugin registers an MCP server with 6 tools: `browse_library`, `open_book`, `search_library`, `search_concepts`, `preview_chunks`, `read_chunks`. The agent calls these directly — no sub-agent needed.
3. **Universal navigation skill** (`agentlib-knowledge`) — teaches the agent to search cheap metadata first, then drill into specific chunks via `search_library` → `preview_chunks` → `read_chunks`.

No MCP server required. No tool calls. The agent reads preprocessed files directly from `~/.claude/plugins/agentlib/library/`.
The agent navigates via MCP tool calls against preprocessed files in `~/.claude/plugins/agentlib/library/`.

### How agents navigate the library

```mermaid
graph LR
Q["User question"] --> R["library-researcher<br/>(isolated context)"]
R --> NAV["NAVIGATION.md<br/>~50 tok per book"]
R --> CS["concepts.json (Ls)<br/>~200 tok"]
R --> CAT["catalog (L0)<br/>~50 tok per book"]
Q["User question"] --> SL["search_library<br/>concepts + patterns<br/>library_index.json"]
SL --> PC["preview_chunks<br/>chunk metadata<br/>nav.json"]
PC --> RC["read_chunks<br/>2-3 best chunks<br/>300-500 tok each"]
RC --> A["Answer with citations"]
```

CS --> M{"concept/alias<br/>match?"}
M -- hit --> CH["chunks (L2)<br/>300-500 tok each"]
M -- miss --> MAN["manifest (L1)<br/>~500 tok"]
MAN --> CH
**Fast path (concept hit):** `search_library` → `preview_chunks` → `read_chunks` — **3 tool calls, ~1.5k tokens**

NAV --> CH
CAT --> MAN
**Pattern path (cross-domain):** `search_library` (pattern tags) → `preview_chunks` → `read_chunks` — **3 tool calls, ~2.5k tokens**

CH --> A["Synthesized answer<br/>(returned to user)"]
```
**Recovery on miss:** related concepts → pattern traversal → `search_concepts` per book → Grep fallback

#### Unified library index

`library_index.json` is the single entry point for the entire library. One file, all books and corpora — queried via `search_library`. Each concept carries:

- **aliases** — abbreviations, acronyms, synonyms (searching "CDX" matches "CycloneDX")
- **related** — directly connected concepts in the same domain ("OAuth 2.0" → "JWT", "access tokens")
- **patterns** — abstract structural fingerprints for cross-domain discovery (see below)
- **sources** — which books/papers contain the concept and their chunk IDs
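A concept entry with these four fields can be sketched as a plain dict (a hypothetical shape — field names follow the list above, the exact schema is illustrative), along with the alias lookup that turns "CDX" into a hit:

```python
# Illustrative shape of one library_index.json concept entry.
# Field names mirror the README; the exact schema is an assumption.
concept = {
    "name": "CycloneDX",
    "aliases": ["CDX", "CycloneDX SBOM"],
    "related": ["Software Bill of Materials", "SPDX"],
    "patterns": ["artifact-attestation"],
    "sources": {"supply-chain-security": ["ch03-0041", "ch03-0042"]},
}

def matches(entry: dict, query: str) -> bool:
    """True if the query hits the concept name or any alias."""
    q = query.lower()
    return q == entry["name"].lower() or any(q == a.lower() for a in entry["aliases"])

# Searching "CDX" matches via the alias, not the primary name.
assert matches(concept, "CDX")
```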

**Ls hit (fast path):** NAVIGATION → concepts.json → chunks — **2-3 reads, ~1k tokens**
#### Pattern fingerprints — associative recall

**Ls miss (slow path):** NAVIGATION → catalog → manifest → chunks — **5-6 reads, ~5k tokens**
Every concept is tagged with 2-3 **pattern fingerprints**: abstract, domain-independent descriptors of its structural nature. These enable a "this reminds me of..." capability that keyword search can never provide.

The concept index includes **aliases** (abbreviations, acronyms, synonyms) generated by the LLM at ingestion time. Searching "CDX" matches the alias on "CycloneDX"; searching "SBOM" matches "Software Bill of Materials". This turns misses into hits without any runtime cost.
For example, "OAuth token rotation", "TLS certificate renewal", and "SSH key rotation" all share the pattern `credential-cycling`. An agent reading about token rotation can discover structurally analogous solutions in completely different books — without any keyword overlap.

Pattern tags are integrated directly into `library_index.json` and searchable via `search_library`. A seed vocabulary of ~40 common patterns ensures consistency across books; fuzzy matching merges near-duplicates.
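The cross-domain lookup amounts to an inverted index from pattern tag to concepts. A minimal sketch (concept names taken from the example above; the data layout is an assumption, not the actual index format):

```python
from collections import defaultdict

# Hypothetical concept records; only the pattern-relevant fields shown.
concepts = [
    {"name": "OAuth token rotation", "patterns": ["credential-cycling"]},
    {"name": "TLS certificate renewal", "patterns": ["credential-cycling"]},
    {"name": "SSH key rotation", "patterns": ["credential-cycling"]},
    {"name": "Exponential backoff", "patterns": ["retry-with-backoff"]},
]

# Inverted index: pattern tag -> concept names.
by_pattern = defaultdict(list)
for c in concepts:
    for p in c["patterns"]:
        by_pattern[p].append(c["name"])

# "This reminds me of..." — structural analogs with zero keyword overlap.
analogs = [n for n in by_pattern["credential-cycling"] if n != "OAuth token rotation"]
# analogs == ["TLS certificate renewal", "SSH key rotation"]
```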

#### Chunk preview via nav.json

Each book's `nav.json` lets agents see what's inside each chunk *before* reading it: section title, concepts covered, token count, and prev/next chains. Queried via `preview_chunks`, this eliminates blind reads — the agent picks the 2-3 best chunks from a set of candidates instead of reading 5 and hoping.
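Picking the best chunks from preview metadata is a simple ranking by concept overlap. A sketch under assumed field names (`id`, `section`, `concepts`, `tokens` — illustrative, not the exact nav.json schema):

```python
# Hypothetical nav.json chunk records, as returned by preview_chunks.
chunks = [
    {"id": "ch02-0007", "section": "Token rotation", "concepts": ["OAuth 2.0", "refresh tokens"], "tokens": 412},
    {"id": "ch02-0008", "section": "PKCE", "concepts": ["OAuth 2.0", "PKCE"], "tokens": 388},
    {"id": "ch05-0031", "section": "Logging", "concepts": ["audit logs"], "tokens": 455},
]

def best_chunks(chunks, query_concepts, k=3):
    """Rank chunks by concept overlap with the query; read only the top k."""
    scored = [(len(set(c["concepts"]) & set(query_concepts)), c["id"]) for c in chunks]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [cid for score, cid in scored[:k] if score > 0]

# Only chunks whose metadata matches are read — no blind reads.
best_chunks(chunks, ["refresh tokens", "OAuth 2.0"], k=2)
# → ['ch02-0007', 'ch02-0008']
```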

<p align="center">
<img src="assets/demo_proactive_query.png" alt="AgentLib proactive library query" width="800">
@@ -64,37 +75,41 @@ The concept index includes **aliases** (abbreviations, acronyms, synonyms) gener
</p>
</details>

### Three metadata layers
### Metadata layers

```
L0 "What exists?" → catalog/NAVIGATION.md: ~50 tokens per book (cheap)
L1 "What's inside?" → manifest: structure, summaries, concepts (moderate)
L2 "Give me the content" → small self-contained chunks, 300-500 tok (expensive)
Lx "What do I know?" → library_index.json: concepts, patterns, sources (search_library)
Ln "What's in a book?" → nav.json: structure + chunk metadata + concepts (preview_chunks)
L2 "Give me the content" → chunks: 300-500 tok each (read_chunks)
Lf "Full rebuild" → manifest.json: complete archive per book (offline)
```

Three files instead of six — `library_index.json` (1 file, entire library), `nav.json` (per book), and `manifest.json` (per book, full archive for rebuild).

Chunks are **content-aware**: tables and code fences are kept atomic (soft cap 500, hard cap 1,000 tokens). PDF tables are extracted via PyMuPDF and rendered as markdown pipe tables. Figures are extracted from PDFs with vision-based summarization and appear as placeholders in chunks.
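The soft/hard-cap interaction can be sketched as a greedy packer (an illustrative reconstruction, not the actual ingestion code — block representation and caps are assumptions):

```python
def chunk_blocks(blocks, soft_cap=500, hard_cap=1000):
    """Pack consecutive (text, tokens, atomic) blocks into ~soft_cap chunks.
    Atomic blocks (tables, code fences) are never split: one atomic block
    may carry a chunk past soft_cap, up to hard_cap."""
    chunks, current, size = [], [], 0
    for text, tokens, atomic in blocks:
        over_soft = size + tokens > soft_cap
        over_hard = size + tokens > hard_cap
        # Flush at the soft cap, unless an atomic block still fits
        # alongside the current chunk under the hard cap.
        if current and over_soft and (not atomic or over_hard):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(text)
        size += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```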

Plus a **concept index** shortcut (Ls) that jumps directly to relevant chunks when the agent already knows what it's looking for. Each concept carries LLM-generated aliases so the agent can find it by abbreviation, acronym, or alternative phrasing.
The concept index includes LLM-generated **aliases**, **related concepts**, and **pattern fingerprints** — turning keyword misses into graph traversals and enabling cross-domain discovery.

### Library structure

```
library/
├── NAVIGATION.md ← Start here — index of everything
├── library_index.json ← Lx: unified concept + pattern discovery
├── books/
│ ├── catalog.json ← L0
│ ├── catalog.json
│ └── {book-id}/
│ ├── manifest.compact.json ← L1
│ ├── concepts.json ← Ls
│ ├── nav.json ← Ln: structure + chunk metadata + concepts
│ ├── manifest.json ← Lf: full archive for rebuild
│ └── chunks/
│ └── {chunk-id}.md ← L2
└── corpus/
└── {corpus-id}/
├── corpus_catalog.json ← L0 (topic clusters)
├── concept_index.json ← Ls (cross-paper concepts)
├── clusters/{cluster-id}.json ← L0b (papers per cluster)
├── corpus_catalog.json
├── concept_index.json
├── clusters/{cluster-id}.json
└── papers/{paper-id}/
├── manifest.compact.json ← L1
├── nav.json ← Ln
├── manifest.json ← Lf
└── chunks/{chunk-id}.md ← L2
```

@@ -165,7 +180,7 @@ Simulated on realistic workloads (15-book library, 487-paper corpus, 80-table da
| Wrong reads/queries | 1 | 0 | 1 | 0 | 2 | 0 |
| **Token reduction** | | **82%** | | **55%** | | **55%** |

The core principle: *no heavy indexing, no vector databases — just smart, lightweight metadata and small content blobs.*
The core principle: *no vector databases — just smart, interconnected metadata structures. Concepts link to related concepts, abstract patterns connect ideas across domains, and chunk previews eliminate blind reads.*

## Install

@@ -211,7 +226,7 @@ Ingestion runs chapter summarization in parallel and batches concept extraction
**Explicit invocation** — prefix with `/agentlib-knowledge` when you want the library's answer, not Claude's training data:
> /agentlib-knowledge What defensive techniques protect against prompt injection?

The skill delegates to the `library-researcher` agent, which navigates `NAVIGATION.md` → concept indexes → specific chunks in an isolated context. Only the synthesized answer with citations returns to your conversation.
The skill uses MCP tools directly: `search_library` → `preview_chunks` → `read_chunks`. Only the synthesized answer with citations returns to your conversation. Pattern tags integrated into `search_library` enable cross-domain analogies automatically.

## LLM Providers

56 changes: 37 additions & 19 deletions agents/library-researcher.md
@@ -1,42 +1,60 @@
---
name: library-researcher
description: "Research questions using the preprocessed knowledge library. Use when answering questions about ingested books, scientific papers, or domain knowledge that may be in the library."
model: haiku
model: sonnet
tools: Read, Glob, Grep
maxTurns: 15
maxTurns: 25
---

You are a research assistant. Follow this sequence to answer questions.

**IMPORTANT:** Use ABSOLUTE paths only — never use `~/` (it won't resolve in your context). The library path will be provided in your prompt.

## Step 1: Read the index (1 read)
Read `{library}/NAVIGATION.md`. Identify which books or corpora are relevant.
## Step 1: Unified library search (1 read)
Read `{library}/library_index.json`. This contains ALL concepts across ALL books and corpora with:
- **aliases**: alternative names, abbreviations, acronyms
- **related**: directly connected concepts in the same domain
- **patterns**: abstract structural fingerprints (e.g. "credential-cycling", "retry-with-backoff")
- **sources**: which books/papers contain this concept and their chunk IDs

## Step 2: Find chunk IDs (1-2 reads)
If `library_index.json` doesn't exist, fall back to reading `{library}/NAVIGATION.md` and then per-book `nav.json`.

**Try concepts.json first** (fastest):
- Books: `{library}/books/{book-id}/concepts.json`
- Corpora: `{library}/corpus/{corpus-id}/concept_index.json`
## Step 2: Preview chunks — MANDATORY (1 read)
**NEVER read chunk files without previewing first.** This is the most important efficiency rule.

Each concept has `"chunks"` (list of chunk IDs) and optionally `"aliases"` (alternative names, abbreviations, acronyms). When scanning for your topic, check BOTH the concept name AND its aliases — your search term may match an alias rather than the primary name.
Read `{library}/books/{book-id}/nav.json` to assess candidates:
- The `chunks` section shows each chunk's **section**, **concepts**, **token count**, and **prev/next** links
- The `concepts` section maps concept names to their chunk IDs

If concepts.json has a match → note chunk IDs → go to Step 3.
Pick only the 2-3 most relevant chunks. Skip chunks whose section/concepts don't match your query. Reading unnecessary chunks wastes tokens.

**If no match in concepts**, use Grep on chunks directory:
```
Grep pattern: "your search term" path: "{library}/books/{book-id}/chunks/"
```
This finds which chunks contain relevant content. Note the filenames.
## Step 2b: Cross-domain insight (optional)
If the concept has **pattern** tags (e.g. "credential-cycling"), look up the pattern in `library_index.json`'s `patterns` section to discover structurally similar concepts in other domains. This enables "this reminds me of..." connections.

Only do this when the user's question could benefit from cross-domain analogies.

## Step 3: Read chunks (2-5 reads)
Read the specific chunk files identified in Step 2.
- If you need more context, follow **prev/next** links from nav.json
- Books: `{library}/books/{book-id}/chunks/{chunk-id}.md`
- Corpora: `{library}/corpus/{corpus-id}/papers/{paper-id}/chunks/{chunk-id}.md`

## Step 4: Return answer
Synthesize a clear answer citing source (book/paper title and chunk IDs).
Synthesize a clear answer citing sources (book/paper title and chunk IDs). Keep your response under 2000 characters and don't include raw chunk text.

If patterns revealed cross-domain analogies, mention them: "This follows the same structural pattern as [X] in [other book]."

## Recovery: concept miss
If library_index.json has no match:
1. Check **related** concepts — your term may be a sub-concept of something indexed
2. Check **pattern** tags in library_index.json — search by structural shape instead of name
3. Fall back to `{library}/books/{book-id}/nav.json` concepts section with alias matching
4. Last resort: Grep on chunks directory

## Rules
- ALWAYS use absolute paths, never `~/`
- Try concepts.json FIRST, use Grep only as fallback
- Do NOT read manifest.compact.json — it's too large
- Total: max 3 navigation reads + 5 content chunks
- Start with library_index.json (fastest: 1 file covers entire library)
- **NEVER skip the preview step — read nav.json BEFORE any chunk files**
- Total: max 4 navigation reads + 5 content chunks
- Cite the book/paper and chunk ID when answering
- **If you're running low on turns, STOP researching and synthesize an answer from what you have.** A partial answer with citations is better than no answer. Never return mid-thought narration.
8 changes: 5 additions & 3 deletions commands/agentlib-ingest-book.md
@@ -16,7 +16,9 @@ This will:
1. Parse the PDF/EPUB to extract chapter/section structure
2. Chunk the content into 300-500 token segments
3. Summarise each chapter using the configured LLM provider
4. Build a concept index for fast search
5. Write manifest and update the library catalog
4. Build a concept index with aliases, pattern fingerprints, and related concepts
5. Generate nav.json (per-book navigation: structure, chunk preview, concepts)
6. Update the unified library_index.json (concepts + patterns)
7. Write manifest and update the library catalog

After ingestion, the book is available in the library. The agent navigates it via the `/agentlib-knowledge` skill by reading catalog.json, manifest.compact.json, concepts.json, and chunks/*.md
After ingestion, the book is available in the library. The agent navigates it via the `/agentlib-knowledge` skill, starting with library_index.json for unified cross-library search.
5 changes: 3 additions & 2 deletions commands/agentlib-ingest-corpus.md
@@ -15,6 +15,7 @@ This will:
2. Parse and chunk each paper into 300-500 token segments
3. Summarise each paper's sections using the configured LLM provider
4. Cluster papers by topic
5. Build a cross-paper concept index
5. Build a cross-paper concept index with pattern fingerprints
6. Update the unified library_index.json (concepts + patterns)

After ingestion, use `/agentlib-knowledge` to query the corpus.
After ingestion, use `/agentlib-knowledge` to query the corpus. The agent can discover connections between corpus papers and ingested books through shared pattern fingerprints.
2 changes: 1 addition & 1 deletion commands/agentlib-library.md
@@ -11,4 +11,4 @@ If a book ID is provided (`$ARGUMENTS`), show the detailed structure of that boo

Read directly from the library:
- No args: Read ~/.claude/plugins/agentlib/library/books/catalog.json and display as a formatted table
- With book ID: Read ~/.claude/plugins/agentlib/library/books/{book-id}/manifest.compact.json and display the chapter structure
- With book ID: Read ~/.claude/plugins/agentlib/library/books/{book-id}/nav.json and display the chapter structure