build: generate llms-full.txt for AI ingestion by wilsonwangdev · Pull Request #70 · wilsonwangdev/agent-master-handbook

wilsonwangdev · 2026-05-08T14:58:13Z

Summary

Adds /llms-full.txt — a single markdown document concatenating all published English pages, optimized for AI crawlers that prefer one-shot ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI allow-list, merged) by giving AI answer engines a single 43KB file containing the full handbook body.

Why this exists

The llms.txt specification defines two variants:

- `llms.txt`: concise link index (already shipped in #69, "infra: strengthen AEO baseline for AI answer engines")
- `llms-full.txt`: full content in one document

Answer engines that allocate constrained scrape budgets (ChatGPT search, Perplexity, Bing Copilot) often prefer the full version when available because it avoids per-page HTTP overhead and ambiguous ranking across small pages.

Contents of the generated file

- Header with site identity (author, repo, homepage, link back to the `llms.txt` index)
- Only pages with `status: published`; drafts are excluded from AI indexes
- Ordered by section (concepts → guides → curated → evangelism), then by title
- Per-page structured header before the body: section, canonical URL, last-updated, summary
- Source H1 stripped (replaced by the structured `## <title>` anchor) to avoid duplicate headings
- Separators (`---`) between pages so AI can parse boundaries

Size: 43KB for current 12 published EN pages. Well under any practical scrape-budget threshold; will grow roughly linearly with published content.
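
The assembly rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the page field names (`status`, `section`, `title`, `url`, `updated`, `summary`, `body`), the `site` object, and the helper `stripSourceH1` are all assumptions made for the example.

```javascript
// Hypothetical sketch of the llms-full.txt assembly rules described above.
// Field names and the `site` shape are assumptions, not the repository's schema.
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

// Drop the first level-1 heading so the structured "## <title>" line is the
// only title. Assumes the H1 appears before any fenced code block.
function stripSourceH1(body) {
  return body.replace(/^#[ \t]+.*(?:\r?\n)+/m, "").trim();
}

function buildLlmsFullTxt(pages, site) {
  const published = pages
    .filter((p) => p.status === "published") // drafts never reach AI indexes
    .sort(
      (a, b) =>
        SECTION_ORDER.indexOf(a.section) - SECTION_ORDER.indexOf(b.section) ||
        a.title.localeCompare(b.title)
    );

  // Site identity header with a link back to the llms.txt index.
  const header = [
    `# ${site.name}`,
    "",
    `Author: ${site.author}`,
    `Repo: ${site.repo}`,
    `Homepage: ${site.url}`,
    `Index: ${site.url}/llms.txt`,
  ].join("\n");

  // Per-page structured header followed by the body without its source H1.
  const entries = published.map((p) =>
    [
      `## ${p.title}`,
      "",
      `- Section: ${p.section}`,
      `- URL: ${p.url}`,
      `- Last updated: ${p.updated}`,
      `- Summary: ${p.summary}`,
      "",
      stripSourceH1(p.body),
    ].join("\n")
  );

  // "---" separators mark page boundaries.
  return [header, ...entries].join("\n\n---\n\n") + "\n";
}
```

Sorting on the section index first and falling back to `localeCompare` on the title keeps the output order stable across builds, which should make diffs of the generated file easier to review.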

Implementation notes

- `buildPages` now returns the raw body per page (a new `body` field on each entry of the pages array). This adds no extra I/O, because the body is already in memory during page rendering.
- `buildLlmsFullTxt` lives next to `buildLlmsTxt` and runs in the same parallel `Promise.all` batch in `main`.
- The `llms.txt` header already advertises `llms-full.txt` (shipped in #69, "infra: strengthen AEO baseline for AI answer engines"), so crawlers starting from the index discover it.
- `AGENTS.md` documents the status-filter contract so future agents know drafts are excluded by design.
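
As a rough illustration of the wiring described in these notes, the sketch below reuses the in-memory `body` field and runs both generators in one `Promise.all` batch. Only the names `buildPages`, `buildLlmsTxt`, `buildLlmsFullTxt`, and `main` come from this PR; the function bodies, signatures, and page shape are hypothetical stand-ins.

```javascript
// Hypothetical stand-ins for the real builders; the bodies are assumptions.
async function buildPages() {
  // Each page keeps its raw markdown in `body`, so later steps need no re-read.
  return [
    { title: "Example", url: "/guides/example/", body: "Raw markdown body" },
  ];
}

async function buildLlmsTxt(pages) {
  // Link index: one line per page.
  return pages.map((p) => `- [${p.title}](${p.url})`).join("\n");
}

async function buildLlmsFullTxt(pages) {
  // Full content: reuses the in-memory body, so no extra I/O is needed.
  return pages.map((p) => `## ${p.title}\n\n${p.body}`).join("\n\n---\n\n");
}

async function main() {
  const pages = await buildPages();
  // Both generators depend only on `pages`, so they can share one batch.
  const [llmsTxt, llmsFullTxt] = await Promise.all([
    buildLlmsTxt(pages),
    buildLlmsFullTxt(pages),
  ]);
  return { llmsTxt, llmsFullTxt };
}
```

Because neither generator mutates `pages`, running them concurrently should be safe in this sketch; the real build may batch additional outputs alongside them.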

Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 28 pages built, 43,859-byte `llms-full.txt`.
- Verified: 12 published pages included (2 concepts + 5 guides + 4 curated + 1 evangelism), correct section ordering, structured headers on each entry, no duplicate H1s after the source-H1 strip.

vercel · 2026-05-08T14:58:18Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agent-master-handbook | Ready | Preview, Comment | May 8, 2026 3:03pm |

Concatenates all published English pages into a single ~43KB markdown document at /llms-full.txt. Drafts and hidden pages excluded. Ordered by section (concepts → guides → curated → evangelism) then title. Each entry carries a structured header (section, URL, last-updated, summary) so AI crawlers can anchor citations back to specific pages within the aggregate document. Complements llms.txt (link index) and sitemap.xml (machine-readable URL list). llms.txt now advertises llms-full.txt in its header.

## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the process failure that surfaced mid-session and updates the pre-work checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan commits from a stale local branch whose contents had already squash-merged to `main` as #68. Different SHAs hid the equivalence from `git fetch -p`. The reviewer caught it; both PRs were rebased onto `origin/main` and force-pushed.

The journal entry is the third PR-hygiene failure recorded. The pattern across all three: rules exist, but the procedural trigger points where rules must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md`: what happened, root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` Git Workflow: new pre-work step 9, which explicitly verifies that `git branch --show-current` returns `main` right before `git checkout -b`. Squash-merge makes stale-branch detection unreliable via SHA; this is a terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md`: check off #67, #69, #70; add "AI visibility / AEO content guide" as the next content item (the planned Layer 1 guide that closes this workstream).

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the repo:

- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) / [llmstxt.site](https://llmstxt.site) / [llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub): submit the llms.txt URL
- Bing Webmaster Tools: register and submit the sitemap (unlocks the AI Performance panel with citation counts)
- Google Search Console: verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program): submit the site for source consideration

These will be handled out-of-band. The handbook's technical readiness for them shipped in #69 and #70.
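
The terminal-state assertion from pre-work step 9 could be scripted along these lines. This is a sketch only: the function name `assertOnMain` and the injected `run` callback are hypothetical, and nothing here is taken from `AGENTS.md` beyond the `git branch --show-current` check itself.

```javascript
// Hypothetical helper: fail fast unless the current branch is `main`.
// `run` is an injected function that executes a shell command and returns
// its stdout as a string (an assumption made to keep the sketch testable).
function assertOnMain(run) {
  // `git branch --show-current` prints nothing on a detached HEAD,
  // which this check also treats as "not on main".
  const current = run("git branch --show-current").trim();
  if (current !== "main") {
    throw new Error(
      `Expected to branch from main, but current branch is "${current || "(detached HEAD)"}"`
    );
  }
  return current;
}
```

In a Node script, `run` could be something like `(cmd) => execSync(cmd, { encoding: "utf8" })` using `execSync` from `node:child_process`. Checking the branch name rather than comparing SHAs matches the reasoning above: after a squash-merge, SHAs no longer reveal that a stale branch is already merged.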

## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI allow-list, JSON-LD differentiation, identity graph) and #70 (llms-full.txt). This is the Layer 1 content deliverable: a practitioner-facing guide on making any site legible to AI answer engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt**: root-level Markdown map (linked to the [llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt**: full-content concatenation for one-shot ingestion.
3. **robots.txt AI bot explicit allow-list**: linked to each vendor's own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation**: `TechArticle`, `HowTo`, `citation`, `@graph`; linked to [schema.org](https://schema.org/) and [Google's structured data docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions**: Bing Webmaster (AI Performance panel), Google Search Console, `llms.txt` directories, Perplexity Publishers.

Plus an explicit **What to Avoid** section: spam-for-AI (generating low-effort content to game citations) is called out as the failure mode that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice links to its primary source. A final **Inspiration** section acknowledges [Tw93's May 2026 post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis that triggered this handbook's own AEO baseline work, without making it a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content**: no hardcoded references to this project in the main body. Where the handbook's own implementation illustrates a practice, it appears in clearly marked callout blocks (for example, the `build.mjs` reference under the llms.txt section), not as the guide's main thread.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 30 pages built (was 28, +2 for EN/ZH).
- Verified the EN page emits `TechArticle` schema with author and publisher, matching other guides.
- Verified the new guide is indexed in `llms.txt` and included in `llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend interaction review does not apply.

## ROADMAP

Item will be checked off when this PR merges, following the same convention as other published guides.
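
For readers unfamiliar with the `TechArticle` check mentioned under Tested, a page-level JSON-LD object with author and publisher might be built roughly as below. The helper names `techArticleJsonLd` and `toScriptTag`, the `page` and `site` shapes, and the choice of `Person` and `Organization` types are assumptions for illustration; the handbook's actual templates may differ.

```javascript
// Hypothetical sketch: build a schema.org TechArticle object for one page.
function techArticleJsonLd(page, site) {
  return {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    headline: page.title,
    description: page.summary,
    url: page.url,
    dateModified: page.updated,
    // Author and publisher types are assumptions for this example.
    author: { "@type": "Person", name: site.author },
    publisher: { "@type": "Organization", name: site.name, url: site.url },
  };
}

// Serialize for embedding in HTML. Escaping "<" keeps a literal "</script>"
// inside any string value from closing the tag early.
function toScriptTag(jsonLd) {
  const json = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

The `\u003c` escape is still valid JSON, so parsers that read the script contents recover the original strings unchanged.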

Base automatically changed from infra/aeo-baseline to main May 8, 2026 14:59

wilsonwangdev force-pushed the build/llms-full branch from 494a2ad to 72743cd Compare May 8, 2026 15:02

vercel Bot deployed to Preview May 8, 2026 15:03 View deployment

wilsonwangdev merged commit 2c18718 into main May 8, 2026
3 checks passed

wilsonwangdev deleted the build/llms-full branch May 8, 2026 15:07

wilsonwangdev mentioned this pull request May 8, 2026

journal: record stale-branch base pollution from PR #69/#70 #71

Merged

wilsonwangdev mentioned this pull request May 8, 2026

content: add AI visibility / AEO guide (EN+ZH) #72

Merged


wilsonwangdev merged 1 commit into main from build/llms-full
