build: generate llms-full.txt for AI ingestion #70
Merged
Concatenates all published English pages into a single ~43KB markdown document at /llms-full.txt. Drafts and hidden pages excluded. Ordered by section (concepts → guides → curated → evangelism) then title. Each entry carries a structured header (section, URL, last-updated, summary) so AI crawlers can anchor citations back to specific pages within the aggregate document. Complements llms.txt (link index) and sitemap.xml (machine-readable URL list). llms.txt now advertises llms-full.txt in its header.
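The selection and ordering rules described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the field names (`lang`, `status`, `hidden`, `section`, `title`) and the helper names `selectPages` and `sortPages` are assumptions made for the example.

```javascript
// Section order taken from the description above; the constant name is hypothetical.
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

// Keep only published, non-hidden English pages.
// Field names are assumptions, not the handbook's actual front-matter schema.
function selectPages(pages) {
  return pages.filter(
    (p) => p.lang === "en" && p.status === "published" && !p.hidden
  );
}

// Sort by section rank first, then alphabetically by title.
function sortPages(pages) {
  const rank = (section) => {
    const i = SECTION_ORDER.indexOf(section);
    return i === -1 ? SECTION_ORDER.length : i; // unknown sections sort last
  };
  return [...pages].sort(
    (a, b) =>
      rank(a.section) - rank(b.section) || a.title.localeCompare(b.title)
  );
}

const sample = [
  { lang: "en", status: "published", section: "guides", title: "B guide" },
  { lang: "en", status: "draft", section: "concepts", title: "Draft" },
  { lang: "en", status: "published", section: "concepts", title: "A concept" },
  { lang: "zh", status: "published", section: "concepts", title: "ZH concept" },
  { lang: "en", status: "published", section: "guides", title: "A guide", hidden: true },
];

const ordered = sortPages(selectPages(sample)).map((p) => p.title);
console.log(ordered); // [ 'A concept', 'B guide' ]
```

Sorting a copy (`[...pages]`) leaves the caller's array untouched, which matters if the same `pages` array also feeds other build steps.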
Force-pushed from 494a2ad to 72743cd.
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the process failure that surfaced mid-session and updates the pre-work checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan commits from a stale local branch whose contents had already squash-merged to `main` as #68. Different SHAs hid the equivalence from `git fetch -p`. The reviewer caught it; both PRs were rebased onto `origin/main` and force-pushed.

The journal entry is the third PR-hygiene failure recorded. The pattern across all three: rules exist, but the procedural trigger points where rules must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md`: what happened, root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` (Git Workflow): new pre-work step 9, which explicitly verifies that `git branch --show-current` returns `main` right before `git checkout -b`. Squash-merge makes stale-branch detection unreliable via SHA; this is a terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md`: check off #67, #69, #70; add "AI visibility / AEO content guide" as the next content item (the planned Layer 1 guide that closes this workstream).

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the repo:

- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) / [llmstxt.site](https://llmstxt.site) / [llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub): submit llms.txt URL
- Bing Webmaster Tools: register + submit sitemap (unlocks AI Performance panel with citation counts)
- Google Search Console: verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program): submit site for source consideration

These will be handled out-of-band. The handbook's technical readiness for them shipped in #69 and #70.
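The terminal-state assertion described in the commit message above (check `git branch --show-current` immediately before `git checkout -b`) could be scripted along these lines. This is a sketch under assumptions: `assertOnBaseBranch` is a hypothetical helper, and `run` is a caller-supplied function that executes a command and returns its stdout. In a real script it might wrap `execFileSync` from `node:child_process`, which is not something the commit shows.

```javascript
// Hypothetical guard: throws unless the current branch is the expected base.
// `run(cmd, args)` must return the command's stdout as a string.
function assertOnBaseBranch(run, expected = "main") {
  // `git branch --show-current` prints nothing on a detached HEAD.
  const current = run("git", ["branch", "--show-current"]).trim();
  if (current !== expected) {
    const shown = current || "(detached HEAD)";
    throw new Error(
      `Expected to be on "${expected}" before branching, but on "${shown}"`
    );
  }
  return current;
}

// Fake runners stand in for git so the sketch runs anywhere.
const fakeOnMain = () => "main\n";
const fakeOnStale = () => "feat/stale-branch\n";

console.log(assertOnBaseBranch(fakeOnMain)); // main
try {
  assertOnBaseBranch(fakeOnStale);
} catch (err) {
  console.log(err.message);
}
```

Checking the end state rather than replaying the steps that should have produced it is what makes the guard independent of SHAs, which squash-merging rewrites.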
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI allow-list, JSON-LD differentiation, identity graph) and #70 (llms-full.txt). This is the Layer 1 content deliverable: a practitioner-facing guide on making any site legible to AI answer engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt**: root-level Markdown map (linked to the [llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt**: full-content concatenation for one-shot ingestion.
3. **robots.txt AI bot explicit allow-list**: linked to each vendor's own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation**: `TechArticle`, `HowTo`, `citation`, `@graph`; linked to [schema.org](https://schema.org/) and [Google's structured data docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions**: Bing Webmaster (AI Performance panel), Google Search Console, `llms.txt` directories, Perplexity Publishers.

Plus an explicit **What to Avoid** section: spam-for-AI (generating low-effort content to game citations) is called out as the failure mode that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice links to its primary source. A final **Inspiration** section acknowledges [Tw93's May 2026 post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis that triggered this handbook's own AEO baseline work, without making it a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content**: no hardcoded references to this project in the main body. Where the handbook's own implementation illustrates a practice, it appears in clearly marked callout blocks (for example, the `build.mjs` reference under the llms.txt section), not as the guide's main thread.
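As an illustration of the third signal listed above (an explicit robots.txt allow-list for AI bots), a build step might emit the file like this. This is a sketch, not the handbook's implementation: `buildRobotsTxt` and `AI_BOTS` are hypothetical names, and the user-agent tokens are examples that should be checked against each vendor's current bot documentation.

```javascript
// Example user-agent tokens; verify against each vendor's own documentation.
const AI_BOTS = [
  "GPTBot", // OpenAI
  "OAI-SearchBot", // OpenAI
  "ClaudeBot", // Anthropic
  "PerplexityBot", // Perplexity
  "Google-Extended", // Google
];

// Build a robots.txt body with one explicit Allow group per AI bot,
// a catch-all group, and a Sitemap line derived from the site URL.
function buildRobotsTxt(siteUrl, bots = AI_BOTS) {
  const groups = bots.map((bot) => `User-agent: ${bot}\nAllow: /`);
  groups.push("User-agent: *\nAllow: /");
  const sitemap = new URL("/sitemap.xml", siteUrl).href;
  return groups.join("\n\n") + `\n\nSitemap: ${sitemap}\n`;
}

const robotsTxt = buildRobotsTxt("https://example.com");
console.log(robotsTxt);
```

Listing each bot in its own group makes the intent explicit, and lets a single bot be switched to `Disallow` later without touching the others.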
## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 30 pages built (was 28, +2 for EN/ZH).
- Verified EN page emits `TechArticle` schema with author and publisher, matches other guides.
- Verified the new guide is indexed in `llms.txt` and included in `llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend interaction review does not apply.

## ROADMAP

Item will be checked off when this PR merges, following the same convention as other published guides.
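For readers unfamiliar with the `TechArticle` check mentioned in the Tested list above, the emitted markup might look roughly like the output of this sketch. The helper names (`buildTechArticleJsonLd`, `toScriptTag`), the page fields, and the sample values are all assumptions for illustration; the handbook's actual template is not shown in this PR.

```javascript
// Build a schema.org TechArticle object from hypothetical page metadata.
function buildTechArticleJsonLd(page) {
  return {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    headline: page.title,
    description: page.summary,
    url: page.url,
    dateModified: page.updated,
    inLanguage: page.lang,
    author: { "@type": "Person", name: page.author },
    publisher: { "@type": "Organization", name: page.publisher },
  };
}

// Serialize for embedding in HTML. Escaping "<" prevents a literal
// "</script>" inside any string value from closing the tag early.
function toScriptTag(jsonLd) {
  const json = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}

const jsonLdTag = toScriptTag(
  buildTechArticleJsonLd({
    title: "Example guide",
    summary: "A placeholder summary.",
    url: "https://example.com/guides/example/",
    updated: "2026-05-08",
    lang: "en",
    author: "Example Author",
    publisher: "Example Publisher",
  })
);
console.log(jsonLdTag);
```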
Summary
Adds `/llms-full.txt`: a single markdown document concatenating all published English pages, optimized for AI crawlers that prefer one-shot ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI allow-list, merged) by giving AI answer engines a single 43KB file containing the full handbook body.
Why this exists
The llms.txt specification defines two variants:
- `llms.txt`: concise link index (already shipped in #69, "infra: strengthen AEO baseline for AI answer engines")
- `llms-full.txt`: full content, one document

Answer engines that allocate constrained scrape budgets (ChatGPT search, Perplexity, Bing Copilot) often prefer the full version when available because it avoids per-page HTTP overhead and ambiguous ranking across small pages.
Contents of the generated file
- […] `llms.txt` index)
- […] `status: published`: drafts are excluded from AI indexes
- […] `## <title>` anchor) to avoid duplicate headings
- […] `---`) between pages so AI can parse boundaries

Size: 43KB for the current 12 published EN pages. Well under any practical scrape-budget threshold; will grow roughly linearly with published content.
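One way to picture the per-page layout listed above (a `## <title>` anchor in place of the page's own H1, a structured header, and `---` between pages) is the sketch below. The exact header format, the field names, and the helpers `stripLeadingH1`, `renderEntry`, and `renderAll` are assumptions; the generated file may differ in detail.

```javascript
// Drop a leading level-1 heading so it does not duplicate the `## <title>` anchor.
function stripLeadingH1(body) {
  return body.replace(/^\s*# .*\r?\n+/, "");
}

// Render one page: title anchor, structured header, then the body.
// The header layout is hypothetical; the PR only names its fields.
function renderEntry(page) {
  return [
    `## ${page.title}`,
    "",
    `- Section: ${page.section}`,
    `- URL: ${page.url}`,
    `- Last updated: ${page.updated}`,
    `- Summary: ${page.summary}`,
    "",
    stripLeadingH1(page.body).trim(),
  ].join("\n");
}

// Join entries with a horizontal rule so page boundaries are easy to parse.
function renderAll(pages) {
  return pages.map(renderEntry).join("\n\n---\n\n") + "\n";
}

const fullDoc = renderAll([
  {
    title: "First",
    section: "concepts",
    url: "https://example.com/first/",
    updated: "2026-05-01",
    summary: "One.",
    body: "# First\n\nBody one.\n",
  },
  {
    title: "Second",
    section: "guides",
    url: "https://example.com/second/",
    updated: "2026-05-02",
    summary: "Two.",
    body: "# Second\n\nBody two.\n",
  },
]);
console.log(fullDoc);
```

Note that a page body containing its own `---` rule would be indistinguishable from a page boundary in this sketch; the structured header after each `## <title>` is what lets a consumer tell the two apart.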
Implementation notes
- `buildPages` now returns the raw body per page (new `body` field on the `pages` array). Zero extra I/O: we already have it in memory during page rendering.
- `buildLlmsFullTxt` lives next to `buildLlmsTxt` and runs in the same parallel `Promise.all` batch in `main`.
- `llms.txt` header already advertises `llms-full.txt` (shipped via #69, "infra: strengthen AEO baseline for AI answer engines") so crawlers starting from the index discover it.

Tested
- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 28 pages built, 43,859 byte llms-full.txt.
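The parallel batch described under Implementation notes, together with a byte count like the one reported above, might be structured along these lines. The function names `buildLlmsTxt`, `buildLlmsFullTxt`, and `main` come from the PR; their bodies here are simplified stand-ins, and returning strings instead of writing files is an assumption made to keep the sketch self-contained.

```javascript
// Stand-in builders: names follow the PR, bodies are hypothetical.
async function buildLlmsTxt(pages) {
  return pages.map((p) => `- [${p.title}](${p.url})`).join("\n") + "\n";
}

async function buildLlmsFullTxt(pages) {
  // Reuses the in-memory `body` field, so no extra file reads are needed.
  return (
    pages.map((p) => `## ${p.title}\n\n${p.body.trim()}`).join("\n\n---\n\n") +
    "\n"
  );
}

// Run both builders concurrently and report the full document's size in bytes.
async function main(pages) {
  const [index, full] = await Promise.all([
    buildLlmsTxt(pages),
    buildLlmsFullTxt(pages),
  ]);
  return { index, full, fullBytes: Buffer.byteLength(full, "utf8") };
}

const demoPages = [
  { title: "First", url: "https://example.com/first/", body: "Body one.\n" },
  { title: "Second", url: "https://example.com/second/", body: "Body two.\n" },
];

const demoRun = main(demoPages);
demoRun.then((out) => console.log(`${out.fullBytes} bytes`));
```

`Buffer.byteLength` is used rather than `String.prototype.length` because the two diverge as soon as the content contains non-ASCII characters.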