build: generate llms-full.txt for AI ingestion #70

Merged: wilsonwangdev merged 1 commit into main from build/llms-full on May 8, 2026
wilsonwangdev (Owner) commented May 8, 2026

Summary

Adds /llms-full.txt — a single markdown document concatenating all published English pages, optimized for AI crawlers that prefer one-shot ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI allow-list, merged) by giving AI answer engines a single 43KB file containing the full handbook body.

Why this exists

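The llms.txt specification defines two variants: llms.txt, a link index of the site's pages, and llms-full.txt, the full content concatenated into a single file.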

Answer engines that allocate constrained scrape budgets (ChatGPT search, Perplexity, Bing Copilot) often prefer the full version when available because it avoids per-page HTTP overhead and ambiguous ranking across small pages.

Contents of the generated file

  • Header with site identity (author, repo, homepage, link back to llms.txt index)
  • Only pages with status: published — drafts are excluded from AI indexes
  • Ordered by section (concepts → guides → curated → evangelism) then title
  • Per-page structured header before body: section, canonical URL, last-updated, summary
  • Source H1 stripped (replaced by the structured ## <title> anchor) to avoid duplicate headings
  • Separators (---) between pages so AI can parse boundaries

Size: 43KB for current 12 published EN pages. Well under any practical scrape-budget threshold; will grow roughly linearly with published content.
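
The rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this PR: the page field names (`status`, `section`, `title`, `url`, `updated`, `summary`, `body`), the `site` object, and the exact header layout are all assumptions made for the example.

```javascript
// Hypothetical sketch: field names and header layout are assumptions,
// not the repository's actual page schema.
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

// Remove a leading top-level "# Title" line so it does not duplicate the
// structured "## <title>" anchor emitted for each entry.
function stripSourceH1(body) {
  return body.replace(/^\s*# .*(?:\r?\n)*/, "");
}

function buildLlmsFullTxt(pages, site) {
  const published = pages
    .filter((p) => p.status === "published") // drafts never reach AI indexes
    .sort(
      (a, b) =>
        // Section order first, then title. Unknown sections sort first (-1).
        SECTION_ORDER.indexOf(a.section) - SECTION_ORDER.indexOf(b.section) ||
        a.title.localeCompare(b.title)
    );

  const header = [
    `# ${site.name}`,
    "",
    `Author: ${site.author}`,
    `Repository: ${site.repo}`,
    `Homepage: ${site.url}`,
    `Index: ${site.url}/llms.txt`,
  ].join("\n");

  const entries = published.map((p) =>
    [
      `## ${p.title}`,
      "",
      `- Section: ${p.section}`,
      `- URL: ${p.url}`,
      `- Last updated: ${p.updated}`,
      `- Summary: ${p.summary}`,
      "",
      stripSourceH1(p.body).trim(),
    ].join("\n")
  );

  // A "---" line between blocks marks page boundaries for parsers.
  return [header, ...entries].join("\n\n---\n\n") + "\n";
}
```

Keeping the section order in one array means a new section only needs to be added in a single place for the ordering to stay deterministic.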

Implementation notes

  • buildPages now returns the raw body per page (new body field on the pages array). Zero extra I/O — we already have it in memory during page rendering.
  • buildLlmsFullTxt lives next to buildLlmsTxt and runs in the same parallel Promise.all batch in main.
  • llms.txt header already advertises llms-full.txt (shipped in #69, "infra: strengthen AEO baseline for AI answer engines") so crawlers starting from the index discover it.
  • AGENTS.md documents the status-filter contract so future agents know drafts are excluded by design.
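
The data flow described in these notes might look like the sketch below. Only the names `buildPages`, `buildLlmsTxt`, `buildLlmsFullTxt`, and `main` come from this PR; every signature and function body here is an assumption, simplified to show how the in-memory `body` field feeds both generators inside one `Promise.all` batch.

```javascript
// Hypothetical sketch: signatures and bodies are assumed for illustration.

// Stand-in for page rendering. It returns metadata plus the raw body it
// already holds in memory, so downstream generators need no extra reads.
async function buildPages(sources) {
  return sources.map((src) => ({
    title: src.title,
    url: src.url,
    status: src.status,
    body: src.markdown, // the new `body` field
  }));
}

async function buildLlmsTxt(pages, siteUrl) {
  const links = pages
    .filter((p) => p.status === "published")
    .map((p) => `- [${p.title}](${p.url})`);
  // The index header advertises the full-content variant.
  return ["# Handbook", "", `Full content: ${siteUrl}/llms-full.txt`, "", ...links].join("\n");
}

async function buildLlmsFullTxt(pages) {
  return pages
    .filter((p) => p.status === "published")
    .map((p) => `## ${p.title}\n\n${p.body}`)
    .join("\n\n---\n\n");
}

async function main(sources, siteUrl) {
  const pages = await buildPages(sources);
  // Both generators read the same in-memory pages and run concurrently.
  const [llmsTxt, llmsFullTxt] = await Promise.all([
    buildLlmsTxt(pages, siteUrl),
    buildLlmsFullTxt(pages),
  ]);
  return { "llms.txt": llmsTxt, "llms-full.txt": llmsFullTxt };
}
```

In this sketch `main` returns the generated text keyed by file name; writing the files to disk is left out to keep the example self-contained.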

Tested

  • SITE_URL=https://agent-master-handbook.vercel.app npm run build — 28 pages built, 43,859 byte llms-full.txt.
  • Verified: 12 published pages included (2 concepts + 5 guides + 4 curated + 1 evangelism), correct section ordering, structured headers on each entry, no duplicate H1s after source-H1 strip.
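
A check like the one described above could be scripted along these lines. The helper is hypothetical and assumes a particular layout for the generated file: entries separated by a line containing only `---`, each entry starting with `## <title>`, and each carrying a `Section: <name>` line.

```javascript
// Hypothetical checker: the entry layout it parses is an assumption.
function summarizeLlmsFull(text) {
  const entries = text
    .split(/^---$/m)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.startsWith("## ")); // skips the site header

  const perSection = {};
  let entriesWithH1 = 0;
  for (const entry of entries) {
    const match = entry.match(/^(?:- )?Section: (.+)$/m);
    const section = match ? match[1].trim() : "unknown";
    perSection[section] = (perSection[section] || 0) + 1;
    // A surviving top-level heading inside an entry suggests the H1 strip missed it.
    if (/^# /m.test(entry)) entriesWithH1 += 1;
  }
  return { total: entries.length, perSection, entriesWithH1 };
}
```

For the build reported here, the expected result would be a total of 12 with 2 concepts, 5 guides, 4 curated, and 1 evangelism entry, and no entries with a leftover H1. The H1 test is deliberately crude: a line starting with `# ` inside a fenced code block (a shell comment, for instance) would register as a false positive.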

vercel Bot (Contributor) commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agent-master-handbook | Ready | Preview, Comment | May 8, 2026 3:03pm |

Base automatically changed from infra/aeo-baseline to main May 8, 2026 14:59
Concatenates all published English pages into a single ~43KB markdown
document at /llms-full.txt. Drafts and hidden pages excluded. Ordered
by section (concepts → guides → curated → evangelism) then title.

Each entry carries a structured header (section, URL, last-updated,
summary) so AI crawlers can anchor citations back to specific pages
within the aggregate document.

Complements llms.txt (link index) and sitemap.xml (machine-readable
URL list). llms.txt now advertises llms-full.txt in its header.
@wilsonwangdev wilsonwangdev merged commit 2c18718 into main May 8, 2026
3 checks passed
@wilsonwangdev wilsonwangdev deleted the build/llms-full branch May 8, 2026 15:07
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the
process failure that surfaced mid-session and updates the pre-work
checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan
commits from a stale local branch whose contents had already
squash-merged to `main` as #68. Different SHAs hid the equivalence from
`git fetch -p`. The reviewer caught it; both PRs were rebased onto
`origin/main` and force-pushed.

The journal entry is the third PR-hygiene failure recorded. The pattern
across all three: rules exist, but the procedural trigger points where
rules must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md` — what happened,
root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` Git Workflow — new pre-work step 9: explicitly verify `git
branch --show-current` returns `main` right before `git checkout -b`.
Squash-merge makes stale-branch detection unreliable via SHA; this is a
terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md` — check off #67, #69, #70; add "AI visibility / AEO
content guide" as the next content item (the planned Layer 1 guide that
closes this workstream).
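
As an illustration of the step 9 check, the guard below asserts the end state (the current branch is `main`) instead of comparing commit SHAs, which a squash-merge rewrites. The function name and error wording are hypothetical; the caller is assumed to pass in the output of `git branch --show-current`.

```javascript
// Hypothetical pre-work guard. The caller supplies the output of
// `git branch --show-current`; how it is captured is left open here.
function assertOnBaseBranch(currentBranch, expected = "main") {
  const name = currentBranch.trim();
  if (name !== expected) {
    // `git branch --show-current` prints nothing on a detached HEAD.
    throw new Error(
      `Refusing to branch: on "${name || "(detached HEAD)"}", expected "${expected}".`
    );
  }
  return name;
}
```

Run immediately before `git checkout -b`, a guard like this fails loudly when the working tree is still on a stale feature branch, regardless of whether that branch's commits already landed on `main` under different SHAs.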

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the
repo:
- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) /
[llmstxt.site](https://llmstxt.site) /
[llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub) — submit
llms.txt URL
- Bing Webmaster Tools — register + submit sitemap (unlocks AI
Performance panel with citation counts)
- Google Search Console — verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program) —
submit site for source consideration

These will be handled out-of-band. The handbook's technical readiness
for them shipped in #69 and #70.
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI
allow-list, JSON-LD differentiation, identity graph) and #70
(llms-full.txt). This is the Layer 1 content deliverable — a
practitioner-facing guide on making any site legible to AI answer
engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt** — root-level Markdown map (linked to the
[llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt** — full-content concatenation for one-shot
ingestion.
3. **robots.txt AI bot explicit allow-list** — linked to each vendor's
own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation** — `TechArticle`, `HowTo`, `citation`,
`@graph`; linked to [schema.org](https://schema.org/) and [Google's
structured data
docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions** — Bing Webmaster (AI Performance panel), Google
Search Console, `llms.txt` directories, Perplexity Publishers.
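
For signal 3, an explicit allow-list can be generated with a few lines. This is a sketch under stated assumptions: the bot tokens below are examples of user-agent names the vendors have published and should be confirmed against each vendor's current documentation, and `buildRobotsTxt` is a hypothetical helper, not part of this repository.

```javascript
// Illustrative bot tokens; confirm current names in each vendor's documentation.
const AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

// Emit one explicit group per AI crawler, then a catch-all group and the sitemap.
function buildRobotsTxt(siteUrl, bots = AI_BOTS) {
  const groups = bots.map((bot) => `User-agent: ${bot}\nAllow: /`);
  groups.push("User-agent: *\nAllow: /");
  return groups.join("\n\n") + `\n\nSitemap: ${siteUrl}/sitemap.xml\n`;
}
```

Naming each crawler explicitly, even though the catch-all group already allows everything, makes the intent unambiguous to anyone reading the file.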

Plus an explicit **What to Avoid** section: spam-for-AI (generating
low-effort content to game citations) is called out as the failure mode
that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice
links to its primary source. A final **Inspiration** section
acknowledges [Tw93's May 2026
post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis
that triggered this handbook's own AEO baseline work, without making it
a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content** — no hardcoded references to this project
in the main body. Where the handbook's own implementation illustrates a
practice, it appears in clearly marked callout blocks (for example, the
`build.mjs` reference under the llms.txt section), not as the guide's
main thread.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build` — 30
pages built (was 28, +2 for EN/ZH).
- Verified EN page emits `TechArticle` schema with author and publisher,
matches other guides.
- Verified the new guide is indexed in `llms.txt` and included in
`llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend
interaction review does not apply.
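
For reference, a `TechArticle` payload with author and publisher, as checked above, could be produced by something like the following. The helper names and the `page` and `site` field names are assumptions for illustration; only the schema.org types and property names are standard.

```javascript
// Hypothetical helper: the page and site field names are assumptions.
function techArticleJsonLd(page, site) {
  return {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    headline: page.title,
    description: page.summary,
    url: page.url,
    dateModified: page.updated,
    author: { "@type": "Person", name: site.author },
    publisher: { "@type": "Organization", name: site.name },
  };
}

// Serialize for embedding in a <script type="application/ld+json"> element.
function toJsonLdScript(data) {
  // Escape "<" so page text cannot close the script element early.
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

Escaping `<` keeps the output valid JSON while preventing a title or summary that contains `</script>` from breaking the page.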

## ROADMAP

Item will be checked off when this PR merges, following the same
convention as other published guides.