infra: strengthen AEO baseline for AI answer engines #69
Merged
- robots.txt explicitly allow-lists 17 AI training and search crawlers (GPTBot, ClaudeBot, OAI-SearchBot, Claude-SearchBot, PerplexityBot, Google-Extended, etc.) as an intent signal, not just permission.
- JSON-LD differentiates by section: concepts/guides emit TechArticle, curated/evangelism emit Article. Curated pages now emit citation and isBasedOn pointing to the source URL from frontmatter.
- Homepage emits an @graph linking WebSite, Organization, Person, and WebPage so AI engines can answer "who publishes X" without guessing.
- package.json gains an author block as single source of truth for identity in JSON-LD and llms.txt.
- llms.txt header now declares Author, Repository, and Homepage so the AI-readable index carries publisher identity.
- AGENTS.md Quality Baseline documents the new conventions so future agents enforce them.
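A minimal sketch of how an explicit allow-list like the one in the first bullet could be generated. The helper name `buildRobotsTxt`, the shortened bot list, and the optional `Sitemap` line are assumptions for illustration; the repository's actual generator may look different.

```javascript
// Hypothetical helper: the function name and the shortened bot list are
// illustrative, not taken from the repository's build script.
const AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

function buildRobotsTxt(bots, sitemapUrl) {
  // One explicit group per crawler: grants nothing new, but states intent.
  const groups = bots.map((bot) => `User-agent: ${bot}\nAllow: /`);
  // The wildcard group still covers every crawler not named above.
  groups.push("User-agent: *\nAllow: /");
  const blocks = [groups.join("\n\n")];
  // Assumption: the site also advertises a sitemap in robots.txt.
  if (sitemapUrl) blocks.push(`Sitemap: ${sitemapUrl}`);
  return blocks.join("\n\n") + "\n";
}

console.log(buildRobotsTxt(AI_BOTS, "https://example.com/sitemap.xml"));
```

Keeping the bot names in one array makes the list easy to audit against each vendor's crawler documentation when names change.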
2cf62a3 to 4d59309
wilsonwangdev added a commit that referenced this pull request on May 8, 2026
## Summary

Adds `/llms-full.txt`: a single markdown document concatenating all published English pages, optimized for AI crawlers that prefer one-shot ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI allow-list, merged) by giving AI answer engines a single 43KB file containing the full handbook body.

## Why this exists

The [llms.txt specification](https://llmstxt.org/) defines two variants:

- `llms.txt`: concise link index (already shipped in #69)
- `llms-full.txt`: full content, one document

Answer engines that allocate constrained scrape budgets (ChatGPT search, Perplexity, Bing Copilot) often prefer the full version when available because it avoids per-page HTTP overhead and ambiguous ranking across small pages.

## Contents of the generated file

- Header with site identity (author, repo, homepage, link back to `llms.txt` index)
- Only pages with `status: published`; drafts are excluded from AI indexes
- Ordered by section (concepts → guides → curated → evangelism) then title
- Per-page structured header before body: section, canonical URL, last-updated, summary
- Source H1 stripped (replaced by the structured `## <title>` anchor) to avoid duplicate headings
- Separators (`---`) between pages so AI can parse boundaries

Size: 43KB for current 12 published EN pages. Well under any practical scrape-budget threshold; will grow roughly linearly with published content.

## Implementation notes

- `buildPages` now returns the raw body per page (new `body` field on the `pages` array). Zero extra I/O: we already have it in memory during page rendering.
- `buildLlmsFullTxt` lives next to `buildLlmsTxt` and runs in the same parallel `Promise.all` batch in `main`.
- `llms.txt` header already advertises `llms-full.txt` (shipped via #69) so crawlers starting from the index discover it.
- AGENTS.md documents the status-filter contract so future agents know drafts are excluded by design.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 28 pages built, 43,859 byte llms-full.txt.
- Verified: 12 published pages included (2 concepts + 5 guides + 4 curated + 1 evangelism), correct section ordering, structured headers on each entry, no duplicate H1s after source-H1 strip.
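The rules listed under "Contents of the generated file" can be sketched roughly as follows. This is not the repository's `buildLlmsFullTxt`: the page fields (`status`, `section`, `title`, `url`, `updated`, `summary`, `body`), the `site` object, and the exact header wording are assumptions chosen to make the filter, ordering, H1 strip, and separators concrete.

```javascript
// Sketch only: the page and site shapes and this function body are
// assumptions, not the repository's code.
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

function buildLlmsFullTxt(pages, site) {
  const published = pages
    .filter((p) => p.status === "published") // drafts never reach AI indexes
    .sort(
      (a, b) =>
        SECTION_ORDER.indexOf(a.section) - SECTION_ORDER.indexOf(b.section) ||
        a.title.localeCompare(b.title)
    );

  const header = [
    `# ${site.name} (full content)`,
    "",
    `Author: ${site.author}`,
    `Homepage: ${site.homepage}`,
    `Index: ${site.homepage}/llms.txt`,
  ].join("\n");

  const entries = published.map((p) => {
    // Drop a leading H1 only; the "## <title>" anchor below replaces it.
    const body = p.body.replace(/^\s*# .*/, "").trim();
    return [
      `## ${p.title}`,
      "",
      `- Section: ${p.section}`,
      `- URL: ${p.url}`,
      `- Last updated: ${p.updated}`,
      `- Summary: ${p.summary}`,
      "",
      body,
    ].join("\n");
  });

  // A horizontal rule between documents marks page boundaries.
  return [header, ...entries].join("\n\n---\n\n") + "\n";
}
```

In this sketch a page whose section is missing from `SECTION_ORDER` sorts first, because `indexOf` returns -1; a real build would probably want to reject unknown sections instead.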
wilsonwangdev added a commit that referenced this pull request on May 8, 2026
## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the process failure that surfaced mid-session and updates the pre-work checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan commits from a stale local branch whose contents had already squash-merged to `main` as #68. Different SHAs hid the equivalence from `git fetch -p`. The reviewer caught it; both PRs were rebased onto `origin/main` and force-pushed.

The journal entry is the third PR-hygiene failure recorded. The pattern across all three: rules exist, but the procedural trigger points where rules must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md`: what happened, root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` Git Workflow: new pre-work step 9, which explicitly verifies that `git branch --show-current` returns `main` right before `git checkout -b`. Squash-merge makes stale-branch detection unreliable via SHA; this is a terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md`: check off #67, #69, #70; add "AI visibility / AEO content guide" as the next content item (the planned Layer 1 guide that closes this workstream).

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the repo:

- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) / [llmstxt.site](https://llmstxt.site) / [llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub): submit llms.txt URL
- Bing Webmaster Tools: register + submit sitemap (unlocks AI Performance panel with citation counts)
- Google Search Console: verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program): submit site for source consideration

These will be handled out-of-band. The handbook's technical readiness for them shipped in #69 and #70.
wilsonwangdev added a commit that referenced this pull request on May 8, 2026
## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI allow-list, JSON-LD differentiation, identity graph) and #70 (llms-full.txt). This is the Layer 1 content deliverable: a practitioner-facing guide on making any site legible to AI answer engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt**: root-level Markdown map (linked to the [llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt**: full-content concatenation for one-shot ingestion.
3. **robots.txt AI bot explicit allow-list**: linked to each vendor's own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation**: `TechArticle`, `HowTo`, `citation`, `@graph`; linked to [schema.org](https://schema.org/) and [Google's structured data docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions**: Bing Webmaster (AI Performance panel), Google Search Console, `llms.txt` directories, Perplexity Publishers.

Plus an explicit **What to Avoid** section: spam-for-AI (generating low-effort content to game citations) is called out as the failure mode that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice links to its primary source. A final **Inspiration** section acknowledges [Tw93's May 2026 post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis that triggered this handbook's own AEO baseline work, without making it a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content**: no hardcoded references to this project in the main body. Where the handbook's own implementation illustrates a practice, it appears in clearly marked callout blocks (for example, the `build.mjs` reference under the llms.txt section), not as the guide's main thread.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 30 pages built (was 28, +2 for EN/ZH).
- Verified EN page emits `TechArticle` schema with author and publisher, matches other guides.
- Verified the new guide is indexed in `llms.txt` and included in `llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend interaction review does not apply.

## ROADMAP

Item will be checked off when this PR merges, following the same convention as other published guides.
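For the first signal in the guide, a root-level `llms.txt` map, a minimal generator could look like the sketch below. The output layout (an H1 title, a blockquote summary, then H2 sections of Markdown links) follows the format described at llmstxt.org; the function name `buildLlmsIndex` and the input shape are hypothetical and are not taken from the guide or from `build.mjs`.

```javascript
// Sketch under assumptions: buildLlmsIndex and the input shape are
// hypothetical; only the output layout follows the llmstxt.org format.
function buildLlmsIndex(site, sections) {
  // H1 with the site name, then a one-line blockquote summary.
  const out = [`# ${site.name}`, "", `> ${site.summary}`];
  for (const [heading, links] of Object.entries(sections)) {
    out.push("", `## ${heading}`, "");
    for (const link of links) {
      // The description after the colon is optional.
      const note = link.description ? `: ${link.description}` : "";
      out.push(`- [${link.title}](${link.url})${note}`);
    }
  }
  return out.join("\n") + "\n";
}

console.log(
  buildLlmsIndex(
    { name: "Example Docs", summary: "Short summary of the site." },
    { Guides: [{ title: "Setup", url: "https://example.com/setup.md", description: "Install steps" }] }
  )
);
```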
## Summary

Strengthens the site's Answer Engine Optimization (AEO / GEO) baseline so AI answer engines (ChatGPT, Claude, Perplexity, Google AI Overview, Bing Copilot) can discover, attribute, and cite handbook content more accurately.

Builds on #29 (SEO foundation) and #30 (discovery artifacts). Not a new direction: a deepening of existing work as AI answer engines become a material traffic channel.

## Changes

- robots.txt: explicit allow-list for 17 AI training and search crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Googlebot, Bingbot, Applebot, Applebot-Extended, DuckAssistBot, Meta-ExternalAgent, cohere-ai). Existing `User-agent: *` already allowed these implicitly; the explicit listing is an intent signal, not a permission change.
- JSON-LD differentiation:
  - Concepts and guides: `TechArticle` instead of generic `Article`.
  - Curated pages: `citation` and `isBasedOn` pointing to the source URL from existing `source:` frontmatter, so AI citations can trace to primary sources.
  - `author` (Person) and `publisher` (Organization).
- Homepage `@graph`: links `WebSite` ↔ `Organization` ↔ `Person` ↔ `WebPage` via `@id` references, so AI engines answering "who publishes Agent Master Handbook" or "who is the author" get structured ground truth.
- package.json: adds `author` block as single source of truth for identity used by JSON-LD and llms.txt. No hardcoding in templates or build.mjs.
- llms.txt: header now declares Author, Repository, and Homepage so the AI-readable index carries publisher identity on first read.
- AGENTS.md Quality Baseline: documents the new conventions so future agents enforce them (curated `source:` requirement, SECTION_SCHEMA mapping, robots.txt explicit-listing rule, identity source-of-truth rule).

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 28 pages built, no errors.

## Not in this PR

- `llms-full.txt` generation: separate PR (`build/llms-full`) to avoid mixing concerns.
- `build/build.mjs` grew to 489 lines (past the ~300 ceiling noted in AGENTS.md). The JSON-LD block is cohesive enough to extract into `build/seo.mjs`; flagged as a follow-up refactor PR, not mixed into this one.
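To make the JSON-LD changes concrete, here is a hedged sketch of a section-to-schema mapping and a homepage `@graph` wired together with `@id` references. `SECTION_SCHEMA` is named in this PR, but its contents here, the helper names `articleJsonLd` and `homepageGraph`, the `identity` and `page` shapes, and the choice of linking properties (`publisher`, `founder`, `isPartOf`, `about`) are all assumptions, not the code in `build/build.mjs`.

```javascript
// Sketch only: helper names, object shapes, and the linking properties
// are illustrative assumptions, not the repository's implementation.
const SECTION_SCHEMA = {
  concepts: "TechArticle",
  guides: "TechArticle",
  curated: "Article",
  evangelism: "Article",
};

function articleJsonLd(page, identity) {
  const doc = {
    "@context": "https://schema.org",
    "@type": SECTION_SCHEMA[page.section] ?? "Article",
    headline: page.title,
    url: page.url,
    author: { "@type": "Person", name: identity.author.name, url: identity.author.url },
    publisher: { "@type": "Organization", name: identity.siteName, url: identity.siteUrl },
  };
  // Curated pages carry a `source:` frontmatter URL; expose it both ways.
  if (page.source) {
    doc.citation = page.source;
    doc.isBasedOn = page.source;
  }
  return doc;
}

function homepageGraph(identity) {
  const base = identity.siteUrl;
  const ids = {
    site: `${base}/#website`,
    org: `${base}/#organization`,
    person: `${base}/#person`,
    page: `${base}/#webpage`,
  };
  // Each node points at the others by @id, so the graph is self-describing.
  return {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "@id": ids.site, url: base, name: identity.siteName, publisher: { "@id": ids.org } },
      { "@type": "Organization", "@id": ids.org, name: identity.siteName, url: base, founder: { "@id": ids.person } },
      { "@type": "Person", "@id": ids.person, name: identity.author.name, url: identity.author.url },
      { "@type": "WebPage", "@id": ids.page, url: base, isPartOf: { "@id": ids.site }, about: { "@id": ids.org } },
    ],
  };
}
```

Passing a single `identity` object into both helpers mirrors the single-source-of-truth rule: templates never spell out the author or site name themselves.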