infra: strengthen AEO baseline for AI answer engines by wilsonwangdev · Pull Request #69 · wilsonwangdev/agent-master-handbook

wilsonwangdev · 2026-05-08T14:38:26Z

Summary

Strengthens the site's Answer Engine Optimization (AEO / GEO) baseline so AI answer engines (ChatGPT, Claude, Perplexity, Google AI Overview, Bing Copilot) can discover, attribute, and cite handbook content more accurately.

Builds on #29 (SEO foundation) and #30 (discovery artifacts). Not a new direction — a deepening of existing work as AI answer engines become a material traffic channel.

Changes

robots.txt — explicit allow-list for 17 AI training and search crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Googlebot, Bingbot, Applebot, Applebot-Extended, DuckAssistBot, Meta-ExternalAgent, cohere-ai). Existing User-agent: * already allowed these implicitly — the explicit listing is an intent signal, not a permission change.
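A minimal sketch of how an explicit allow-list like this could be generated at build time. The function name `buildRobotsTxt`, the `/sitemap.xml` path, and the exact output layout are assumptions for illustration; the PR does not show the real generator.

```javascript
// Hypothetical helper, not taken from the repository's build.mjs.
// The 17 user-agent names are the ones listed in this PR description.
const AI_USER_AGENTS = [
  "GPTBot", "OAI-SearchBot", "ChatGPT-User",
  "ClaudeBot", "Claude-SearchBot", "Claude-User", "anthropic-ai",
  "PerplexityBot", "Perplexity-User",
  "Google-Extended", "Googlebot", "Bingbot",
  "Applebot", "Applebot-Extended",
  "DuckAssistBot", "Meta-ExternalAgent", "cohere-ai",
];

function buildRobotsTxt(siteUrl, userAgents = AI_USER_AGENTS) {
  // One explicit group per crawler: an intent signal, not a permission change.
  const explicit = userAgents.map((ua) => `User-agent: ${ua}\nAllow: /`);
  // The wildcard group keeps every other crawler allowed, as before.
  const wildcard = "User-agent: *\nAllow: /";
  // Assumption: the sitemap lives at /sitemap.xml under the site root.
  const sitemap = `Sitemap: ${new URL("/sitemap.xml", siteUrl).href}`;
  return [...explicit, wildcard, sitemap].join("\n\n") + "\n";
}
```

Keeping the names in one array means the "17 agents + wildcard + sitemap" check in the Tested section reduces to counting `User-agent:` lines in the output.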

JSON-LD differentiation:

Concepts and guides emit TechArticle instead of generic Article.
Curated entries emit citation and isBasedOn pointing to the source URL from existing source: frontmatter, so AI citations can trace to primary sources.
All pages emit author (Person) and publisher (Organization).
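The differentiation above could be sketched roughly as follows. `SECTION_SCHEMA` is named in this PR, but its contents, the `page` and `identity` field names, and the function `buildArticleJsonLd` are assumptions made for illustration.

```javascript
// Assumed mapping: the PR names SECTION_SCHEMA but does not show its contents.
const SECTION_SCHEMA = {
  concepts: "TechArticle",
  guides: "TechArticle",
  curated: "Article",
  evangelism: "Article",
};

// Hypothetical builder; `page` and `identity` shapes are illustrative only.
function buildArticleJsonLd(page, identity) {
  const jsonLd = {
    "@context": "https://schema.org",
    "@type": SECTION_SCHEMA[page.section] ?? "Article",
    headline: page.title,
    url: page.url,
    author: { "@type": "Person", name: identity.authorName, url: identity.authorUrl },
    publisher: { "@type": "Organization", name: identity.publisherName, url: identity.publisherUrl },
  };
  // Curated entries point back to their primary source (the `source:` frontmatter).
  if (page.section === "curated" && page.source) {
    jsonLd.citation = page.source;
    jsonLd.isBasedOn = page.source;
  }
  return jsonLd;
}
```

The result would typically be serialized with `JSON.stringify` into a `<script type="application/ld+json">` tag in the page template.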

Homepage @graph — links WebSite ↔ Organization ↔ Person ↔ WebPage via @id references, so AI engines answering "who publishes Agent Master Handbook" or "who is the author" get structured ground truth.
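As an illustration of the `@id` linking, a homepage graph might be assembled like this. The fragment identifiers (`#website`, `#organization`, and so on), the `identity` fields, and the choice of `publisher`, `founder`, `isPartOf`, and `author` as the connecting properties are assumptions; the PR only states which four node types are linked.

```javascript
// Hypothetical builder; the node ids and linking properties are illustrative choices.
function buildHomepageGraph(siteUrl, identity) {
  const base = siteUrl.replace(/\/+$/, "");
  const ids = {
    website: `${base}/#website`,
    organization: `${base}/#organization`,
    person: `${base}/#person`,
    webpage: `${base}/#webpage`,
  };
  return {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "@id": ids.website, url: `${base}/`, name: identity.siteName,
        publisher: { "@id": ids.organization } },
      { "@type": "Organization", "@id": ids.organization, name: identity.siteName, url: `${base}/`,
        founder: { "@id": ids.person } },
      { "@type": "Person", "@id": ids.person, name: identity.authorName, url: identity.authorUrl },
      { "@type": "WebPage", "@id": ids.webpage, url: `${base}/`,
        isPartOf: { "@id": ids.website }, author: { "@id": ids.person } },
    ],
  };
}
```

Each node is defined once and referenced elsewhere by `@id`, so "who publishes" and "who is the author" resolve to a single Organization and a single Person rather than repeated inline copies.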

package.json — adds author block as single source of truth for identity used by JSON-LD and llms.txt. No hardcoding in templates or build.mjs.

llms.txt — header now declares Author, Repository, and Homepage so the AI-readable index carries publisher identity on first read.
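A rough sketch of deriving the llms.txt header from package.json rather than hardcoding identity. The function name `buildLlmsTxtHeader`, the exact header layout, and the use of the `repository` and `homepage` fields are assumptions; the PR only says the `author` block is the single source of truth.

```javascript
// Hypothetical helper; `pkg` is the parsed package.json object.
function buildLlmsTxtHeader(pkg, siteName, summary) {
  // npm allows `author` as a string or as an object with a `name` field.
  const author = typeof pkg.author === "string" ? pkg.author : pkg.author?.name;
  // npm allows `repository` as a string or as an object with a `url` field.
  const repository = typeof pkg.repository === "string" ? pkg.repository : pkg.repository?.url;
  return [
    `# ${siteName}`,
    "",
    `> ${summary}`,
    "",
    `- Author: ${author}`,
    `- Repository: ${repository}`,
    `- Homepage: ${pkg.homepage}`,
    "",
  ].join("\n");
}
```

Because the same `pkg` object can also feed the JSON-LD builders, a change to the author in package.json would propagate to both outputs on the next build.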

AGENTS.md Quality Baseline — documents the new conventions so future agents enforce them (curated source: requirement, SECTION_SCHEMA mapping, robots.txt explicit-listing rule, identity source-of-truth rule).

Tested

SITE_URL=https://agent-master-handbook.vercel.app npm run build — 28 pages built, no errors.
Manually inspected JSON-LD on concept page (TechArticle + author + publisher), curated page (Article + citation + isBasedOn), and homepage (@graph with WebSite/Organization/Person/WebPage).
Verified robots.txt lists 17 AI user-agents + wildcard fallback + sitemap.

Not in this PR

llms-full.txt generation — separate PR (build/llms-full) to avoid mixing concerns.
External submissions (Bing Webmaster, llmstxt.cloud directory, Perplexity Publisher) — operational, tracked in journal.
AEO/GEO content guide — separate content PR.
build/build.mjs grew to 489 lines (past the ~300 ceiling noted in AGENTS.md). The JSON-LD block is cohesive enough to extract into build/seo.mjs — flagged as a follow-up refactor PR, not mixed into this one.

vercel · 2026-05-08T14:38:31Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agent-master-handbook	Ready	Preview, Comment	May 8, 2026 2:54pm


- robots.txt explicitly allow-lists 17 AI training and search crawlers (GPTBot, ClaudeBot, OAI-SearchBot, Claude-SearchBot, PerplexityBot, Google-Extended, etc.) as an intent signal, not just permission.
- JSON-LD differentiates by section: concepts/guides emit TechArticle, curated/evangelism emit Article. Curated pages now emit citation and isBasedOn pointing to the source URL from frontmatter.
- Homepage emits an @graph linking WebSite, Organization, Person, and WebPage so AI engines can answer "who publishes X" without guessing.
- package.json gains an author block as single source of truth for identity in JSON-LD and llms.txt.
- llms.txt header now declares Author, Repository, and Homepage so the AI-readable index carries publisher identity.
- AGENTS.md Quality Baseline documents the new conventions so future agents enforce them.

## Summary

Adds `/llms-full.txt`: a single markdown document concatenating all published English pages, optimized for AI crawlers that prefer one-shot ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI allow-list, merged) by giving AI answer engines a single 43KB file containing the full handbook body.

## Why this exists

The [llms.txt specification](https://llmstxt.org/) defines two variants:

- `llms.txt`: concise link index (already shipped in #69)
- `llms-full.txt`: full content, one document

Answer engines that allocate constrained scrape budgets (ChatGPT search, Perplexity, Bing Copilot) often prefer the full version when available because it avoids per-page HTTP overhead and ambiguous ranking across small pages.

## Contents of the generated file

- Header with site identity (author, repo, homepage, link back to `llms.txt` index)
- Only pages with `status: published`; drafts are excluded from AI indexes
- Ordered by section (concepts → guides → curated → evangelism) then title
- Per-page structured header before body: section, canonical URL, last-updated, summary
- Source H1 stripped (replaced by the structured `## <title>` anchor) to avoid duplicate headings
- Separators (`---`) between pages so AI can parse boundaries

Size: 43KB for current 12 published EN pages. Well under any practical scrape-budget threshold; will grow roughly linearly with published content.

## Implementation notes

- `buildPages` now returns the raw body per page (new `body` field on the `pages` array). Zero extra I/O, since the body is already in memory during page rendering.
- `buildLlmsFullTxt` lives next to `buildLlmsTxt` and runs in the same parallel `Promise.all` batch in `main`.
- `llms.txt` header already advertises `llms-full.txt` (shipped via #69) so crawlers starting from the index discover it.
- AGENTS.md documents the status-filter contract so future agents know drafts are excluded by design.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 28 pages built, 43,859 byte llms-full.txt.
- Verified: 12 published pages included (2 concepts + 5 guides + 4 curated + 1 evangelism), correct section ordering, structured headers on each entry, no duplicate H1s after source-H1 strip.
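The generation rules described above (status filter, section-then-title ordering, structured per-page header, H1 strip, `---` separators) could be sketched as below. `buildLlmsFullTxt` is the name used in the PR, but this body, the `page` field names (`status`, `lang`, `section`, `title`, `url`, `updated`, `summary`, `body`), and the helper `stripLeadingH1` are assumptions, not the repository's actual code.

```javascript
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

// Hypothetical helper: drop a leading "# Title" line so only the "## <title>" anchor remains.
function stripLeadingH1(body) {
  return body.replace(/^\s*# .*(?:\r?\n)+/, "");
}

// Sketch only; field names on `pages` entries are assumed for illustration.
function buildLlmsFullTxt(pages, header) {
  const entries = pages
    // Drafts and non-English pages are excluded by design.
    .filter((p) => p.status === "published" && p.lang === "en")
    // Order by section first, then alphabetically by title.
    .sort((a, b) =>
      SECTION_ORDER.indexOf(a.section) - SECTION_ORDER.indexOf(b.section) ||
      a.title.localeCompare(b.title))
    .map((p) => [
      `## ${p.title}`,
      "",
      `- Section: ${p.section}`,
      `- URL: ${p.url}`,
      `- Last updated: ${p.updated}`,
      `- Summary: ${p.summary}`,
      "",
      stripLeadingH1(p.body).trim(),
    ].join("\n"));
  // A horizontal rule between documents marks page boundaries.
  return [header.trim(), ...entries].join("\n\n---\n\n") + "\n";
}
```

Since the function only reads fields already held in memory, it fits the "zero extra I/O" note: it can run alongside the other artifact builders without touching the filesystem again.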

## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the process failure that surfaced mid-session and updates the pre-work checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan commits from a stale local branch whose contents had already squash-merged to `main` as #68. Different SHAs hid the equivalence from `git fetch -p`. The reviewer caught it; both PRs were rebased onto `origin/main` and force-pushed.

The journal entry is the third PR-hygiene failure recorded. The pattern across all three: rules exist, but the procedural trigger points where rules must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md`: what happened, root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` Git Workflow: new pre-work step 9 (explicitly verify `git branch --show-current` returns `main` right before `git checkout -b`). Squash-merge makes stale-branch detection unreliable via SHA; this is a terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md`: check off #67, #69, #70; add "AI visibility / AEO content guide" as the next content item (the planned Layer 1 guide that closes this workstream).

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the repo:

- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) / [llmstxt.site](https://llmstxt.site) / [llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub): submit llms.txt URL
- Bing Webmaster Tools: register + submit sitemap (unlocks AI Performance panel with citation counts)
- Google Search Console: verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program): submit site for source consideration

These will be handled out-of-band. The handbook's technical readiness for them shipped in #69 and #70.
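The pre-work step 9 check described above is a terminal-state assertion, which could be expressed in a small script along these lines. The function name `assertCleanBase` and its error wording are hypothetical; the only thing taken from the PR is the rule that `git branch --show-current` must return `main` right before `git checkout -b`.

```javascript
// Hypothetical guard: pass it the raw output of `git branch --show-current`.
// In a Node script that output could come from, for example:
//   execFileSync("git", ["branch", "--show-current"], { encoding: "utf8" })
function assertCleanBase(currentBranchOutput, expected = "main") {
  const actual = currentBranchOutput.trim();
  if (actual !== expected) {
    // `git branch --show-current` prints nothing on a detached HEAD.
    const shown = actual || "(detached HEAD)";
    throw new Error(`Refusing to branch: expected "${expected}" but HEAD is on "${shown}"`);
  }
  return actual;
}
```

Checking the branch name rather than comparing commit SHAs matches the rationale in the PR: after a squash-merge, a stale branch and `main` can hold equivalent content under different SHAs, so only the current-branch state is a reliable signal.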

## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI allow-list, JSON-LD differentiation, identity graph) and #70 (llms-full.txt). This is the Layer 1 content deliverable: a practitioner-facing guide on making any site legible to AI answer engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt**: root-level Markdown map (linked to the [llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt**: full-content concatenation for one-shot ingestion.
3. **robots.txt AI bot explicit allow-list**: linked to each vendor's own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation**: `TechArticle`, `HowTo`, `citation`, `@graph`; linked to [schema.org](https://schema.org/) and [Google's structured data docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions**: Bing Webmaster (AI Performance panel), Google Search Console, `llms.txt` directories, Perplexity Publishers.

Plus an explicit **What to Avoid** section: spam-for-AI (generating low-effort content to game citations) is called out as the failure mode that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice links to its primary source. A final **Inspiration** section acknowledges [Tw93's May 2026 post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis that triggered this handbook's own AEO baseline work, without making it a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content**: no hardcoded references to this project in the main body. Where the handbook's own implementation illustrates a practice, it appears in clearly marked callout blocks (for example, the `build.mjs` reference under the llms.txt section), not as the guide's main thread.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build`: 30 pages built (was 28, +2 for EN/ZH).
- Verified EN page emits `TechArticle` schema with author and publisher, matches other guides.
- Verified the new guide is indexed in `llms.txt` and included in `llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend interaction review does not apply.

## ROADMAP

Item will be checked off when this PR merges, following the same convention as other published guides.

wilsonwangdev force-pushed the infra/aeo-baseline branch from 2cf62a3 to 4d59309 · May 8, 2026 14:54

vercel Bot deployed to Preview · May 8, 2026 14:54

wilsonwangdev mentioned this pull request May 8, 2026

build: generate llms-full.txt for AI ingestion #70

Merged

wilsonwangdev merged commit 85d21cf into main May 8, 2026
3 checks passed

wilsonwangdev deleted the infra/aeo-baseline branch May 8, 2026 14:59

wilsonwangdev mentioned this pull request May 8, 2026

journal: record stale-branch base pollution from PR #69/#70 #71

Merged

wilsonwangdev mentioned this pull request May 8, 2026

content: add AI visibility / AEO guide (EN+ZH) #72

Merged

wilsonwangdev merged 1 commit into main from infra/aeo-baseline
