
infra: strengthen AEO baseline for AI answer engines#69

Merged
wilsonwangdev merged 1 commit into main from infra/aeo-baseline
May 8, 2026

@wilsonwangdev
Owner

Summary

Strengthens the site's Answer Engine Optimization (AEO / GEO) baseline so AI answer engines (ChatGPT, Claude, Perplexity, Google AI Overview, Bing Copilot) can discover, attribute, and cite handbook content more accurately.

Builds on #29 (SEO foundation) and #30 (discovery artifacts). Not a new direction — a deepening of existing work as AI answer engines become a material traffic channel.

Changes

robots.txt — explicit allow-list for 17 AI training and search crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Googlebot, Bingbot, Applebot, Applebot-Extended, DuckAssistBot, Meta-ExternalAgent, cohere-ai). Existing User-agent: * already allowed these implicitly — the explicit listing is an intent signal, not a permission change.
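
As an illustration of the explicit-listing idea, here is a minimal sketch of generating such a file. The function name, the shortened bot list, and the `/sitemap.xml` path are assumptions for the example, not taken from this repository's build.

```javascript
// Hypothetical subset: the PR lists 17 user-agents; only a few are repeated here.
const AI_USER_AGENTS = [
  "GPTBot",
  "OAI-SearchBot",
  "ClaudeBot",
  "PerplexityBot",
  "Google-Extended",
];

// Build robots.txt text: one explicit Allow group per bot, then the wildcard
// fallback group, then the sitemap line.
function buildRobotsTxt(siteUrl, userAgents = AI_USER_AGENTS) {
  const groups = userAgents.map((ua) => `User-agent: ${ua}\nAllow: /`);
  groups.push("User-agent: *\nAllow: /");
  const base = siteUrl.replace(/\/+$/, ""); // drop trailing slashes
  // Assumption: the sitemap lives at /sitemap.xml.
  return `${groups.join("\n\n")}\n\nSitemap: ${base}/sitemap.xml\n`;
}
```

Because the wildcard group already allows everything, the per-bot groups change nothing about access; they only make the intent visible to anyone reading the file.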

JSON-LD differentiation:

  • Concepts and guides emit TechArticle instead of generic Article.
  • Curated entries emit citation and isBasedOn pointing to the source URL from existing source: frontmatter, so AI citations can trace to primary sources.
  • All pages emit author (Person) and publisher (Organization).
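
The three rules above could be sketched roughly as follows. The PR mentions a SECTION_SCHEMA mapping but does not show it, so the mapping contents, the function name, and the `page` / `identity` field names here are assumptions.

```javascript
// Assumed mapping from section to schema.org type (the real one is not shown in the PR).
const SECTION_SCHEMA = {
  concepts: "TechArticle",
  guides: "TechArticle",
  curated: "Article",
  evangelism: "Article",
};

// Build a JSON-LD object for one page. `page` and `identity` are hypothetical shapes.
function buildArticleJsonLd(page, identity) {
  const jsonLd = {
    "@context": "https://schema.org",
    "@type": SECTION_SCHEMA[page.section] ?? "Article",
    headline: page.title,
    url: page.url,
    author: { "@type": "Person", name: identity.author.name },
    publisher: { "@type": "Organization", name: identity.publisherName },
  };
  // Curated entries point back to the primary source from `source:` frontmatter.
  if (page.section === "curated" && page.source) {
    jsonLd.citation = page.source;
    jsonLd.isBasedOn = page.source;
  }
  return jsonLd;
}
```

The result would typically be serialized with `JSON.stringify` into a `<script type="application/ld+json">` element in the page head.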

Homepage @graph: links WebSite, Organization, Person, and WebPage via @id references, so AI engines answering "who publishes Agent Master Handbook" or "who is the author" get structured ground truth.
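
A minimal sketch of such a graph, assuming fragment-style `@id` values (`#website`, `#organization`, and so on) and a hypothetical `identity` object; the properties used to link the nodes (`publisher`, `founder`, `isPartOf`, `author`) are one plausible choice, not necessarily the one this PR ships.

```javascript
// Build a homepage @graph whose nodes reference each other by @id.
function buildHomepageGraph(siteUrl, identity) {
  const base = siteUrl.replace(/\/+$/, "");
  // Hypothetical @id scheme: site URL plus a fragment per entity.
  const ids = {
    website: `${base}/#website`,
    org: `${base}/#organization`,
    person: `${base}/#person`,
    webpage: `${base}/#webpage`,
  };
  return {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "@id": ids.website, url: `${base}/`, name: identity.siteName, publisher: { "@id": ids.org } },
      { "@type": "Organization", "@id": ids.org, name: identity.siteName, founder: { "@id": ids.person } },
      { "@type": "Person", "@id": ids.person, name: identity.author.name },
      { "@type": "WebPage", "@id": ids.webpage, url: `${base}/`, isPartOf: { "@id": ids.website }, author: { "@id": ids.person } },
    ],
  };
}
```

Each entity is defined once and referenced elsewhere by `@id`, so a consumer can resolve "publisher of the site" or "author of the page" without duplicated or conflicting descriptions.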

package.json — adds author block as single source of truth for identity used by JSON-LD and llms.txt. No hardcoding in templates or build.mjs.

llms.txt — header now declares Author, Repository, and Homepage so the AI-readable index carries publisher identity on first read.
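
To illustrate the single-source-of-truth idea, the sketch below derives an llms.txt header from a parsed package.json object. The function name and the exact header layout are assumptions; `author`, `repository`, `homepage`, and `description` are standard package.json fields.

```javascript
// Build an llms.txt header from package.json data. `title` is supplied by the caller.
function buildLlmsTxtHeader(title, pkg) {
  // package.json allows `author` as a string or an object, and `repository`
  // as a string or an object with a `url` field.
  const author = typeof pkg.author === "string" ? { name: pkg.author } : (pkg.author ?? {});
  const repository = typeof pkg.repository === "string" ? pkg.repository : pkg.repository?.url;

  const lines = [`# ${title}`, ""];
  if (pkg.description) lines.push(`> ${pkg.description}`, "");
  if (author.name) lines.push(`- Author: ${author.name}${author.url ? ` (${author.url})` : ""}`);
  if (repository) lines.push(`- Repository: ${repository}`);
  if (pkg.homepage) lines.push(`- Homepage: ${pkg.homepage}`);
  return lines.join("\n") + "\n";
}
```

Passing the same `pkg` object to the JSON-LD builders keeps identity consistent across artifacts, which is the point of not hardcoding it in templates.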

AGENTS.md Quality Baseline — documents the new conventions so future agents enforce them (curated source: requirement, SECTION_SCHEMA mapping, robots.txt explicit-listing rule, identity source-of-truth rule).

Tested

  • SITE_URL=https://agent-master-handbook.vercel.app npm run build — 28 pages built, no errors.
  • Manually inspected JSON-LD on concept page (TechArticle + author + publisher), curated page (Article + citation + isBasedOn), and homepage (@graph with WebSite/Organization/Person/WebPage).
  • Verified robots.txt lists 17 AI user-agents + wildcard fallback + sitemap.

Not in this PR

  • llms-full.txt generation — separate PR (build/llms-full) to avoid mixing concerns.
  • External submissions (Bing Webmaster, llmstxt.cloud directory, Perplexity Publisher) — operational, tracked in journal.
  • AEO/GEO content guide — separate content PR.
  • build/build.mjs grew to 489 lines (past the ~300 ceiling noted in AGENTS.md). The JSON-LD block is cohesive enough to extract into build/seo.mjs — flagged as a follow-up refactor PR, not mixed into this one.

@vercel
Contributor

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agent-master-handbook | Ready | Preview, Comment | May 8, 2026 2:54pm |

- robots.txt explicitly allow-lists 17 AI training and search crawlers
  (GPTBot, ClaudeBot, OAI-SearchBot, Claude-SearchBot, PerplexityBot,
  Google-Extended, etc.) as an intent signal, not just permission.
- JSON-LD differentiates by section: concepts/guides emit TechArticle,
  curated/evangelism emit Article. Curated pages now emit citation and
  isBasedOn pointing to the source URL from frontmatter.
- Homepage emits an @graph linking WebSite, Organization, Person, and
  WebPage so AI engines can answer "who publishes X" without guessing.
- package.json gains an author block as single source of truth for
  identity in JSON-LD and llms.txt.
- llms.txt header now declares Author, Repository, and Homepage so the
  AI-readable index carries publisher identity.
- AGENTS.md Quality Baseline documents the new conventions so future
  agents enforce them.
wilsonwangdev merged commit 85d21cf into main May 8, 2026
3 checks passed
wilsonwangdev deleted the infra/aeo-baseline branch May 8, 2026 14:59
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Adds `/llms-full.txt` — a single markdown document concatenating all
published English pages, optimized for AI crawlers that prefer one-shot
ingestion over link-following.

Complements #69 (llms.txt as link index, JSON-LD, robots.txt AI
allow-list, merged) by giving AI answer engines a single 43KB file
containing the full handbook body.

## Why this exists

The [llms.txt specification](https://llmstxt.org/) and the conventions that have grown up around it give two variants:
- `llms.txt` — concise link index (already shipped in #69)
- `llms-full.txt` — full content, one document

Answer engines with constrained scrape budgets (ChatGPT search,
Perplexity, Bing Copilot) can benefit from the full version when it is
available, because it avoids per-page HTTP overhead and ambiguous
ranking across small pages.

## Contents of the generated file

- Header with site identity (author, repo, homepage, link back to
`llms.txt` index)
- Only pages with `status: published` — drafts are excluded from AI
indexes
- Ordered by section (concepts → guides → curated → evangelism) then
title
- Per-page structured header before body: section, canonical URL,
last-updated, summary
- Source H1 stripped (replaced by the structured `## <title>` anchor) to
avoid duplicate headings
- Separators (`---`) between pages so AI can parse boundaries

Size: 43KB for current 12 published EN pages. Well under any practical
scrape-budget threshold; will grow roughly linearly with published
content.
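
The contents listed above can be sketched as a single function. This is an illustration under assumptions, not the shipped `buildLlmsFullTxt`: the function name, the page field names (`status`, `section`, `title`, `url`, `updated`, `summary`, `body`), and the per-page header layout are all hypothetical.

```javascript
const SECTION_ORDER = ["concepts", "guides", "curated", "evangelism"];

// Concatenate published pages into one markdown document.
function sketchLlmsFullTxt(pages, header) {
  const entries = pages
    .filter((p) => p.status === "published") // drafts are excluded
    .sort(
      (a, b) =>
        SECTION_ORDER.indexOf(a.section) - SECTION_ORDER.indexOf(b.section) ||
        a.title.localeCompare(b.title)
    )
    .map((p) => {
      // Strip a leading source H1 so it does not duplicate the `## <title>` anchor.
      const body = p.body.replace(/^\s*# .*\r?\n+/, "").trim();
      return [
        `## ${p.title}`,
        "",
        `- Section: ${p.section}`,
        `- URL: ${p.url}`,
        `- Last updated: ${p.updated}`,
        `- Summary: ${p.summary}`,
        "",
        body,
      ].join("\n");
    });
  // `---` separators mark page boundaries.
  return [header.trim(), ...entries].join("\n\n---\n\n") + "\n";
}
```

`filter` returns a new array, so the in-place `sort` does not reorder the caller's `pages`. Sections missing from `SECTION_ORDER` would sort first in this sketch.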

## Implementation notes

- `buildPages` now returns the raw body per page (new `body` field on
the `pages` array). Zero extra I/O — we already have it in memory during
page rendering.
- `buildLlmsFullTxt` lives next to `buildLlmsTxt` and runs in the same
parallel `Promise.all` batch in `main`.
- `llms.txt` header already advertises `llms-full.txt` (shipped via #69)
so crawlers starting from the index discover it.
- AGENTS.md documents the status-filter contract so future agents know
drafts are excluded by design.
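
As a rough picture of the "same parallel batch" note, the sketch below runs several artifact builders over one in-memory `pages` array with `Promise.all`. The `writeArtifacts` name, the `builders` map, and the injected `writeFile` callback are hypothetical and do not mirror the actual `main` in `build.mjs`.

```javascript
// `builders` maps an output filename to a function that turns the in-memory
// pages array into file contents; `writeFile(file, contents)` persists one file.
async function writeArtifacts(pages, builders, writeFile) {
  await Promise.all(
    Object.entries(builders).map(async ([file, build]) => {
      // Every builder reads the same pages array, so no page is re-read from disk.
      await writeFile(file, await build(pages));
    })
  );
}
```

In a real build, `writeFile` might wrap `fs.promises.writeFile` with an output directory; it is injected here so the sketch stays free of I/O.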

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build` — 28
pages built, 43,859 byte llms-full.txt.
- Verified: 12 published pages included (2 concepts + 5 guides + 4
curated + 1 evangelism), correct section ordering, structured headers on
each entry, no duplicate H1s after source-H1 strip.
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closeout for the AEO baseline workstream (#69, #70). Captures the
process failure that surfaced mid-session and updates the pre-work
checklist to prevent recurrence.

## Why this exists

While shipping #69 and #70, both PRs initially carried two orphan
commits from a stale local branch whose contents had already been
squash-merged to `main` as #68. Because a squash-merge creates a new
commit with a different SHA, neither `git fetch -p` nor a SHA comparison
revealed that the branch was already merged. The reviewer caught it;
both PRs were rebased onto `origin/main` and force-pushed.

This journal entry records the third PR-hygiene failure. The pattern
across all three: the rules exist, but the procedural trigger points
where they must fire are under-specified.

## Changes

- `journal/2026-05-08-stale-branch-base-pollution.md` — what happened,
root cause, fix, evolution notes (EN+ZH).
- `AGENTS.md` Git Workflow — new pre-work step 9: explicitly verify `git
branch --show-current` returns `main` right before `git checkout -b`.
Squash-merge makes stale-branch detection unreliable via SHA; this is a
terminal-state assertion rather than a sequence of actions.
- `ROADMAP.md` — check off #67, #69, #70; add "AI visibility / AEO
content guide" as the next content item (the planned Layer 1 guide that
closes this workstream).
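
The new pre-work check could be automated along these lines. This is a sketch, not part of the PR: the function names are hypothetical, and only `git branch --show-current` comes from the AGENTS.md step described above.

```javascript
// Terminal-state assertion: fail unless the current branch is the expected base.
function assertOnMain(currentBranch, expected = "main") {
  const branch = currentBranch.trim();
  if (branch !== expected) {
    // `git branch --show-current` prints nothing on a detached HEAD.
    const shown = branch || "(detached HEAD)";
    throw new Error(`Expected to be on "${expected}" before branching, but on "${shown}"`);
  }
  return branch;
}

// Read the current branch name from git (requires git on PATH).
async function readCurrentBranch() {
  const { execFileSync } = await import("node:child_process");
  return execFileSync("git", ["branch", "--show-current"], { encoding: "utf8" });
}
```

A pre-work script would call `assertOnMain(await readCurrentBranch())` immediately before `git checkout -b`, so the check depends on where the working tree actually is rather than on which earlier steps were run.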

## Manual follow-ups (tracked separately, not in this PR)

External submissions that require login and cannot be automated from the
repo:
- [directory.llmstxt.cloud](https://directory.llmstxt.cloud) /
[llmstxt.site](https://llmstxt.site) /
[llms-txt-hub](https://github.com/thedaviddias/llms-txt-hub) — submit
llms.txt URL
- Bing Webmaster Tools — register + submit sitemap (unlocks AI
Performance panel with citation counts)
- Google Search Console — verify AI Overview citation visibility
- [Perplexity Publisher Program](https://pplx.ai/publisher-program) —
submit site for source consideration

These will be handled out-of-band. The handbook's technical readiness
for them shipped in #69 and #70.
wilsonwangdev added a commit that referenced this pull request May 8, 2026
## Summary

Closes the AEO workstream started in #69 (baseline: robots.txt AI
allow-list, JSON-LD differentiation, identity graph) and #70
(llms-full.txt). This is the Layer 1 content deliverable — a
practitioner-facing guide on making any site legible to AI answer
engines.

## What the guide covers

Five signals and the canonical source for each:

1. **llms.txt** — root-level Markdown map (linked to the
[llmstxt.org](https://llmstxt.org/) spec by Jeremy Howard).
2. **llms-full.txt** — full-content concatenation for one-shot
ingestion.
3. **robots.txt AI bot explicit allow-list** — linked to each vendor's
own bot documentation (OpenAI, Anthropic, Perplexity, Google).
4. **JSON-LD differentiation** — `TechArticle`, `HowTo`, `citation`,
`@graph`; linked to [schema.org](https://schema.org/) and [Google's
structured data
docs](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data).
5. **Submissions** — Bing Webmaster (AI Performance panel), Google
Search Console, `llms.txt` directories, Perplexity Publishers.

Plus an explicit **What to Avoid** section: spam-for-AI (generating
low-effort content to game citations) is called out as the failure mode
that this discipline is sometimes confused with.

## Source traceability

Per the handbook's source-traceability rule, every specific practice
links to its primary source. A final **Inspiration** section
acknowledges [Tw93's May 2026
post](https://x.com/HiTw93/status/2049868069208768812) as the synthesis
that triggered this handbook's own AEO baseline work, without making it
a substitute for the canonical sources.

## Layer positioning (SPEC 006)

**Layer 1 Universal Content** — no hardcoded references to this project
in the main body. Where the handbook's own implementation illustrates a
practice, it appears in clearly marked callout blocks (for example, the
`build.mjs` reference under the llms.txt section), not as the guide's
main thread.

## Tested

- `SITE_URL=https://agent-master-handbook.vercel.app npm run build` — 30
pages built (was 28, +2 for EN/ZH).
- Verified EN page emits `TechArticle` schema with author and publisher,
matches other guides.
- Verified the new guide is indexed in `llms.txt` and included in
`llms-full.txt` (which grew from 43KB to 56KB).
- No CSS, template, or interactive component changes, so frontend
interaction review does not apply.

## ROADMAP

Item will be checked off when this PR merges, following the same
convention as other published guides.