Scribe Platform Architecture

Last updated: 2026-03-09


System Overview

graph TD
    User("🌐 User Browser")

    User -->|"Sign in and access dashboard"| AppPlatform

    AppPlatform -->|"Write structured job request"| JobsCollection
    AppPlatform -->|"Read and write site config"| SitesUsers
    AppPlatform -.->|"Poll article status every 2.5s"| ArticlesCollection
    ArticlesCollection -.->|"Serve published content"| ContentSite

    subgraph MongoDB["🗄️ MongoDB Atlas"]
        JobsCollection("Jobs Collection")
        ArticlesCollection("Articles Collection")
        SitesUsers("Sites and Users Collections")
    end

    subgraph OpenClaw["🤖 OpenClaw Engine"]
        ScribeWalker("Scribe Walker, Agent Orchestration")
        ClaudeOpus("🧠 Claude Opus 4.6")
        DallE("🎨 DALL-E 3")
        WebResearch("🔍 Web Research")
    end

    JobsCollection -->|"Poll and pick up pending jobs"| ScribeWalker

    subgraph Vercel["☁️ Vercel Platform (Single Project)"]
        ContentSite("tryscribe.co — Marketing + Blog Subfolders")
        AppPlatform("app.tryscribe.co — Dashboard + API")
    end

    ScribeWalker -->|"Generate SEO articles"| ClaudeOpus
    ScribeWalker -->|"Generate featured images"| DallE
    ScribeWalker -->|"Research trending topics"| WebResearch
    ScribeWalker -->|"Write completed articles"| ArticlesCollection

    style User fill:#b3e5fc,stroke:#333,stroke-width:3px,color:#000
    style Vercel fill:#fff8f0,stroke:#d4920a,stroke-width:2px,color:#000
    style AppPlatform fill:#fff,stroke:#d4920a,stroke-width:3px,color:#000
    style ContentSite fill:#fff,stroke:#7a8c6e,stroke-width:2px,color:#000
    style MongoDB fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,color:#000
    style JobsCollection fill:#c8e6c9,stroke:#333,stroke-width:2px,color:#000
    style ArticlesCollection fill:#c8e6c9,stroke:#333,stroke-width:2px,color:#000
    style SitesUsers fill:#c8e6c9,stroke:#333,stroke-width:2px,color:#000
    style OpenClaw fill:#fff3e0,stroke:#d4920a,stroke-width:2px,color:#000
    style ScribeWalker fill:#ffe0b2,stroke:#d4920a,stroke-width:3px,color:#000
    style ClaudeOpus fill:#f5f5f5,stroke:#333,stroke-width:2px,color:#000
    style DallE fill:#f5f5f5,stroke:#333,stroke-width:2px,color:#000
    style WebResearch fill:#f5f5f5,stroke:#333,stroke-width:2px,color:#000

    linkStyle default interpolate basis

Flow: User signs in → App writes structured job → OpenClaw polls and picks it up → Scribe Walker generates articles autonomously → Dashboard polls and displays results in real-time

Color Legend: 🔵 Light blue = User entry point · 🟠 Amber = Vercel platform · 🟢 Green = MongoDB data layer · 🟡 Warm = OpenClaw engine · ⚪ Gray = AI tools

Core Principle: MongoDB is the ONLY bridge between the app and OpenClaw. They never communicate directly. The app writes structured job requests; OpenClaw picks them up and executes autonomously.


Components

1. App (Next.js on Vercel)

  • Content URL: tryscribe.co — marketing site + blog subfolders
  • Dashboard URL: app.tryscribe.co — auth, onboarding, dashboard, billing
  • Deployment: Single Vercel project serves both domains
  • Role: User-facing platform — auth, onboarding, dashboard, billing, blog content
  • Responsibilities:
    • User authentication (NextAuth: magic link + Google OAuth)
    • Onboarding flow (brand name, niche, location)
    • Writing job requests to MongoDB
    • Polling article status and displaying results
    • Serving blog content via subfolders (tryscribe.co/{brand}/{slug})
    • Marketing site at root (tryscribe.co/)
    • Stripe billing and usage tracking
  • Does NOT: Generate articles, call AI APIs, run any agent logic

URL Architecture (Subfolder Model — Migrated Mar 9, 2026)

Why subfolders over subdomains:

  • tryscribe.co is a new domain with zero authority
  • Every article under tryscribe.co/{brand}/ consolidates keyword and backlink authority on the root domain
  • Subdomains (brand.tryscribe.co) would scatter SEO value across isolated domains
  • Sources: Cloudflare, Ahrefs, Semrush all lean subfolder for new domains

URL Structure:

| URL | Purpose |
| --- | --- |
| tryscribe.co/ | Marketing site (static HTML served via middleware rewrite) |
| tryscribe.co/{brand}/ | Brand blog home (e.g., tryscribe.co/sallys-spa) |
| tryscribe.co/{brand}/{slug} | Individual article page |
| app.tryscribe.co/dashboard | User dashboard |
| app.tryscribe.co/onboarding | New user onboarding |

Middleware Routing (src/middleware.ts):

  1. Legacy subdomain requests (brand.tryscribe.co) → 301 redirect to tryscribe.co/{brand}/
  2. App routes on content domain (tryscribe.co/dashboard) → 302 redirect to app.tryscribe.co/dashboard
  3. Root path on content domain (tryscribe.co/) → rewrite to /marketing.html
  4. Brand slug detection → rewrite /{brand}/{slug} to internal /blog/{brand}/{slug} route
  5. Reserved paths (api, auth, _next, etc.) pass through unchanged

Internal Route Structure: Blog pages live at src/app/blog/[subdomain]/ internally (the subdomain param name is kept for backward compatibility but represents the brand slug in the subfolder URL).
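
The routing rules can be modelled as a pure function, which makes the ordering easier to reason about. This is a minimal, framework-free sketch, not the contents of src/middleware.ts: `decideRoute`, `RouteDecision`, and the two path lists are hypothetical names, and checking reserved paths before brand detection is an assumption made so that a path like /api is never read as a brand slug.

```typescript
// Hypothetical model of the middleware rules; the real file runs inside Next.js.
type RouteDecision =
  | { kind: "redirect"; status: 301 | 302; location: string }
  | { kind: "rewrite"; target: string }
  | { kind: "pass" };

const CONTENT_HOST = "tryscribe.co";
const APP_HOST = "app.tryscribe.co";
// Assumed lists: the document names "api, auth, _next, etc." and two app routes.
const RESERVED_PREFIXES = ["api", "auth", "_next"];
const APP_ROUTES = ["dashboard", "onboarding"];

function decideRoute(host: string, path: string): RouteDecision {
  const segments = path.split("/").filter((s) => s.length > 0);
  const first = segments[0];

  // Rule 1: legacy brand subdomain, permanent redirect to the subfolder URL.
  if (host.endsWith(`.${CONTENT_HOST}`) && host !== APP_HOST) {
    const brand = host.slice(0, -(CONTENT_HOST.length + 1));
    return { kind: "redirect", status: 301, location: `https://${CONTENT_HOST}/${brand}/` };
  }
  // The dashboard host serves its own routes unchanged.
  if (host !== CONTENT_HOST) return { kind: "pass" };

  // Rule 5: reserved paths pass through (checked before brand detection).
  if (first !== undefined && RESERVED_PREFIXES.includes(first)) return { kind: "pass" };
  // Rule 2: app routes on the content domain, temporary redirect to the app host.
  if (first !== undefined && APP_ROUTES.includes(first)) {
    return { kind: "redirect", status: 302, location: `https://${APP_HOST}${path}` };
  }
  // Rule 3: root path, rewrite to the static marketing page.
  if (first === undefined) return { kind: "rewrite", target: "/marketing.html" };
  // Rule 4: anything else is treated as /{brand} or /{brand}/{slug}.
  return { kind: "rewrite", target: `/blog/${segments.join("/")}` };
}
```

In this sketch a request for tryscribe.co/sallys-spa/some-article is rewritten to /blog/sallys-spa/some-article, which matches the internal route structure described above.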

Domain Separation:

  • tryscribe.co = content site only (marketing + blog articles). Dashboard routes redirect to app.tryscribe.co.
  • app.tryscribe.co = dashboard + API. Content updates here don't risk breaking the content site.
  • Both served from the same Vercel project with middleware-based routing.

2. MongoDB Atlas

  • Cluster: ScribeCluster (currently M0 Free, AWS us-east-1)
  • Role: Shared data layer and job queue
  • Collections:
    • users — user accounts, plans, referrals
    • sites — brand configurations (niche, location, subdomain)
    • articles — generated content (status: generating/published/failed)
    • jobs — article generation job queue (NEW)
    • sessions, accounts — NextAuth session management

3. OpenClaw Instance (Scribe Walker)

  • Current host: Mac Mini (local development)
  • Future host: Linux VM (production)
  • Role: AI orchestration engine — the brains
  • Responsibilities:
    • Polling jobs collection for pending work
    • Running Scribe Walker agent sessions for each job
    • Research, writing, image generation, quality checks
    • Writing completed articles to articles collection
    • Updating job status (pending → processing → complete/failed)
  • Does NOT: Serve web traffic, handle user auth, manage billing

Job Queue Protocol

Job Schema

interface Job {
  _id: ObjectId;
  
  // Who requested it
  userId: ObjectId;
  siteId: ObjectId;
  
  // What to generate
  action: "generate" | "rewrite" | "refresh";  // Typed enum (see Allowed Actions), no freeform actions
  params: {                    // Shape shown applies to "generate" and "refresh"
    brandName: string;         // From site record
    niche: string;             // From site record
    location?: string;         // From site seoConfig
    tone?: string;             // "professional" | "casual" | "authoritative"
    count: number;             // Number of articles (default: 3)
    topicStyles: string[];     // ["how-to", "tips", "why", "listicle", "guide"]
  };

  // Job lifecycle
  status: "pending" | "processing" | "complete" | "failed";
  priority: number;            // Lower = higher priority (default: 10)
  attempts: number;            // Retry count (default: 0)
  maxAttempts: number;         // Max retries (default: 3)
  
  // Results
  articleIds: ObjectId[];      // Populated as articles are created
  error?: string;              // Error message if failed
  
  // Timestamps
  createdAt: Date;
  startedAt?: Date;
  completedAt?: Date;
}

Allowed Actions (Typed Enum)

Only these actions are valid. OpenClaw rejects anything else:

| Action | Description | Params |
| --- | --- | --- |
| generate | Generate new articles for a site | brandName, niche, location, tone, count, topicStyles |
| rewrite | Rewrite an existing article | articleId, instructions (from predefined set) |
| refresh | Generate more articles for existing site | same as generate |

No freeform prompts. No shell commands. No tool instructions. The job contains DATA, not INSTRUCTIONS. OpenClaw constructs its own prompts internally using SCRIBE-WALKER-CONTEXT.md and its agent reasoning.

Job Lifecycle Flow

1. User clicks "Summon Your Scribe ✒️"
2. App validates user auth + plan limits
3. App creates Job doc (status: "pending")
4. App creates placeholder Article docs (status: "generating")
5. App returns immediately — dashboard starts polling articles

6. OpenClaw polls jobs collection (every 5-10 seconds)
7. Picks up pending job, sets status: "processing", sets startedAt
8. Spawns Scribe Walker session with structured params
9. Scribe Walker:
   a. Researches relevant topics for the niche/location
   b. Writes SEO-optimized articles (Claude Opus 4.6)
   c. Generates DALL-E 3 featured images
   d. Quality checks (word count, SEO meta, no em dashes, etc.)
10. Updates Article docs: content, SEO meta, images, status: "published"
11. Updates Job doc: status: "complete", completedAt

12. Dashboard polling picks up published articles in real-time
13. User sees articles appear one by one (2-3 second poll interval)
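
The status changes in this flow can be guarded with a small transition table. This is a sketch under stated assumptions: the document defines the four statuses and a retry cap, but it does not say how a retry is scheduled, so returning a job to "pending" after a failed attempt is an assumption, as are the names `canTransition` and `statusAfterError`. Here `attempts` is taken to count attempts started so far, including the one that just failed.

```typescript
type JobStatus = "pending" | "processing" | "complete" | "failed";

// Which status changes are accepted. "processing" -> "pending" models a retry (assumed).
const ALLOWED_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  pending: ["processing"],
  processing: ["complete", "failed", "pending"],
  complete: [],
  failed: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Decide where a job goes after an attempt throws an error.
function statusAfterError(attempts: number, maxAttempts: number): JobStatus {
  return attempts < maxAttempts ? "pending" : "failed";
}
```

With the default maxAttempts of 3, the first and second failures send the job back to the queue and the third marks it as permanently failed.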

Security Model

Threat: Compromised MongoDB Credentials

If an attacker gains access to the app's MongoDB connection string, they could write malicious jobs.

Mitigation 1: Strict Schema Validation

OpenClaw validates every job against the typed schema before processing:

  • action must be in the allowed enum
  • params must match the expected shape for that action
  • All string fields have max length limits
  • No nested objects beyond one level
  • Any invalid job is rejected and logged as a security event
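
A minimal sketch of such a validator follows. It is illustrative only: `validateJobShape` and `MAX_STRING_LENGTH` are hypothetical, the 200-character limit is an assumed value (the document only says limits exist), and the per-action checks cover just the fields named in the Allowed Actions table.

```typescript
const ALLOWED_ACTIONS = ["generate", "rewrite", "refresh"] as const;
type AllowedAction = (typeof ALLOWED_ACTIONS)[number];

const MAX_STRING_LENGTH = 200; // assumed limit

type ValidationResult = { ok: true; action: AllowedAction } | { ok: false; reason: string };

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function validateJobShape(raw: unknown): ValidationResult {
  if (!isPlainObject(raw)) return { ok: false, reason: "job is not an object" };

  const action = raw["action"];
  const allowed = ALLOWED_ACTIONS.find((a) => a === action);
  if (allowed === undefined) return { ok: false, reason: "action not in allowed enum" };

  const params = raw["params"];
  if (!isPlainObject(params)) return { ok: false, reason: "params is not an object" };

  for (const [key, value] of Object.entries(params)) {
    // Params may hold primitives or flat arrays, never nested objects.
    if (isPlainObject(value)) return { ok: false, reason: `params.${key} is nested too deeply` };
    const items: unknown[] = Array.isArray(value) ? value : [value];
    for (const item of items) {
      if (isPlainObject(item) || Array.isArray(item)) {
        return { ok: false, reason: `params.${key} is nested too deeply` };
      }
      if (typeof item === "string" && item.length > MAX_STRING_LENGTH) {
        return { ok: false, reason: `params.${key} exceeds max length` };
      }
    }
  }

  if (allowed === "rewrite") {
    if (typeof params["articleId"] !== "string") {
      return { ok: false, reason: "rewrite requires articleId" };
    }
  } else {
    if (typeof params["brandName"] !== "string" || typeof params["niche"] !== "string") {
      return { ok: false, reason: "brandName and niche are required" };
    }
    const count = params["count"];
    if (typeof count !== "number" || !Number.isInteger(count) || count < 1) {
      return { ok: false, reason: "count must be a positive integer" };
    }
  }
  return { ok: true, action: allowed };
}
```

A caller would log the `reason` of any rejected job as a security event and leave the job unprocessed.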

Mitigation 2: No Prompt Passthrough

The job never contains prompts, instructions, or commands for the agent. OpenClaw uses the structured data fields (brandName, niche, location) to fill in its OWN hardcoded workflow. The agent's behavior is defined by SCRIBE-WALKER-CONTEXT.md, not by job data.

Think of it as: generateArticles(niche="plumbing", location="Salt Lake City") — a function call with typed parameters.

Mitigation 3: Separate DB Users

  • App DB user: Write access to jobs only. Read access to articles, sites, users. No access to system collections.
  • OpenClaw DB user: Full access to jobs, articles. Read access to sites, users.
  • Even with compromised app credentials, attacker cannot modify articles or users directly.

Mitigation 4: Rate Limiting

  • Per-user: Max 5 jobs per hour (configurable per plan)
  • Global: Max 20 concurrent processing jobs
  • Retry cap: Max 3 attempts per job, then permanent failure
  • Enforced at both app level (before writing) and OpenClaw level (before processing)
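
The two limits can be expressed as small pure checks that both the app and OpenClaw could call. This is a sketch, not the shipped logic: `canSubmitJob` and `canStartProcessing` are hypothetical names, and using a sliding one-hour window (rather than a fixed clock hour) is an assumption.

```typescript
const RATE_WINDOW_MS = 60 * 60 * 1000; // one hour

// Per-user limit: true if the user has fewer than maxPerHour jobs in the last hour.
function canSubmitJob(recentJobTimes: Date[], now: Date, maxPerHour = 5): boolean {
  const windowStart = now.getTime() - RATE_WINDOW_MS;
  const inWindow = recentJobTimes.filter((t) => t.getTime() > windowStart).length;
  return inWindow < maxPerHour;
}

// Global limit: true if another job may move to "processing".
function canStartProcessing(currentlyProcessing: number, maxConcurrent = 20): boolean {
  return currentlyProcessing < maxConcurrent;
}
```

The `maxPerHour` default mirrors the 5 jobs per hour stated above; a per-plan value would be passed in by the caller.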

Mitigation 5: Job Signing (Phase 2)

  • App signs each job with HMAC-SHA256 using a shared secret
  • signature = HMAC(jobId + siteId + action + timestamp, SECRET)
  • OpenClaw verifies signature before processing
  • Unsigned or invalid-signature jobs are rejected
  • Protects against direct DB manipulation even with full DB access
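
A minimal sketch of the signing scheme using Node's built-in crypto module. The field concatenation follows the formula above; `SignableJob`, `signJob`, and `verifyJobSignature` are hypothetical names, and hex encoding of the signature and a string timestamp are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SignableJob {
  jobId: string;     // hex string form of the job's ObjectId
  siteId: string;
  action: string;
  timestamp: string; // assumed to be a string such as an ISO 8601 creation time
}

// App side: compute HMAC-SHA256 over the concatenated fields.
function signJob(job: SignableJob, secret: string): string {
  const payload = job.jobId + job.siteId + job.action + job.timestamp;
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// OpenClaw side: recompute and compare in constant time.
function verifyJobSignature(job: SignableJob, signature: string, secret: string): boolean {
  const expected = Buffer.from(signJob(job, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Because the signed fields include the action, changing a stored job from one action to another invalidates its signature. The params are not part of the formula above, so this sketch does not protect them.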

Additional Security

  • OpenClaw instance is NOT publicly accessible — no open ports, no API endpoints
  • Only outbound connections: OpenClaw connects TO MongoDB, Anthropic, OpenAI. Nothing connects TO OpenClaw.
  • Job expiry: Jobs older than 1 hour auto-expire (prevents queue poisoning)
  • Audit log: All job state transitions logged with timestamps

Scaling Path

Phase 1: Local Mac Mini (Current — MVP/Beta)

  • Single OpenClaw instance on Taha's Mac Mini
  • Handles 10 beta users easily
  • Scribe Walker already proven (1,285+ articles)
  • Limitation: tied to local machine uptime

Phase 2: AWS EC2 (Production Launch — In Progress)

  • AWS EC2 t3.small (us-east-1), Ubuntu 24.04 LTS
  • Scribe Walker as main OpenClaw agent (not sub-agent)
  • OpenClaw gateway service (systemd, loopback)
  • Prompt intelligence in worker/prompts/*.js (version-controlled)
  • Worker routing (#67) for parallel testing with Mac Mini
  • Cost: ~$20/mo (t3.small)

Phase 3: Multi-Instance (Scale)

  • Multiple OpenClaw instances polling the same job queue
  • MongoDB's findOneAndUpdate with atomic status transitions prevents double-processing
  • Each instance picks up different jobs — natural load balancing
  • Can scale horizontally by adding VMs
  • Trigger: when single instance can't keep up with job volume
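
The atomic claim can be sketched as a helper that builds the arguments for a single findOneAndUpdate call. This is illustrative only: `buildClaimOperation` is a hypothetical name, and the retry filter, the one-hour expiry filter, and the sort order are assumptions drawn from the schema defaults and the job-expiry rule elsewhere in this document.

```typescript
interface ClaimOperation {
  filter: Record<string, unknown>;
  update: Record<string, unknown>;
  options: { sort: Record<string, 1 | -1>; returnDocument: "after" };
}

const JOB_EXPIRY_MS = 60 * 60 * 1000; // jobs older than 1 hour are ignored

function buildClaimOperation(now: Date): ClaimOperation {
  return {
    filter: {
      status: "pending",
      createdAt: { $gte: new Date(now.getTime() - JOB_EXPIRY_MS) },
      // Only claim jobs that still have attempts left.
      $expr: { $lt: ["$attempts", "$maxAttempts"] },
    },
    update: {
      // The match on status "pending" and this $set happen in one atomic step,
      // so two instances cannot both claim the same job.
      $set: { status: "processing", startedAt: now },
      $inc: { attempts: 1 },
    },
    // Lower priority number first, then oldest first.
    options: { sort: { priority: 1, createdAt: 1 }, returnDocument: "after" },
  };
}

// With the official Node.js driver this would be used roughly as:
//   const { filter, update, options } = buildClaimOperation(new Date());
//   const claimed = await jobs.findOneAndUpdate(filter, update, options);
// If no document matched, there was no pending job for this instance to take.
```

Each instance runs the same call in its poll loop; whichever instance matches a job first flips it to "processing", and the others no longer see it as pending.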

Why Not Mac VM?

  • Mac VMs are expensive ($100-200+/mo via MacStadium/AWS)
  • Scribe's article generation doesn't need macOS-specific features
  • No iMessage, no Apple Contacts, no macOS UI automation needed
  • Linux gives us everything: Node.js, headless browser, API access
  • Decision: Linux VM for production

Scribe Walker Integration

What Makes Scribe Walker Output Great

The quality comes from the agentic orchestration, not just the model:

  1. Research phase — Agent browses web, checks trends, finds angles
  2. Topic differentiation — Checks existing articles to avoid duplicates
  3. Writing with reasoning — Claude Opus reasons about structure, SEO, audience
  4. Image matching — Agent crafts DALL-E prompts that specifically match article content
  5. Quality gate — Self-checks word count, SEO meta completeness, no banned patterns
  6. Context awareness — Uses SCRIBE-WALKER-CONTEXT.md for consistent style/rules

Replicating for Multi-Tenant

  • Each job spawns an isolated Scribe Walker session (sub-agent)
  • Session receives: brand context (name, niche, location) + SCRIBE-WALKER-CONTEXT.md base rules
  • Sessions are isolated — one user's generation doesn't affect another's
  • Single agent, multiple sessions — not multi-agent (simpler, sufficient for MVP)

Context Transfer Checklist (for VM migration)

  • seo/SCRIBE-WALKER-CONTEXT.md — writing rules, quality gates, image procedures
  • OpenClaw config (openclaw.json) — agent settings, auth profiles
  • Anthropic auth (setup-token or API key)
  • OpenAI API key (for DALL-E)
  • MongoDB connection string
  • Any learned patterns from memory/ files relevant to article quality

Open Questions

  1. Polling interval: How often should OpenClaw check for new jobs? 5s? 10s? Webhook-triggered?
  2. Article count per plan: Free tier gets 10+5/mo — do we enforce this at app level, OpenClaw level, or both?
  3. Concurrent generation: Should we limit to 1 job at a time per instance, or allow parallel sessions?
  4. Error handling UX: What does the user see if generation fails? Auto-retry? Manual retry button?
  5. Image storage: Resolved — Vercel Blob CDN with sharp JPEG Q85 compression (#51)
  6. Research depth: Resolved — Tiered: Free = evergreen only, Pro = seasonal, Scale = web research
  7. Subdomain SSL: Resolved — Migrated to subfolder model (Mar 9, 2026). No wildcard certs needed.

Decision Log

| Date | Decision | Rationale |
| --- | --- | --- |
| 2026-03-04 | MongoDB as job queue (not REST API) | Decoupled, no direct access to OpenClaw, easier to scale |
| 2026-03-04 | Typed job schema, no prompt passthrough | Security: prevents command injection via DB |
| 2026-03-04 | Linux VM over Mac VM for production | Cheaper, sufficient features, Scribe doesn't need macOS |
| 2026-03-04 | Single agent, multiple sessions | Simpler than multi-agent, sufficient for MVP scale |
| 2026-03-04 | Claude Opus 4.6 for all tiers | Quality first, cost modeling later |
| 2026-03-04 | DALL-E 3 for featured images | Proven quality from 1,285+ articles on tahaabbasi.com |
| 2026-03-07 | Scribe Walker as main agent on EC2 | Full OpenClaw lifecycle (compaction, hooks, model updates) without custom plumbing |
| 2026-03-07 | Migration + evolution, not 1:1 copy | Mac prototype proven; EC2 must incorporate all quality intelligence patterns |
| 2026-03-07 | Prompt modules as centralized intelligence | worker/prompts/*.js = single source of truth for quality rules across article gen + image regen |
| 2026-03-07 | System-level systemd for gateway | openclaw gateway install fails over SSH; manual unit file matches Hetzner docs pattern |
| 2026-03-07 | 80% evergreen / 20% seasonal-timely | Evergreen is the backbone for local business SEO; trending = "relevant now" not news slop |
| 2026-03-07 | Exact/near-exact dedup only (no contextual) | Contextual dedup backfires: businesses WANT multiple articles on same topic from different angles |
| 2026-03-07 | Dedup window scales with plan | Free=all(15), Pro=all(50), Scale=last 100, Agency=configurable |
| 2026-03-07 | Rename Business tier to Scale (#64) | Better name for the 150 articles/mo tier |
| 2026-03-07 | Services field for Pro+ only (#65) | Free stays frictionless; Pro+ gets targeted articles via services list |
| 2026-03-07 | Prompt modules stay in JS files | Security > hot-reload. Deploy = git pull + restart. No DB-stored prompts. |
| 2026-03-07 | Stateless agent (no persistent memory) | Consistent with proven Mac Mini pattern. Each job independent. |
| 2026-03-07 | Worker routing for migration testing (#67) | Default = EC2, ?worker=local = Mac Mini. Temporary. |
| 2026-03-07 | IndexNow with dev mode gate (#66) | Submit for subdomains, defer custom domains, never submit in dev/test |
| 2026-03-09 | Subfolder model over subdomains (#80) | New domain needs consolidated SEO authority; every article under tryscribe.co/{brand}/ strengthens root domain |
| 2026-03-09 | Separate content site from dashboard | tryscribe.co = content + marketing, app.tryscribe.co = dashboard. Same Vercel project, middleware-separated. Dashboard deploys don't risk content site. |
| 2026-03-09 | No 301 redirects for old subdomains (test data) | Test sites only, no shared links exist. Legacy subdomain middleware handles any stray hits with 301. |

Scribe Walker Agent Architecture

Added: 2026-03-07 — Documents the evolution from prototype to production


Origins: Mac Mini Prototype

The Scribe Walker concept was proven on Taha Abbasi's Mac Mini, where it operated as a sub-agent within the "Walker Posse" — a family of specialized agents orchestrated by Benny J Walker (the primary OpenClaw agent).

How it worked on Mac Mini:

  • Benny (main agent) ran cron jobs that spawned ephemeral Scribe Walker sessions
  • Each session received a task message + the full seo/SCRIBE-WALKER-CONTEXT.md (~700 lines)
  • The session wrote articles, published them, and terminated
  • Benny's own agent backing (SOUL.md, MEMORY.md, identity, reliability patterns) provided implicit quality
  • OpenClaw managed session lifecycle, compaction, error handling

What made it effective (proven over 1,285+ articles on tahaabbasi.com):

| Capability | How It Worked | Why It Mattered |
| --- | --- | --- |
| Quality gates | Word count enforcement (1000+ min), pre-publish checklist, self-review | Prevented thin/low-quality content from going live |
| Duplicate prevention | Last 100 titles checked contextually (not just exact slug match) | Avoided writing "Why Microneedling Works" 4 articles apart |
| Brand SEO integration | Brand in title, first paragraph, 3-5x naturally, CTAs, author bio | Core product value: what makes Scribe different from generic AI |
| Topic research | Industry awareness, seasonal relevance, niche-specific trends | Timely articles supplement strong evergreen foundation |
| Image-topic matching | DALL-E prompts crafted to match specific article content, not generic | Featured images that actually represent the article topic |
| Content restrictions | Configurable no-go list (topics already published, off-brand content) | Prevented brand damage and redundancy |
| Writing style enforcement | No em dashes, no "crucial"/"utilize", varied sentence length, human voice | Articles read as human-written, not AI-generated |
| Readability & engagement | Conversational tone, relatable scenarios, questions for flow, white space | Readers actually finish articles, not bounce |
| Source attribution | All claims linked to credible sources, original synthesis required | SEO authority, no plagiarism risk |
| CTA structure | Every article ends with warm, varied call-to-action | Drives business for the brand |

Evolution: EC2 Production Architecture

The EC2 deployment is NOT a 1:1 migration. It evolves the prototype into a multi-tenant product where Scribe Walker is the main agent on its own dedicated server.

Key Architectural Shift

MAC MINI (Prototype):
  Benny (main) → spawns ephemeral Scribe Walker → single brand (Taha)

EC2 (Production):
  Scribe Walker (main) → spawns article sessions → any brand (multi-tenant)

Scribe Walker on EC2 is equivalent to what Benny is on the Mac Mini — the primary agent with full OpenClaw capabilities: identity, memory, session management, compaction, hooks, model updates.

Why Main Agent (Not Sub-Agent)

| Benefit | Description |
| --- | --- |
| Full OpenClaw lifecycle | Compaction, session memory, command logging, all built in |
| Model updates for free | New Claude/OpenAI models = openclaw onboard update, no code changes |
| Security updates | OpenClaw security patches apply directly |
| Monitoring | openclaw health, openclaw status, gateway dashboard |
| Identity persistence | SOUL.md, AGENTS.md define consistent behavior across all sessions |
| Hook system | command-logger for diagnostics, session-memory for compaction resilience |

Component Architecture

graph TD
    subgraph EC2["🖥️ AWS EC2 (t3.small, us-east-1)"]
        subgraph SystemD["systemd Services"]
            GW["openclaw-gateway.service"]
            WK["scribe-worker.service (Scroll Worker)"]
        end

        subgraph OpenClaw["🤖 OpenClaw Gateway"]
            MainAgent["Scribe Walker (main agent)"]
            SOUL["SOUL.md — Identity & Principles"]
            AGENTS["AGENTS.md — Security & Operations"]
            Hooks["Hooks: command-logger, session-memory"]

            MainAgent --> ArticleSession1["Article Session (Brand A)"]
            MainAgent --> ArticleSession2["Article Session (Brand B)"]
            MainAgent --> ArticleSession3["Article Session (Brand C)"]
        end

        subgraph Worker["📜 Scroll Worker (job-worker.js)"]
            Poller["MongoDB Poller"]
            PromptBuilder["buildScribePrompt()"]
            Modules["Prompt Modules"]
        end

        WK --> Worker
        GW --> OpenClaw
        Poller -->|"openclaw agent --agent main"| MainAgent
        PromptBuilder --> Modules
    end

    subgraph PromptModules["📝 Prompt Intelligence (worker/prompts/)"]
        AW["article-writing.js — Orchestration"]
        QR["quality-rules.js — Quality gates, readability, CTA"]
        DI["dalle-image.js — Image generation rules"]
        TG["tags.js — Standard tag taxonomy"]
    end

    subgraph MongoDB["🗄️ MongoDB Atlas"]
        Jobs["jobs collection"]
        Articles["articles collection"]
        Sites["sites collection — brand config"]
    end

    subgraph External["🌐 External APIs"]
        Claude["Claude Opus 4.6"]
        DallE["DALL-E 3"]
        WebSearch["Web Search (topic research)"]
    end

    Poller -->|"poll pending jobs"| Jobs
    Sites -->|"brand, niche, location, tone, demographics"| PromptBuilder
    PromptBuilder -->|"assembled prompt"| Poller
    ArticleSession1 --> Claude
    ArticleSession1 --> DallE
    ArticleSession1 --> WebSearch
    ArticleSession1 -->|"write completed articles"| Articles
    Modules --> PromptModules

Intelligence Layers

The Scribe Walker's article-writing intelligence is distributed across four layers:

Layer 1: Agent Identity (OpenClaw Workspace)

Files in the agent's workspace directory that define WHO the agent is:

| File | Purpose |
| --- | --- |
| SOUL.md | Core identity, principles, writing philosophy |
| AGENTS.md | Security rules, operational boundaries, allowed/disallowed actions |
| IDENTITY.md | Name, role, platform context |
| TOOLS.md | Environment details, available tools |

These are loaded by OpenClaw for every session. They provide the persistent "personality" and guardrails.

Layer 2: Prompt Modules (Code — worker/prompts/)

Centralized, version-controlled prompt components assembled per-job:

| Module | What It Contains | Used By |
| --- | --- | --- |
| article-writing.js | Main orchestration prompt, workflow, MongoDB instructions | Article generation |
| quality-rules.js | Word count, readability, engagement rules, CTA format, brand SEO | Article generation, regeneration |
| dalle-image.js | Image style rules, demographic matching, size/format requirements | Article generation, image regeneration |
| tags.js | Standard tag taxonomy | Article generation |

Key design: These modules are the single source of truth for quality rules. Both article generation and image regeneration call the same functions, ensuring consistency.

Layer 3: Site Configuration (MongoDB)

Per-customer data that customizes each job:

interface SiteConfig {
  brandName: string;           // "Sally's Spa"
  niche: string;               // "Med Spa"
  location?: string;           // "Daybreak, South Jordan, UT"
  tone?: string;               // "professional" | "casual" | "authoritative"
  topicStyles: string[];       // ["how-to", "tips", "why"]
  website?: string;            // "https://sallysspa.com"
  socials?: {                  // Social media links for CTAs
    facebook?: string;
    instagram?: string;
    x?: string;
  };
  demographicProfile?: {       // For image generation demographic matching
    primaryDemo: string;       // "caucasian women"
    diversity: string;         // "moderate"
    region: string;            // "suburban"
    typicalAge: string;        // "30-55"
    notes?: string;
  };
  contentRestrictions?: {      // Things the brand does NOT offer/want
    excludeTopics?: string[];  // ["botox", "surgery"]
    excludeCompetitors?: string[];
    requiredDisclosures?: string[];
  };
}

Layer 4: Quality Intelligence (Agent Behavior — To Be Enhanced)

These are the proven patterns from the Mac Mini that must be incorporated as agent-level capabilities, not just prompt text:

4a. Duplicate Prevention

Problem: Without dedup, the agent writes "5 Benefits of Microneedling" every few runs.

Mac Mini approach: Fetch last 100 titles + slugs, contextual matching (not just exact), reject topic-level duplicates.

EC2 approach (refined):

  • Before writing, query MongoDB for the site's existing article titles
  • Exact/near-exact title match ONLY — "Why Microneedling Works" and "Why Microneedling Works!" = duplicate. But "Why Microneedling Works" and "Benefits of Microneedling for Your Skin" = ALLOWED (different angle, both valuable)
  • No contextual/semantic dedup — this backfires. Businesses WANT multiple articles covering the same topic from different angles. A med spa should have articles about microneedling benefits, preparation, aftercare, comparisons, etc.
  • Dedup window scales with plan: Free (15 articles) = check all. Pro (50) = check all. Scale (150) = last 100 cap. Agency = configurable.
  • Token cost: Titles only, ~500 tokens for 50 titles. Negligible.
  • Implementation: Title matching done in code (Scroll Worker / job-worker.js), NOT passed to Claude. Avoids Claude being overly conservative.
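
A minimal sketch of exact/near-exact matching in code. The function names and the normalization rules (lowercase, strip punctuation, collapse whitespace) are assumptions about what "near-exact" means here; the per-plan window sizes come from the list above, and the Agency tier is left out because its window is configurable. The character filter is ASCII-only, so titles in other scripts would need a broader pattern.

```typescript
type PlanTier = "free" | "pro" | "scale";

// How many of the most recent titles to compare against for each plan.
const DEDUP_WINDOW: Record<PlanTier, number> = { free: 15, pro: 50, scale: 100 };

function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation
    .replace(/\s+/g, " ")        // collapse runs of whitespace
    .trim();
}

// existingNewestFirst: the site's article titles, most recent first.
function isDuplicateTitle(candidate: string, existingNewestFirst: string[], tier: PlanTier): boolean {
  const wanted = normalizeTitle(candidate);
  return existingNewestFirst
    .slice(0, DEDUP_WINDOW[tier])
    .some((title) => normalizeTitle(title) === wanted);
}
```

Under this normalization "Why Microneedling Works!" matches "Why Microneedling Works", while "Benefits of Microneedling for Your Skin" does not, which is the behavior the rules above ask for.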

4b. Topic Research & Awareness

Problem: Generic articles are fine, but timely, relevant articles drive more traffic.

Mac Mini approach: Web searches for breaking news in the niche before each run.

EC2 approach (tiered):

  • Content mix: 80% evergreen / 20% seasonal-timely. Evergreen is the backbone for local business SEO. "How to Choose the Right Roofing Material" has value for years. Trending = "relevant to their customers right now" (e.g., "Spring Roof Maintenance Checklist"), NOT news slop.
  • Free tier: No web research. Evergreen articles only (cheaper, still high quality).
  • Pro tier: Light seasonal awareness (time of year, common seasonal topics for niche).
  • Scale/Agency: Web research enabled for timely content alongside evergreen.
  • Business-specific (Pro+ with services field, see #65): Only write about services/products the brand actually offers. Free tier writes generically about the niche without claiming the brand offers specific services.

4c. Brand SEO Integration

Problem: Articles without strong brand presence don't build SEO authority.

Mac Mini approach: Brand name in title, first paragraph, 3-5x naturally, backlink CTA, author bio.

EC2 approach (carried forward — already in quality-rules.js):

  • Brand name in article title (when it fits naturally)
  • Brand mentioned in first paragraph as the local expert
  • Brand in SEO meta description
  • Brand + location combos 2-3x naturally throughout
  • CTA section at article end with website/social links
  • NOT over-stuffed — natural and helpful

4d. Writing Quality Enforcement

Mac Mini approach: Extensive checklist, word count verification, style rules.

EC2 approach (carried forward — already in quality-rules.js):

  • Minimum 1200 words (target 1200-1800)
  • No em dashes, no "crucial"/"utilize"
  • Varied sentence length, conversational tone
  • Relatable scenarios, questions for flow
  • Subheadings, bullets, white space for readability
  • Original synthesis — not copied from sources

4e. Image-Topic Matching

Problem: Generic stock-photo-style images that don't match the article topic.

Mac Mini approach: Detailed DALL-E prompts describing the specific subject, never brand names (DALL-E blocks them).

EC2 approach (carried forward — already in dalle-image.js):

  • Prompts crafted to match specific article content
  • Describe distinctive visual features instead of brand names
  • Demographic matching when profile available
  • 1792x1024 landscape, realistic stock photo style
  • No text, logos, or watermarks

4f. Post-Publish Actions

Mac Mini approach: IndexNow ping, published log, delivery announce.

EC2 approach (see #66):

  • Update article status in MongoDB (already done)
  • Email notification to site owner (already done via Resend)
  • Subdomain articles (*.tryscribe.co): Submit to tryscribe.co Google Search Console, Bing Webmaster, IndexNow
  • Custom domain articles: Separate workflow, deferred until #7 ships
  • ⚠️ DEV MODE GATE: All search submissions gated behind NODE_ENV=production AND ENABLE_SEARCH_SUBMISSION=true. Both must be true. No test articles in search indices.
  • Analytics tracking (future)
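
The dev mode gate reduces to one boolean check. A minimal sketch, assuming the two variables are read as plain strings from the environment; `searchSubmissionEnabled` is a hypothetical name.

```typescript
// True only when BOTH conditions hold; anything else blocks search submission.
function searchSubmissionEnabled(env: Record<string, string | undefined>): boolean {
  return env["NODE_ENV"] === "production" && env["ENABLE_SEARCH_SUBMISSION"] === "true";
}
```

In the worker this would typically be called as `searchSubmissionEnabled(process.env)` before any IndexNow or webmaster submission, so a missing or misspelled flag fails closed.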

4g. Quality Check (22-Point SEO Audit)

The 22-point quality check is the product's quality standard. This is what differentiates Scribe from AI slop generators. Every article MUST pass this checklist before publishing.

Source of truth: platform/docs/SEO-QUALITY-CHECKLIST.md (replicated from tryscribe.co/seo-guidelines.html — update both when changing).

The full checklist must be incorporated into quality-rules.js as the authoritative quality gate.

4h. Services & Content Restrictions (Pro+ — #65)

Data model:

// In MongoDB site config
services?: string[];           // From curated niche list + custom approved
pendingServices?: string[];    // Custom entries awaiting admin review
excludeServices?: string[];    // Services to explicitly avoid

Prompt pattern: DB stores DATA (list of services). JS stores RULES (how to use that data).

  • Prompt: "Only write about services this brand offers: ${services}. Never claim they offer unlisted services."
  • If no services configured (free tier): "Write generally about the niche without making specific claims about what this brand offers."

Security:

  • Curated services list per niche category, plus "Other" free-text
  • Automated blocklist on submission (illegal/inappropriate terms)
  • Niche-mismatch soft flag: custom service doesn't match niche → flag for admin, don't block
  • Custom "Other" entries go to pendingServices — NOT in prompts until admin-approved
  • Attack surface: users can TYPE anything, but unapproved entries never affect article output
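
The "DB stores data, JS stores rules" pattern can be sketched as a small prompt-fragment builder. `buildServicesInstruction` and `ServiceConfig` are hypothetical names, not functions confirmed to exist in worker/prompts/; the two sentences are the ones quoted above. Note that the function never reads `pendingServices`, which is how unapproved entries stay out of the prompt.

```typescript
interface ServiceConfig {
  services?: string[];        // admin-approved services
  pendingServices?: string[]; // awaiting review; intentionally never read below
}

function buildServicesInstruction(config: ServiceConfig): string {
  const approved = (config.services ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  if (approved.length === 0) {
    return "Write generally about the niche without making specific claims about what this brand offers.";
  }
  return `Only write about services this brand offers: ${approved.join(", ")}. Never claim they offer unlisted services.`;
}
```

For a site with services ["facials", "microneedling"] and a pending entry, only the two approved names appear in the returned instruction.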

Prompt Assembly Flow

Job arrives from MongoDB
        │
        ▼
buildScribePrompt(job)
        │
        ├── Brand details (from job.params / site config)
        ├── Article IDs to update (from job.articleIds)
        ├── buildTagsBlock() — standard tag taxonomy
        ├── buildBrandSeoBlock() — brand SEO integration rules
        ├── Article structure template
        ├── buildQualityBlock() — quality rules, readability, engagement
        ├── buildDalleRulesBlock() — image generation rules
        └── Image upload API instructions
        │
        ▼
Complete prompt sent to:
  openclaw agent --agent main --message <prompt>
        │
        ▼
OpenClaw spawns article session with:
  - Agent identity (SOUL.md, AGENTS.md)
  - Assembled prompt (from buildScribePrompt)
  - Tools: MongoDB access, DALL-E API, web search, image upload API
        │
        ▼
Agent executes autonomously:
  1. Research trending topics for niche/location
  2. Check last 100 titles for duplicates
  3. Write articles with full quality gates
  4. Generate matched DALL-E images
  5. Upload images via CDN API
  6. Update article docs in MongoDB
  7. Report completion
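
The assembly order in the flow above can be sketched in code. The block builders below are stubs that return placeholder text, since their real bodies live in the prompt modules; `params.brandName`, `buildAgentArgs`, and `runAgent` are hypothetical names. Only the builder names, the section order, and the `openclaw agent --agent main --message <prompt>` invocation come from this document.

```javascript
const { execFile } = require('node:child_process');

// Stub builders: the real ones live in worker/prompts/*.js.
const buildTagsBlock = () => '[standard tag taxonomy]';
const buildBrandSeoBlock = () => '[brand SEO integration rules]';
const buildQualityBlock = () => '[quality rules, readability, engagement]';
const buildDalleRulesBlock = () => '[image generation rules]';

// Assemble the sections in the order shown in the flow diagram.
function buildScribePrompt(job) {
  const params = job.params || {};
  const sections = [
    `Brand: ${params.brandName || 'unknown'}`, // brandName is an assumed field
    `Article IDs to update: ${(job.articleIds || []).join(', ')}`,
    buildTagsBlock(),
    buildBrandSeoBlock(),
    '[article structure template]',
    buildQualityBlock(),
    buildDalleRulesBlock(),
    '[image upload API instructions]',
  ];
  return sections.join('\n\n');
}

// Argument vector for: openclaw agent --agent main --message <prompt>
function buildAgentArgs(prompt) {
  return ['agent', '--agent', 'main', '--message', prompt];
}

// execFile passes the prompt as a single argv entry, so no shell quoting
// is needed even when the prompt contains quotes or newlines.
function runAgent(prompt, callback) {
  return execFile('openclaw', buildAgentArgs(prompt), callback);
}
```

Using `execFile` rather than a shell string is a design choice in this sketch: prompts contain user-supplied brand data, and avoiding the shell removes one injection path.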

File Layout on EC2

/home/ubuntu/
├── scribe/                          # Git repo (tryscribeco/scribe)
│   ├── platform/                    # Next.js app (deployed to Vercel)
│   │   └── docs/
│   │       └── ARCHITECTURE.md      # This document
│   └── worker/
│       ├── job-worker.js            # Scroll Worker — job poller + session spawner
│       └── prompts/                 # Prompt intelligence modules
│           ├── article-writing.js   # Main prompt builder
│           ├── quality-rules.js     # Quality, CTA, brand SEO
│           ├── dalle-image.js       # Image generation rules
│           └── tags.js              # Standard tag taxonomy
│
└── .openclaw/
    ├── openclaw.json                # OpenClaw config (main agent = Scribe Walker)
    ├── workspace/                   # Agent workspace
    │   ├── SOUL.md                  # Scribe Walker identity
    │   ├── AGENTS.md                # Security rules, operational boundaries
    │   ├── IDENTITY.md              # Name, role, platform
    │   ├── TOOLS.md                 # Environment details
    │   └── HEARTBEAT.md             # No proactive tasks (headless)
    └── agents/
        └── main/
            ├── agent/
            │   └── auth-profiles.json  # Anthropic auth
            └── sessions/               # Session history

/etc/scribe/
└── .env                             # Secrets (root:root, 600)

/etc/systemd/system/
├── openclaw-gateway.service         # OpenClaw gateway (always running)
└── scribe-worker.service            # Scroll Worker (always running)

Migration Status

| Phase | Status | Details |
| --- | --- | --- |
| AWS Account Foundation (#63) | ✅ Complete | Org, IAM Identity Center, Production account, budget |
| EC2 Provisioning | ✅ Complete | t3.small, Ubuntu 24.04, hardened, Elastic IP |
| Node.js + Repo | ✅ Complete | Node 22, npm install, env file |
| OpenClaw Install + Onboard | ✅ Complete | v2026.3.2, Opus 4.6, hooks enabled |
| OpenClaw Gateway Service | ✅ Complete | System-level systemd, RPC probe OK |
| Agent Workspace (initial) | ✅ Complete | SOUL.md, AGENTS.md deployed |
| Architecture Review | 🔄 In Progress | This document, awaiting approval |
| Agent Workspace (enhanced) | ⬜ Pending | Incorporate quality intelligence from architecture |
| Scroll Worker Service | ✅ Complete | systemd unit for job-worker.js (scribe-worker.service) |
| Cutover & Testing | ⬜ Pending | Test articles, 24h monitor, kill Mac worker |
| Post-Migration Hardening | ⬜ Pending | Structured logging, graceful shutdown, alerting |

Resolved Questions (Architecture Review — Mar 7, 2026)

| # | Question | Decision |
| --- | --- | --- |
| 1 | Topic research scope | Tiered: Free = none (evergreen only). Pro = seasonal awareness. Scale/Agency = web research. |
| 2 | Duplicate prevention window | Scales with plan: Free = all (15). Pro = all (50). Scale = last 100. Agency = configurable. Exact/near-exact match only. |
| 3 | Content restrictions storage | DB for data (services list), JS for rules (how to use that data). Custom entries require admin approval. |
| 4 | IndexNow | Submit for tryscribe.co subdomains. Custom domains deferred to #7. Dev mode gate required. See #66. |
| 5 | Prompt module updates | Keep in JS files. Deploy = git pull + restart (~10 sec). Security > hot-reload convenience. |
| 6 | Memory across jobs | Stateless. Each job independent. Mac Mini was stateless too (ephemeral sessions). Consistent. |
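
Decisions 1 and 2 amount to a per-plan policy table. The sketch below is one possible encoding, under several assumptions: the plan keys, the `PLAN_POLICY` shape, the fallback of 100 for an unconfigured Agency window, and the definition of "near-exact" as equality after lowercasing and stripping punctuation are all hypothetical. `recentTitles` is assumed to be ordered newest first.

```javascript
// Hypothetical policy table derived from decisions 1 and 2.
const PLAN_POLICY = {
  free: { research: 'none', duplicateWindow: 15 },
  pro: { research: 'seasonal', duplicateWindow: 50 },
  scale: { research: 'web', duplicateWindow: 100 },
  agency: { research: 'web', duplicateWindow: null }, // configurable per site
};

// Resolve how many recent titles to compare against for a plan.
function duplicateWindowFor(plan, configuredWindow) {
  const policy = PLAN_POLICY[plan];
  if (!policy) {
    throw new Error(`Unknown plan: ${plan}`);
  }
  if (policy.duplicateWindow !== null) {
    return policy.duplicateWindow;
  }
  // Agency: use the configured value, else an assumed default of 100.
  return Number.isInteger(configuredWindow) && configuredWindow > 0
    ? configuredWindow
    : 100;
}

// Assumed notion of "near-exact": same text ignoring case and punctuation.
function normalizeTitle(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

function isDuplicateTitle(candidate, recentTitles, plan, configuredWindow) {
  const limit = duplicateWindowFor(plan, configuredWindow);
  const target = normalizeTitle(candidate);
  return recentTitles
    .slice(0, limit)
    .some((title) => normalizeTitle(title) === target);
}
```

Titles that merely share a topic are not treated as duplicates here, which is consistent with the "exact/near-exact match only" decision.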

Testing Strategy (Migration — #67)

Worker routing via job document field:

  • Default (no field): EC2 picks up the job
  • worker: "local": Mac Mini picks up the job (triggered via ?worker=local API param)
  • Both workers run simultaneously during migration validation
  • Compare article quality side by side
  • Remove routing code after EC2 validated and Mac worker decommissioned
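
The routing rule can be expressed as a pair of MongoDB query filters, one per worker. This is a sketch under assumptions: the `status: 'pending'` field, the function names, and the worker kind labels `'ec2'` and `'local'` are hypothetical; only the `worker: "local"` job field and the `?worker=local` API parameter come from this document.

```javascript
// Hypothetical: build the filter each worker uses when polling for jobs.
function pendingJobFilter(workerKind) {
  const base = { status: 'pending' }; // assumed status field and value
  if (workerKind === 'local') {
    // Mac Mini only picks up jobs explicitly tagged for it.
    return { ...base, worker: 'local' };
  }
  // EC2 picks up jobs that carry no worker field at all.
  return { ...base, worker: { $exists: false } };
}

// Hypothetical: apply the ?worker=local API param when creating a job.
function applyWorkerParam(jobDoc, query) {
  return query.worker === 'local' ? { ...jobDoc, worker: 'local' } : jobDoc;
}
```

Because the two filters are mutually exclusive, both workers can poll the same collection during validation without competing for the same job, and removing the routing later only means dropping the `worker` clause from the EC2 filter.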