
Agents Inc. CLI — V2 Vision

Living document for planning what comes next: finishing the v1 release, and where to go after.


Phase 0: Small Wins Before Release

Polish tasks that need to land before v1 feels complete.

  • README overhaul — fix the logo, fix the screenshots, create a GIF demo (D-109, D-110, D-111)
  • Info panel — replace the ? help overlay with the I info panel showing selected skills/agents in a scope × mode grid (D-144)

Phase 1: Launch (target: Monday 2026-03-30)

Goal: Finish all remaining blockers over the weekend and ship v1 on Monday.

Create a dedicated launch TODO document (todo/TODO-launch.md) — a focused checklist of everything that must happen before and on launch day. Includes:

  • All remaining Phase 0 small wins
  • Startup performance — cache the generated marketplace matrix, and only generate/merge the custom skills overlay at runtime (D-97). The slowdown is noticeable enough to be a launch blocker (see the caching sketch after this list).
  • "Skill not in matrix" warnings on global scope — spurious warnings when running on global scope with a small skill selection or custom skills. Needs investigation: test all permutations of global-only, project-only, mixed, with and without custom skills.
  • Rename "local mode" to "eject mode" — "local" vs "plugin" is confusing terminology. "Eject" is immediately understood: you're ejecting the skill into your project for customization (D-156). Must land before launch.
  • Investigate renaming "global" scope to "user" scope — "project" scope is clear, but "global" is overloaded in dev tooling. "User" scope may communicate better that it's per-user, not machine-wide. Lower conviction on this one — decide before launch (D-118).
  • Usage analytics — Track installations and skill/agent selections; even a private dashboard to start. Must be opt-in with clear disclosure. (An illustrative event schema follows this list.)
    • Options to investigate: Amplitude (proper product analytics, funnels, retention), Firebase (quick and simple), PostHog (open-source, self-hostable). Research which tool gets us a dashboard fastest with the least integration effort.
    • Events to capture: CLI install, cc init completion, cc edit completion, each skill selected, each agent selected, scope (global/project).
  • Bug fixes that are release-blocking (D-92, D-152, D-123)
  • Any final polish
  • Launch strategy research — investigate how best to launch a dev tool like this: where to post (HN, Reddit, X, Discord communities, etc.), what written content to prepare (blog post? announcement thread?), posting cadence, order of operations
  • Launch day execution steps
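
To make the startup-performance item concrete, here is a minimal sketch of the D-97 caching idea. Everything in it is hypothetical — the cache path, the `generateMatrix`/`mergeOverlay` helpers, and the version key are placeholder names, not the CLI's current internals.

```typescript
import { createHash } from "node:crypto";
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Hypothetical shapes and helpers -- placeholder names, not the real CLI API.
interface SkillMatrix { skills: Record<string, { compatibleWith: string[] }>; }
declare function generateMatrix(marketplaceDir: string): SkillMatrix;         // expensive
declare function generateCustomOverlay(customSkillsDir: string): SkillMatrix; // cheap
declare function mergeOverlay(base: SkillMatrix, overlay: SkillMatrix): SkillMatrix;

const CACHE_PATH = "/tmp/cc-matrix-cache.json"; // placeholder location

// Cache the expensive marketplace matrix, keyed by the marketplace version;
// only the custom-skills overlay is regenerated and merged on every run.
export function loadMatrix(
  marketplaceDir: string,
  customSkillsDir: string,
  marketplaceVersion: string,
): SkillMatrix {
  const key = createHash("sha256").update(marketplaceVersion).digest("hex");

  let base: SkillMatrix | undefined;
  if (existsSync(CACHE_PATH)) {
    const cached = JSON.parse(readFileSync(CACHE_PATH, "utf8"));
    if (cached.key === key) base = cached.matrix as SkillMatrix;
  }
  if (!base) {
    base = generateMatrix(marketplaceDir);
    writeFileSync(CACHE_PATH, JSON.stringify({ key, matrix: base }));
  }

  // Custom skills change often and are cheap to process, so they stay runtime-only.
  return mergeOverlay(base, generateCustomOverlay(customSkillsDir));
}
```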
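And for the analytics item, a sketch of what the opt-in event shapes could look like, independent of which tool wins the evaluation. The event names and the `AnalyticsSink` interface are illustrative assumptions, not a chosen SDK's API.

```typescript
// Illustrative event shapes for the opt-in analytics item.
type Scope = "global" | "project";

type AnalyticsEvent =
  | { name: "cli_install" }
  | { name: "init_completed"; scope: Scope }
  | { name: "edit_completed"; scope: Scope }
  | { name: "skill_selected"; skillId: string; scope: Scope }
  | { name: "agent_selected"; agentId: string; scope: Scope };

// Whichever backend is chosen (Amplitude, Firebase, PostHog), the call site can stay the same.
interface AnalyticsSink { capture(event: AnalyticsEvent): Promise<void>; }

// Opt-in gate: nothing is sent unless the user explicitly enabled analytics.
export async function track(sink: AnalyticsSink, optedIn: boolean, event: AnalyticsEvent): Promise<void> {
  if (!optedIn) return;
  try {
    await sink.capture(event);
  } catch {
    // Analytics must never break the CLI; failures are swallowed.
  }
}
```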

Phase 2: Post-Release — Reduce Frustration

Philosophy: The tool already gives a lot. But a few bad experiences can undo all of that. Post-launch priority is eliminating the sharp edges — the moments where someone tries something reasonable and gets burned.

Notes (to be turned into TODOs)

  • Global uninstall safety — Track which projects depend on global-scoped plugins (D-131). When someone uninstalls a global skill, warn them: "These 3 projects will break." Don't let a single cc uninstall silently destroy other setups. (A minimal sketch follows this list.)
  • General theme: reduce ability to break things — Audit all destructive operations for this pattern. Anywhere a user action in one scope can silently damage another scope, add guardrails. The goal is: no reasonable user action should produce a broken state without a warning.
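
A minimal sketch of the D-131 check, assuming a small registry that records which projects reference each global skill. The registry shape and function name are assumptions, not existing CLI internals.

```typescript
// Hypothetical registry mapping a global skill id to the projects that reference it (D-131 sketch).
interface GlobalRegistry {
  dependents: Record<string, string[]>; // skillId -> absolute project paths
}

// Run before any destructive global uninstall; surface the blast radius instead of breaking silently.
export function checkGlobalUninstall(
  registry: GlobalRegistry,
  skillId: string,
): { safe: boolean; warning?: string } {
  const projects = registry.dependents[skillId] ?? [];
  if (projects.length === 0) return { safe: true };

  return {
    safe: false,
    warning:
      `These ${projects.length} project(s) will break if "${skillId}" is uninstalled:\n` +
      projects.map((p) => `  - ${p}`).join("\n"),
  };
}
```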

Web UI — Skills Explorer

A public web app (React) that visualizes the entire skills catalog — think periodic table of elements, but for skills. The flow:

  1. Start with a framework pick — prompt the user to choose a frontend framework (React, Vue, Angular, etc.)
  2. Live constraint visualization — selecting a framework instantly grays out incompatible skills, highlights recommended ones, surfaces setup dependencies. Same logic as the CLI wizard's build step, but visual and explorable.
  3. Browse and select — users explore domains/categories, toggle skills on/off, see the relationships play out in real time
  4. Generate a seed — the final selection produces a shareable seed string (encoded config) that you pass to cc init --seed <value> to bootstrap the exact setup in one command

This serves two purposes: (1) a discovery/marketing tool — people can explore what's available without installing anything, and (2) a practical onboarding shortcut — share a seed link with your team and everyone gets the same setup.
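
One way the seed could work, sketched under the assumption that the selection is serialized to JSON and base64url-encoded. The payload shape is illustrative; the real encoding may differ.

```typescript
// Hypothetical seed payload -- the real encoding may look different.
interface SeedConfig {
  version: 1;
  framework: string;            // e.g. "react"
  skills: string[];             // selected skill ids
  agents: string[];             // selected agent ids
  scope: "global" | "project";
}

// Explorer side: turn the final selection into a copy-pastable seed string.
export function encodeSeed(config: SeedConfig): string {
  return Buffer.from(JSON.stringify(config)).toString("base64url");
}

// CLI side: `cc init --seed <value>` would decode and validate before bootstrapping.
export function decodeSeed(seed: string): SeedConfig {
  const parsed = JSON.parse(Buffer.from(seed, "base64url").toString("utf8"));
  if (parsed.version !== 1) throw new Error(`Unsupported seed version: ${String(parsed.version)}`);
  return parsed as SeedConfig;
}
```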

Curated Alternatives & Grading

Each skill in the explorer shows graded alternatives — ~10 competing skills per category (e.g., Zustand vs Jotai vs MobX vs Redux Toolkit for state management). Each alternative is graded on a consistent rubric.

  • I choose which alternatives appear — no user submissions on the explorer side. I test, grade, and upload each alternative myself. This keeps quality high and prevents gaming.
  • Grading rubric has two layers:
    • Clean code quality (~100 universal test cases): Does the skill produce declarative, context-free, idiomatic code? These tests evaluate the output code, not the skill itself — the skill doesn't need to "know" about clean code standards. It either produces clean code or it doesn't.
    • Skill-specific correctness (~10 test cases per skill): Does the skill accomplish its domain goal correctly? Proper store patterns for state management, correct route setup for API frameworks, etc.
  • Composite score displayed per alternative — users can see at a glance which option produces the best code for their stack, and choose between my curated pick and a third-party alternative when building their selection (a scoring sketch follows this list).
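
A sketch of how the two layers could roll up into the displayed composite score. The weights and result shapes are illustrative assumptions, not the final rubric.

```typescript
// Illustrative result shapes for the two grading layers.
interface LayerResult { passed: number; total: number; }

interface SkillGrade {
  cleanCode: LayerResult;    // ~100 universal clean-code tests, run against the output code
  correctness: LayerResult;  // ~10 skill-specific tests
}

// Assumed weighting -- the real rubric may balance the layers differently.
const CLEAN_CODE_WEIGHT = 0.6;
const CORRECTNESS_WEIGHT = 0.4;

// Single 0-100 number shown next to each alternative in the explorer.
export function compositeScore(grade: SkillGrade): number {
  const clean = grade.cleanCode.passed / grade.cleanCode.total;
  const correct = grade.correctness.passed / grade.correctness.total;
  return Math.round(100 * (CLEAN_CODE_WEIGHT * clean + CORRECTNESS_WEIGHT * correct));
}
```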

Leaderboard — Open Submissions

Separate from the curated explorer: a public leaderboard where anyone can submit skills and agents to be ranked. This tests actual usable code quality — not logic puzzles or test-passing, but the conventions that matter in real codebases (declarative patterns, readability, minimal context dependency, idiomatic TypeScript).

How submissions work

  • Same two-layer scoring as the explorer grading: universal clean code tests + skill-specific tests
  • Anti-cheese strategy: Test cases are partially hidden (a public subset for development, a private remainder for scoring) and supplemented by generative test cases — templates that produce unique scenarios each run, so submitters can't overfit to a static test suite (a minimal template sketch follows this list)
  • Manual sign-off required before a submission appears on the public leaderboard. I review flagged submissions and spot-check on novel scenarios. This is the short-term gate; generative tests are the long-term solution.
  • Adversarial probing as a third scoring dimension: curveball prompts, ambiguous requirements, edge cases — a genuinely good skill handles these; a tuned-for-tests skill falls apart
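
To make the generative-test idea concrete, here is a sketch of a parameterized template that renders a fresh scenario each scoring run. The shapes and the example check are assumptions for illustration only.

```typescript
// Hypothetical generative template: each scoring run instantiates fresh parameters,
// so there is no static suite a submission can overfit to.
interface TestCase {
  prompt: string;
  check: (outputCode: string) => boolean;
}

interface TestTemplate {
  id: string;
  render: (rng: () => number) => TestCase;
}

// Example: a state-management template that varies the store per run.
export const storeTemplate: TestTemplate = {
  id: "state/store-shape",
  render: (rng) => {
    const entities = ["cart", "session", "profile", "settings"];
    const entity = entities[Math.floor(rng() * entities.length)];
    return {
      prompt: `Create a ${entity} store with load/update actions and a selector for derived state.`,
      // Trivial structural check for the sketch; real scoring would be far richer.
      check: (code) => code.includes(entity) && !code.includes(": any"),
    };
  },
};

export function instantiate(templates: TestTemplate[], rng: () => number): TestCase[] {
  return templates.map((t) => t.render(rng));
}
```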

Cost model

Leaning toward a queued free tier: free submissions go into a slow batch queue (processed overnight), while paid submissions ($2–5) get immediate results. The submission fee is refunded if the skill scores above a quality threshold — this self-selects for quality and filters spam without killing casual participation.

The alternative of requiring users to bring their own API keys adds too much friction for the value it provides.

IP protection

The test corpus (150+ skills × 10 tests + 100 universal tests) is months of work and encodes taste/judgment. Mitigations:

  • Never expose test content — show scores and pass/fail, never the actual prompts or expected outputs
  • Generative tests mean there's no static suite to steal — just templates
  • Rate limiting prevents systematic extraction via repeated submissions
  • The concept is replicable but the execution is the moat