Agent Check is a standalone TypeScript CLI and library for running provider-neutral testing plans. It receives a plan, validates it, executes it through runtime providers, records evidence, classifies failures, and optionally suggests refinements.
It is not a test authoring system. A plan can come from a human, another agent, or any external workflow.
- Keep the testing plan durable and provider-neutral.
- Put concrete automation engines behind provider plugins.
- Use LLM inference only as runtime assistance, not as the source of truth.
- Produce inspectable artifacts for every run.
- Make failures explainable as `app_bug`, `plan_unclear`, or `environment`.
- Let the CLI and TypeScript library share the same core implementation.
src/
  cli/        command-line entrypoint
  config/     runtime config parsing
  core/       runner, provider registry, artifacts, result model
  llm/        provider-neutral LLM interface and Vercel AI SDK adapter
  providers/  built-in runtime providers
  schema/     testing plan schema and YAML parsing
examples/     sample plans and fixtures
docs/         architecture, schema, LLM, and provider guides
skills/       agent skill for writing testing plans
Important files:
| File | Responsibility |
|---|---|
| `src/cli/index.ts` | Implements `validate`, `run`, `refine`, `providers`, and `doctor`. |
| `src/schema/testingPlan.ts` | Defines the Zod schema for testing plans. |
| `src/config/runtimeConfig.ts` | Loads runtime config from `agent-check.config.yaml`. |
| `src/core/runner.ts` | Orchestrates execution, verification, classification, and artifacts. |
| `src/core/provider.ts` | Defines the provider plugin contract. |
| `src/core/providerRegistry.ts` | Chooses compatible providers for steps. |
| `src/core/artifacts.ts` | Writes run artifacts under `.agent-check/runs/<runId>/`. |
| `src/llm/llmClient.ts` | Defines structured LLM tasks and result contracts. |
| `src/llm/vercelAiLlmClient.ts` | Uses the Vercel AI SDK for model-provider abstraction. |
| `src/providers/index.ts` | Registers built-in providers. |
CLI/library call
-> load runtime config
-> parse and validate testing plan
-> interpolate variables
-> create artifact writer
-> create provider registry
-> create optional LLM client
-> run flow steps in order
-> write result.json, trace.json, trace.jsonl, llm-trace.json
Each flow step follows this lifecycle:
observe app state
-> rank candidates
-> resolve high-level candidates with LLM when needed
-> ask providers canHandle()
-> execute operation
-> observe post-step state
-> verify success
-> collect artifacts and refinements
-> continue or stop on failure
The runner stops at the first failed step. That keeps the failure focused and prevents later steps from producing misleading noise.
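The stop-on-first-failure rule can be sketched as a plain loop. This is a minimal illustration, not the runner's actual code: `FlowStep`, `FlowOutcome`, and `runFlow` are hypothetical names, and each step's `run` callback stands in for the whole observe, resolve, execute, and verify lifecycle.

```typescript
// Hypothetical shapes for illustration; the real runner types are not shown in this document.
type StepStatus = "passed" | "failed";

interface FlowStep {
  id: string;
  // Stand-in for observe -> resolve -> execute -> verify on a single step.
  run: () => Promise<StepStatus>;
}

interface FlowOutcome {
  status: StepStatus;
  failedStepId?: string;
  executedStepIds: string[];
}

// Runs steps in order and returns at the first failure,
// so later steps never execute against a broken app state.
async function runFlow(steps: readonly FlowStep[]): Promise<FlowOutcome> {
  const executedStepIds: string[] = [];
  for (const step of steps) {
    executedStepIds.push(step.id);
    const status = await step.run();
    if (status === "failed") {
      return { status: "failed", failedStepId: step.id, executedStepIds };
    }
  }
  return { status: "passed", executedStepIds };
}
```

Returning early, rather than collecting every failure, keeps `failedStepId` unambiguous: it always names the first step that broke.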
The plan describes user-visible behavior:
- app surface, such as `web`, `tui`, `desktop`, `electron`, or `mock`
- intent and acceptance criteria
- ordered flow steps
- variables
- candidate resolution levels
- success conditions
- failure and refinement policy
The plan must not name implementation engines such as Playwright, Appium, Selenium, OpenAI, or a specific terminal automation library.
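One way to enforce this rule is to scan the parsed plan for engine names. The sketch below is an assumption about how such a check could look, not the project's schema code: `BANNED_ENGINE_NAMES` and `findBannedEngineNames` are hypothetical, the list only covers the names mentioned above, and plain substring matching can produce false positives.

```typescript
// Engine names taken from the sentence above; the real schema's list may differ.
const BANNED_ENGINE_NAMES = ["playwright", "appium", "selenium", "openai"] as const;

// Recursively collects the paths of string values that mention a banned engine name.
function findBannedEngineNames(value: unknown, path = "plan"): string[] {
  if (typeof value === "string") {
    const lower = value.toLowerCase();
    return BANNED_ENGINE_NAMES.some((name) => lower.includes(name)) ? [path] : [];
  }
  if (Array.isArray(value)) {
    return value.flatMap((item, index) => findBannedEngineNames(item, `${path}[${index}]`));
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).flatMap(([key, child]) =>
      findBannedEngineNames(child, `${path}.${key}`),
    );
  }
  return [];
}
```

Reporting paths rather than a boolean lets a validator point the plan author at the exact field to reword.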
Target fields may include provider-owned launch configuration, such as baseUrl,
headless, browserChannel, or launchCommand. Those values describe how to
reach the app during this run; they are not candidate semantics.
agent-check.config.yaml controls local runtime policy:
artifactStore: .agent-check
execution:
  maxRecoveryAttemptsPerStep: 10
llm:
  enabled: true
  model: codex-cli/gpt-5.5
providers:
  mock:
    enabled: true
  webPlaywright:
    enabled: true

This file can vary by machine, CI job, or workspace. The plan should remain portable across those environments.
Providers translate abstract operations into concrete runtime actions.
export interface ProviderPlugin {
  id: string;
  name: string;
  capabilities(): readonly ProviderCapability[];
  observe(input: ProviderHandleInput): Promise<AppObservation>;
  canHandle(input: ProviderHandleInput): Promise<ProviderHandleAssessment>;
  execute(input: ProviderHandleInput): Promise<ProviderExecutionResult>;
  verify(input: ProviderHandleInput): Promise<ProviderVerificationResult>;
  dispose?(): Promise<void>;
}

Provider responsibilities:
- create or attach to the app surface
- observe visible state
- determine if a candidate can be handled
- execute operations
- verify assertions that can be checked deterministically
- return screenshots, snapshots, logs, or other evidence
- clean up runtime resources
The runner owns sequencing and classification. Providers own surface-specific mechanics.
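To make the contract concrete, here is a small in-memory provider that implements it. The supporting types (`ProviderCapability`, `ProviderHandleInput`, and the result shapes) are not defined in this document, so the versions below are simplified assumptions; `InMemoryFormProvider` and its `fill` and `assertText` operations are purely illustrative.

```typescript
// Hypothetical minimal shapes; the real definitions live in src/core/provider.ts.
type ProviderCapability = { surface: string; operations: readonly string[] };
type ProviderHandleInput = { operation: string; candidate: Record<string, string>; value?: string };
type AppObservation = { visibleText: string[] };
type ProviderHandleAssessment = { canHandle: boolean; confidence: number };
type ProviderExecutionResult = { ok: boolean; evidence: string[] };
type ProviderVerificationResult = { passed: boolean; reason?: string };

interface ProviderPlugin {
  id: string;
  name: string;
  capabilities(): readonly ProviderCapability[];
  observe(input: ProviderHandleInput): Promise<AppObservation>;
  canHandle(input: ProviderHandleInput): Promise<ProviderHandleAssessment>;
  execute(input: ProviderHandleInput): Promise<ProviderExecutionResult>;
  verify(input: ProviderHandleInput): Promise<ProviderVerificationResult>;
  dispose?(): Promise<void>;
}

// Keeps form fields in a Map and only understands "fill" and "assertText".
class InMemoryFormProvider implements ProviderPlugin {
  readonly id = "in-memory-form";
  readonly name = "In-memory form (illustrative)";
  private readonly fields = new Map<string, string>();

  capabilities(): readonly ProviderCapability[] {
    return [{ surface: "mock", operations: ["fill", "assertText"] }];
  }

  async observe(): Promise<AppObservation> {
    return { visibleText: Array.from(this.fields.values()) };
  }

  // Handles only known operations on candidates that carry a stable id.
  async canHandle(input: ProviderHandleInput): Promise<ProviderHandleAssessment> {
    const known = input.operation === "fill" || input.operation === "assertText";
    const ok = known && typeof input.candidate.stableId === "string";
    return { canHandle: ok, confidence: ok ? 1 : 0 };
  }

  async execute(input: ProviderHandleInput): Promise<ProviderExecutionResult> {
    const id = input.candidate.stableId;
    if (input.operation !== "fill" || id === undefined || input.value === undefined) {
      return { ok: false, evidence: [] };
    }
    this.fields.set(id, input.value);
    return { ok: true, evidence: [`filled ${id}`] };
  }

  // Deterministic check: the stored value must equal the expected value.
  async verify(input: ProviderHandleInput): Promise<ProviderVerificationResult> {
    const actual = this.fields.get(input.candidate.stableId ?? "");
    const passed = actual !== undefined && actual === input.value;
    return passed ? { passed } : { passed, reason: `expected ${input.value}, saw ${actual}` };
  }

  async dispose(): Promise<void> {
    this.fields.clear();
  }
}
```

The provider never decides whether the run continues; it only reports what it did and what it saw, which leaves sequencing and classification to the runner.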
The provider registry filters providers by:
- target surface
- provider enabled state
- provider `canHandle()` confidence
- operation and candidate compatibility
The registry lets multiple providers exist for the same surface. For example, a
future web-cdp provider could sit beside web-playwright, and the runner can
choose based on canHandle() scoring.
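The selection described above can be sketched as filter, score, and pick. This is an assumption about the general approach, not the contents of `src/core/providerRegistry.ts`: `RegisteredProvider` and `selectProvider` are hypothetical, and confidence is modelled as a number between 0 and 1.

```typescript
// Hypothetical registry entry; the real registry works with ProviderPlugin instances.
interface RegisteredProvider {
  id: string;
  surface: string;
  enabled: boolean;
  // Confidence in [0, 1]; 0 means "cannot handle this operation".
  canHandle: (operation: string) => Promise<number>;
}

// Filters by surface and enabled state, then picks the highest canHandle() score.
async function selectProvider(
  providers: readonly RegisteredProvider[],
  surface: string,
  operation: string,
): Promise<RegisteredProvider | undefined> {
  const eligible = providers.filter((p) => p.enabled && p.surface === surface);
  const scored = await Promise.all(
    eligible.map(async (provider) => ({
      provider,
      confidence: await provider.canHandle(operation),
    })),
  );
  let best: { provider: RegisteredProvider; confidence: number } | undefined;
  for (const entry of scored) {
    if (entry.confidence > 0 && (best === undefined || entry.confidence > best.confidence)) {
      best = entry;
    }
  }
  return best?.provider;
}
```

Ties go to the provider registered first, because a later entry must score strictly higher to replace the current best.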
Candidates are ordered by runtime.resolutionOrder.
For the detailed candidate model, authoring guidance, and examples by level, see CANDIDATES.md.
Common order:
runtime:
  resolutionOrder:
    - exact
    - structural
    - semantic
    - task
    - intent
    - visual
    - providerHint

Exact and structural candidates can usually go straight to a provider.
High-level candidates such as `semantic`, `task`, `intent`, and `visual` are
resolved through the LLM when enabled. The LLM result is still provider-neutral,
for example:

exact:
  stableId: project-name-input

or:

structural:
  role: button
  name: Create Project

The provider then translates that candidate into actual browser, terminal, desktop, or Electron actions.
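Ordering candidates by `runtime.resolutionOrder` and deciding which ones need the LLM could look like the sketch below. The `Candidate` shape, `orderCandidates`, and `needsLlm` are hypothetical, and dropping candidates whose level is absent from the order is an assumption rather than documented behavior.

```typescript
type CandidateLevel =
  | "exact"
  | "structural"
  | "semantic"
  | "task"
  | "intent"
  | "visual"
  | "providerHint";

// Hypothetical candidate shape; see CANDIDATES.md for the real model.
interface Candidate {
  level: CandidateLevel;
  value: Record<string, string> | string;
}

// Levels the text above describes as resolved through the LLM.
const HIGH_LEVEL = new Set<CandidateLevel>(["semantic", "task", "intent", "visual"]);

// Sorts candidates by their position in resolutionOrder.
// Assumption: levels missing from the order are skipped.
function orderCandidates(
  candidates: readonly Candidate[],
  resolutionOrder: readonly CandidateLevel[],
): Candidate[] {
  return candidates
    .filter((c) => resolutionOrder.includes(c.level))
    .sort((a, b) => resolutionOrder.indexOf(a.level) - resolutionOrder.indexOf(b.level));
}

// True when the candidate must be lowered to an exact or structural form first.
function needsLlm(candidate: Candidate): boolean {
  return HIGH_LEVEL.has(candidate.level);
}
```

With the common order shown earlier, an `exact` candidate is always tried before a `semantic` one, so the LLM is consulted only when cheaper candidates are absent or fail.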
The LLM layer is behind LlmClient. The current implementation uses the Vercel
AI SDK adapter, which allows model-provider selection by model string.
The runner calls the LLM for:
- candidate resolution
- semantic and visual assertion judgement
- failure classification
- refinement suggestions
LLM details live in LLM.md.
Runs write to:
.agent-check/runs/<runId>/
Standard files:
| File | Contents |
|---|---|
| `result.json` | Final result, failed step, failure class, evidence paths, refinements. |
| `trace.json` | Structured provider and runner trace. |
| `trace.jsonl` | Lightweight chronological event log. |
| `llm-trace.json` | LLM calls, results, confidence, and errors. |
Providers may add screenshots, DOM snapshots, terminal output, desktop window snapshots, or engine traces.
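For tooling that needs to locate these files, the layout can be expressed as a small helper. This is an illustrative sketch only: `runArtifactPaths` and `RunArtifactPaths` are hypothetical names, and the forward-slash joining is a simplification (real code would more likely use `node:path`).

```typescript
interface RunArtifactPaths {
  runDir: string;
  result: string;
  trace: string;
  events: string;
  llmTrace: string;
}

// Builds the standard file locations for one run under the artifact store.
function runArtifactPaths(artifactStore: string, runId: string): RunArtifactPaths {
  const runDir = `${artifactStore}/runs/${runId}`;
  return {
    runDir,
    result: `${runDir}/result.json`,
    trace: `${runDir}/trace.json`,
    events: `${runDir}/trace.jsonl`,
    llmTrace: `${runDir}/llm-trace.json`,
  };
}
```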
A run produces RunResult.
Key fields:
| Field | Meaning |
|---|---|
| `status` | `passed` or `failed`. |
| `failedStepId` | First failed step, when any. |
| `failureClass` | `app_bug`, `plan_unclear`, or `environment`. |
| `evidence` | Artifact paths relevant to the run. |
| `trace` | Provider and runner trace entries. |
| `refinements` | Suggested plan improvements. |
Failure classes are intentionally small:
- `app_bug`: user-visible behavior failed.
- `plan_unclear`: the plan did not identify the target or outcome clearly.
- `environment`: runtime setup, provider, app launch, browser, terminal, desktop session, or LLM backend failed.
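A consumer of `result.json` might model the result and branch on the failure class as follows. The field names come from the table above, but their types are assumptions, and the `NEXT_ACTION` hints and `summarize` helper are illustrative, not part of the library.

```typescript
type FailureClass = "app_bug" | "plan_unclear" | "environment";

// Field names follow the RunResult table; the types are assumed for this sketch.
interface RunResult {
  status: "passed" | "failed";
  failedStepId?: string;
  failureClass?: FailureClass;
  evidence: string[];
  trace: unknown[];
  refinements: unknown[];
}

// Illustrative follow-up hints; one entry per class keeps the mapping exhaustive.
const NEXT_ACTION: Record<FailureClass, string> = {
  app_bug: "inspect the evidence and report the app defect",
  plan_unclear: "tighten the plan's candidates or success conditions",
  environment: "fix the runtime setup and rerun the unchanged plan",
};

// Produces a one-line summary suitable for CI logs.
function summarize(result: RunResult): string {
  if (result.status === "passed") {
    return "passed";
  }
  const step = result.failedStepId ?? "unknown step";
  if (result.failureClass === undefined) {
    return `failed at ${step} (unclassified)`;
  }
  return `failed at ${step} (${result.failureClass}): ${NEXT_ACTION[result.failureClass]}`;
}
```

Because the set of classes is small, `Record<FailureClass, string>` makes the compiler flag any class that lacks a handler.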
Providers and the LLM can suggest refinements. Examples:
- add an exact stable id candidate discovered during a run
- replace vague wording with a clearer semantic instruction
- add a structural role/name candidate
- clarify an ambiguous assertion
The runner writes refinements into result.json. It does not mutate the plan
during run.
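Keeping the plan immutable during a run implies that applying refinements yields a new plan rather than editing the one in use. The sketch below shows that idea under assumed shapes: `Plan`, `PlanStep`, `Refinement`, and `applyRefinements` are hypothetical and much simpler than the real schema.

```typescript
// Hypothetical, simplified shapes; the real plan schema is richer.
interface PlanStep {
  id: string;
  candidates: Record<string, unknown>[];
}

interface Plan {
  flow: PlanStep[];
}

interface Refinement {
  stepId: string;
  addCandidate: Record<string, unknown>;
}

// Returns a copy of the plan with suggested candidates appended to matching steps.
// The input plan and its arrays are left untouched.
function applyRefinements(plan: Plan, refinements: readonly Refinement[]): Plan {
  return {
    ...plan,
    flow: plan.flow.map((step) => ({
      ...step,
      candidates: [
        ...step.candidates,
        ...refinements.filter((r) => r.stepId === step.id).map((r) => r.addCandidate),
      ],
    })),
  };
}
```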
Plan mutation is explicit:
agent-check refine <runId> --plan path\to\plan.yaml --apply

| Command | Purpose |
|---|---|
| `validate <plan.yaml>` | Parse and validate a testing plan. |
| `run <plan.yaml>` | Execute a plan and write run artifacts. |
| `refine <runId>` | Inspect or apply stored refinement suggestions. |
| `providers` | List registered provider capabilities. |
| `doctor` | Inspect config, environment, and runtime readiness. |
The CLI is a thin layer over the library. Library users can import the parser, runner, providers, and config helpers directly.
Current built-ins:
| Provider id | Surface | Role |
|---|---|---|
| `mock` | `mock` | Deterministic fixture provider for tests and examples. |
| `web-playwright` | `web` | Browser provider for websites, including headed Chrome. |
| `tui-process` | `tui` | Process-backed terminal provider. |
| `electron-playwright` | `electron` | Electron renderer provider. |
| `windows-desktop` | `desktop` | Initial Windows UI Automation provider. |
Provider details live in PROVIDERS.md.
The test suite covers:
- schema validation
- banned provider-specific engine names inside plans
- mixed candidate levels
- passing and failing mock runs
- stop-on-failure behavior
- provider launch/setup failures
- failure classification
- LLM candidate resolution
- LLM assertion judgement
- controlled handling of invalid or low-confidence LLM output
Run:
npm test

Good extension points:
- add a new provider in `src/providers`
- implement a custom provider through the library API
- add new provider-neutral candidate fields to the schema
- add new artifact kinds
- add another Vercel AI SDK model prefix
- add richer refinement application rules
Avoid:
- putting concrete engine names into plans
- making providers author plans
- treating LLM output as unverified execution success
- mutating plans automatically during `run`