LLM Inference

Agent Check uses LLM inference as a runtime assistant for executing an existing testing plan. It does not use the LLM to author the plan.

The durable artifact remains the provider-neutral YAML plan. LLM calls happen during a run when the runner needs semantic interpretation, judgement, classification, or refinement suggestions.

Where It Fits

LLM inference is used for four runtime tasks:

| Task | When it happens | Output |
| --- | --- | --- |
| Candidate resolution | A candidate is semantic, task, intent, or visual. | A provider-neutral executable candidate, such as exact or structural. |
| Assertion judgement | Provider verification cannot deterministically decide a semantic or visual condition. | A pass/fail judgement with confidence and reason. |
| Failure classification | A step fails and needs a final class. | app_bug, plan_unclear, or environment. |
| Refinement suggestions | A run discovers better anchors or clearer plan wording. | Provider-neutral refinements stored in run artifacts. |

The provider still performs the actual app work. The LLM does not call Playwright, terminal APIs, desktop automation APIs, or app-specific code.

[Figure: Agent Check progressive hardening loop]

Model Configuration

Runtime config lives outside the plan in agent-check.config.yaml:

llm:
  enabled: true
  model: codex-cli/gpt-5.5

Model precedence is:

--model > AI_MODEL from .env/environment > agent-check.config.yaml
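
The precedence rule above can be sketched as a small lookup. This is a minimal illustration, not Agent Check's actual code: `ModelSources` and `resolveModel` are hypothetical names, and treating blank strings as unset is an assumption.

```typescript
// Hypothetical helper; names are illustrative and not taken from Agent Check.
interface ModelSources {
  cliModel?: string;    // value passed with --model
  envModel?: string;    // AI_MODEL from .env or the environment
  configModel?: string; // llm.model from agent-check.config.yaml
}

// Return the first usable source in precedence order:
// --model > AI_MODEL > agent-check.config.yaml.
// Assumption: blank strings count as "not set".
function resolveModel(sources: ModelSources): string | undefined {
  const ordered = [sources.cliModel, sources.envModel, sources.configModel];
  return ordered.find((value) => value !== undefined && value.trim() !== "");
}

// No --model was passed, so AI_MODEL takes priority over the config file.
console.log(
  resolveModel({ envModel: "zhipu/glm-4.7", configModel: "codex-cli/gpt-5.5" })
);
```

Keeping the sources in one ordered array makes the precedence visible in a single place, so adding another source later is a one-line change.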

Disable LLM calls for deterministic runs:

agent-check run examples\web-headed-exact-candidates.plan.yaml --no-llm

Run with an explicit model:

agent-check run examples\web-headed-semantic-llm.plan.yaml --config agent-check.config.yaml --model zhipu/glm-4.7 --run-id web-semantic-zhipu

.env

The CLI imports dotenv/config, so a .env file in the working directory is loaded automatically.

Example:

AI_MODEL=zhipu/glm-4.7
AI_LLM_TIMEOUT_MS=60000
ZHIPU_API_KEY=...

Do not commit .env or API keys.

Supported Prefixes

Agent Check currently routes model strings by prefix:

| Prefix | Backend | Notes |
| --- | --- | --- |
| codex-cli/ | Codex CLI community provider | Uses the locally configured Codex CLI account. |
| zhipu/ | Zhipu AI / BigModel | Uses ZHIPU_API_KEY. |
| zai/ | Z.AI endpoint | Uses ZHIPU_API_KEY and https://api.z.ai/api/paas/v4. |

Examples:

agent-check run examples\web-headed-semantic-llm.plan.yaml --model codex-cli/gpt-5.5
agent-check run examples\web-headed-semantic-llm.plan.yaml --model zhipu/glm-4.7
agent-check run examples\web-headed-semantic-llm.plan.yaml --model zai/glm-4.7

Unsupported prefixes fail as controlled LLM provider errors and are classified as environment when they block a run.
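
Prefix routing of this kind might look like the sketch below. It is an assumption-laden illustration: `routeModel`, `LlmProviderError`, and the `failureClass` field are hypothetical names, and the real routing code may be organized differently.

```typescript
// Illustrative sketch only; not the actual Agent Check implementation.
type Backend = "codex-cli" | "zhipu" | "zai";

const PREFIX_TABLE: ReadonlyArray<readonly [string, Backend]> = [
  ["codex-cli/", "codex-cli"],
  ["zhipu/", "zhipu"],
  ["zai/", "zai"],
];

// Hypothetical stand-in for a "controlled LLM provider error".
class LlmProviderError extends Error {
  readonly failureClass = "environment";
}

// Split a model string such as "zhipu/glm-4.7" into backend and model id.
function routeModel(model: string): { backend: Backend; modelId: string } {
  for (const [prefix, backend] of PREFIX_TABLE) {
    if (model.startsWith(prefix)) {
      return { backend, modelId: model.slice(prefix.length) };
    }
  }
  // Unknown prefixes surface as an environment problem, not an app bug.
  throw new LlmProviderError(`Unsupported model prefix: ${model}`);
}
```

Throwing a dedicated error type lets a caller tell a misconfigured model apart from a failure in the app under test.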

Environment Variables

Common variables:

| Variable | Purpose |
| --- | --- |
| AI_MODEL | Default model when --model is not passed. |
| AI_LLM_TIMEOUT_MS | Timeout for each LLM request. |
| ZHIPU_API_KEY | API key for zhipu/ and zai/ models. |
| ZHIPU_BASE_URL | Optional override for the Zhipu provider base URL. |

Codex CLI provider variables:

| Variable | Purpose |
| --- | --- |
| CODEX_CLI_PATH | Optional path to the Codex CLI executable. |
| CODEX_CLI_REASONING_EFFORT | Reasoning effort passed to the provider. |
| CODEX_CLI_APPROVAL_MODE | Approval mode passed to the provider. |
| CODEX_CLI_SANDBOX_MODE | Sandbox mode passed to the provider. |
| CODEX_CLI_VERBOSE | Enables provider verbosity when supported. |

Candidate Resolution

When a step has a high-level candidate, the runner observes the app first and then asks the LLM to resolve the candidate into a provider-neutral executable candidate.

For the full candidate model and guidance on when to use each level, see CANDIDATES.md.

app observation
  + plan context
  + step goal
  + operation
  + candidate
  + provider capabilities
  -> resolved candidate

Example input candidate:

semantic:
  instruction: Find the project name field.

Possible resolved candidate:

exact:
  stableId: project-name-input

or:

structural:
  role: textbox
  name: Project name

The resolved candidate is then passed to the provider. The provider decides whether it can handle it and performs the operation.
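
One way to picture the hand-off is as a typed request plus a guard that only accepts executable candidates from the LLM response. The sketch below is hypothetical: only the `semantic`, `exact`, and `structural` shapes come from the examples above, while `ResolutionRequest`, its field names, and `parseResolvedCandidate` are assumptions made for illustration.

```typescript
// Hypothetical shapes; only semantic/exact/structural mirror this document.
type HighLevelCandidate = { semantic: { instruction: string } };

type ExecutableCandidate =
  | { exact: { stableId: string } }
  | { structural: { role: string; name: string } };

// Mirrors the inputs listed in the diagram above (field names are assumed).
interface ResolutionRequest {
  observation: string;
  planContext: string;
  stepGoal: string;
  operation: string;
  candidate: HighLevelCandidate;
  providerCapabilities: string[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Narrow an untrusted LLM response to an executable candidate, or null.
function parseResolvedCandidate(raw: unknown): ExecutableCandidate | null {
  if (!isRecord(raw)) return null;
  const { exact, structural } = raw;
  if (isRecord(exact)) {
    const { stableId } = exact;
    if (typeof stableId === "string") return { exact: { stableId } };
  }
  if (isRecord(structural)) {
    const { role, name } = structural;
    if (typeof role === "string" && typeof name === "string") {
      return { structural: { role, name } };
    }
  }
  return null; // still high-level or malformed: nothing to hand to the provider
}

const exampleRequest: ResolutionRequest = {
  observation: "<serialized app observation>",
  planContext: "Create a project",
  stepGoal: "Enter the project name",
  operation: "fill",
  candidate: { semantic: { instruction: "Find the project name field." } },
  providerCapabilities: ["exact", "structural"],
};
```

Validating the response before it reaches the provider keeps a malformed or still-semantic answer from being treated as executable.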

Assertion Judgement

Semantic and visual assertions can be judged by the LLM when a provider cannot deterministically prove them.

success:
  any:
    - semantic:
        intent: The project was created successfully.

The runner provides the current observation and asks for a structured judgement:

  • passed
  • confidence
  • message
  • optional evidence notes

A low-confidence judgement is treated as a failed verification rather than a silent pass.
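
The confidence gate can be sketched as follows. The field names follow the list above, but the `evidence` field name, the 0 to 1 confidence scale, the `MIN_CONFIDENCE` value, and the `verificationPassed` helper are all assumptions; the document does not state the real threshold.

```typescript
// Structured judgement as described above; evidence is optional.
interface AssertionJudgement {
  passed: boolean;
  confidence: number; // assumed to lie in [0, 1]
  message: string;
  evidence?: string[];
}

// Assumed cut-off for illustration; the real threshold is not documented here.
const MIN_CONFIDENCE = 0.7;

// A judgement only counts as a pass when the model says "passed"
// AND is confident enough. Everything else is a failed verification.
function verificationPassed(
  judgement: AssertionJudgement,
  minConfidence: number = MIN_CONFIDENCE
): boolean {
  return judgement.passed && judgement.confidence >= minConfidence;
}
```

Folding the confidence check into the pass condition means an unsure "passed" can never be mistaken for a verified success.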

Failure Classification

The runner emits exactly one failure class for each failed step:

| Class | Meaning |
| --- | --- |
| app_bug | The app did not meet a user-visible acceptance criterion. |
| plan_unclear | The plan is ambiguous or underspecified. |
| environment | Provider, app, browser, terminal, desktop session, or LLM backend failed. |

If the LLM backend itself fails, the runner records that as an environment problem. For example, an expired account, missing API key, unsupported model prefix, or request timeout should not be reported as an app bug.
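
The rule that backend failures map to environment can be expressed with a discriminated union. This is only a sketch: `ClassificationOutcome` and `finalFailureClass` are hypothetical names, and the real runner may carry more detail.

```typescript
type FailureClass = "app_bug" | "plan_unclear" | "environment";

// Hypothetical outcome of asking the LLM to classify a failed step.
type ClassificationOutcome =
  | { kind: "classified"; failureClass: FailureClass }
  | { kind: "llm_error"; reason: string }; // e.g. missing API key, timeout

// If the LLM answered, use its class; if the backend itself failed,
// record an environment problem instead of blaming the app.
function finalFailureClass(outcome: ClassificationOutcome): FailureClass {
  switch (outcome.kind) {
    case "classified":
      return outcome.failureClass;
    case "llm_error":
      return "environment";
  }
}
```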

Trace Files

Every run writes:

.agent-check/runs/<runId>/llm-trace.json

This file is the easiest way to confirm that the LLM was actually used.

Event types include:

| Event | Meaning |
| --- | --- |
| candidate_resolution | A high-level candidate was resolved or failed to resolve. |
| assertion_judgement | A semantic or visual assertion was judged. |
| failure_classification | A failed step was classified. |

Statuses include:

| Status | Meaning |
| --- | --- |
| used | The LLM returned a structured result. |
| unavailable | The LLM was needed but unavailable or failed. |
| ignored | The LLM result was not accepted, usually because confidence was too low. |

Example excerpt:

[
  {
    "event": "candidate_resolution",
    "status": "used",
    "stepId": "fill_project_name",
    "model": "zhipu/glm-4.7",
    "resolvedCandidate": {
      "exact": {
        "stableId": "project-name-input"
      }
    },
    "confidence": 1,
    "reason": "The observation contains a stable project-name input."
  }
]
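
To check a trace quickly, the events can be tallied by status. The sketch below assumes the file is a JSON array shaped like the excerpt above; `TraceEvent` lists only the fields it needs and `countByStatus` is a hypothetical helper, not part of Agent Check.

```typescript
type TraceStatus = "used" | "unavailable" | "ignored";

// Minimal view of a trace event; real events carry more fields.
interface TraceEvent {
  event: string;
  status: TraceStatus;
  stepId?: string;
  model?: string;
}

// Count events per status from the raw contents of llm-trace.json.
function countByStatus(traceJson: string): Record<TraceStatus, number> {
  const events = JSON.parse(traceJson) as TraceEvent[];
  const counts: Record<TraceStatus, number> = { used: 0, unavailable: 0, ignored: 0 };
  for (const traceEvent of events) {
    // Skip statuses this sketch does not know about.
    if (Object.prototype.hasOwnProperty.call(counts, traceEvent.status)) {
      counts[traceEvent.status] += 1;
    }
  }
  return counts;
}

const sampleTrace = JSON.stringify([
  { event: "candidate_resolution", status: "used", stepId: "fill_project_name" },
  { event: "assertion_judgement", status: "ignored", stepId: "verify_created" },
]);
console.log(countByStatus(sampleTrace));
```

A non-zero `used` count confirms that the LLM contributed to the run; in practice the string would come from reading `.agent-check/runs/<runId>/llm-trace.json`.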

Implementation Files

Relevant files:

| File | Role |
| --- | --- |
| src/llm/llmClient.ts | Provider-neutral LLM interface and structured result schemas. |
| src/llm/vercelAiLlmClient.ts | Vercel AI SDK backed implementation. |
| src/core/runner.ts | Decides when to call the LLM and records trace events. |
| src/cli/index.ts | Loads .env, reads model config, and wires the client into runs. |

The Vercel AI SDK is the abstraction layer. OpenAI, Codex CLI, Zhipu, Z.AI, or future providers are runtime model choices, not testing plan concepts.