Agent Check uses LLM inference as a runtime assistant for executing an existing testing plan. It does not use the LLM to author the plan.
The durable artifact remains the provider-neutral YAML plan. LLM calls happen during a run when the runner needs semantic interpretation, judgement, classification, or refinement suggestions.
LLM inference is used for four runtime tasks:
| Task | When it happens | Output |
|---|---|---|
| Candidate resolution | A candidate is semantic, task, intent, or visual. | A provider-neutral executable candidate, such as exact or structural. |
| Assertion judgement | Provider verification cannot deterministically decide a semantic or visual condition. | A pass/fail judgement with confidence and reason. |
| Failure classification | A step fails and needs a final class. | app_bug, plan_unclear, or environment. |
| Refinement suggestions | A run discovers better anchors or clearer plan wording. | Provider-neutral refinements stored in run artifacts. |
The provider still performs the actual app work. The LLM does not call Playwright, terminal APIs, desktop automation APIs, or app-specific code.
Runtime config lives outside the plan in agent-check.config.yaml:

    llm:
      enabled: true
      model: codex-cli/gpt-5.5

Model precedence is:

    --model > AI_MODEL from .env/environment > agent-check.config.yaml
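
The precedence rule can be read as "first configured source wins". The sketch below illustrates that reading only. `ModelSources` and `resolveModel` are hypothetical names, not part of the Agent Check codebase, and treating an empty string as "not set" is an assumption.

```typescript
// Hypothetical helper illustrating the documented precedence order.
interface ModelSources {
  cliModel?: string;    // value of --model, if passed
  envModel?: string;    // AI_MODEL from .env or the environment
  configModel?: string; // llm.model from agent-check.config.yaml
}

// Returns the first non-empty source in precedence order, or undefined.
// Assumption: an empty or whitespace-only value counts as "not set".
function resolveModel(sources: ModelSources): string | undefined {
  const ordered = [sources.cliModel, sources.envModel, sources.configModel];
  return ordered.find((value) => value !== undefined && value.trim() !== "");
}

const chosen = resolveModel({
  envModel: "zhipu/glm-4.7",
  configModel: "codex-cli/gpt-5.5",
});
console.log(chosen); // zhipu/glm-4.7
```

With no `--model` flag, the environment value overrides the config file, which matches the order listed above.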
Disable LLM calls for deterministic runs:

    agent-check run examples\web-headed-exact-candidates.plan.yaml --no-llm

Run with an explicit model:

    agent-check run examples\web-headed-semantic-llm.plan.yaml --config agent-check.config.yaml --model zhipu/glm-4.7 --run-id web-semantic-zhipu

The CLI imports dotenv/config, so a .env file in the working directory is loaded automatically.
Example:

    AI_MODEL=zhipu/glm-4.7
    AI_LLM_TIMEOUT_MS=60000
    ZHIPU_API_KEY=...

Do not commit .env or API keys.
Agent Check currently routes model strings by prefix:
| Prefix | Backend | Notes |
|---|---|---|
| codex-cli/ | Codex CLI community provider | Uses the locally configured Codex CLI account. |
| zhipu/ | Zhipu AI / BigModel | Uses ZHIPU_API_KEY. |
| zai/ | Z.AI endpoint | Uses ZHIPU_API_KEY and https://api.z.ai/api/paas/v4. |
Examples:

    agent-check run examples\web-headed-semantic-llm.plan.yaml --model codex-cli/gpt-5.5
    agent-check run examples\web-headed-semantic-llm.plan.yaml --model zhipu/glm-4.7
    agent-check run examples\web-headed-semantic-llm.plan.yaml --model zai/glm-4.7

Unsupported prefixes fail as controlled LLM provider errors and are classified as environment when they block a run.
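
Prefix routing can be pictured as splitting the model string at the first slash and looking up the prefix. This is a minimal sketch under that assumption. `routeModel`, `RoutedModel`, and `LlmProviderError` are illustrative names, not the actual Agent Check API.

```typescript
// Hypothetical sketch of prefix-based routing; all names are illustrative.
type Backend = "codex-cli" | "zhipu" | "zai";

// A dedicated error type, so callers can tell a controlled provider error
// apart from an unexpected crash.
class LlmProviderError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "LlmProviderError";
  }
}

interface RoutedModel {
  backend: Backend;
  modelId: string;
}

const KNOWN_BACKENDS: readonly Backend[] = ["codex-cli", "zhipu", "zai"];

// Splits "prefix/model" at the first slash and validates the prefix.
function routeModel(model: string): RoutedModel {
  const slash = model.indexOf("/");
  const prefix = slash === -1 ? "" : model.slice(0, slash);
  const modelId = slash === -1 ? "" : model.slice(slash + 1);
  const backend = KNOWN_BACKENDS.find((known) => known === prefix);
  if (backend === undefined || modelId === "") {
    throw new LlmProviderError(`Unsupported model string: ${model}`);
  }
  return { backend, modelId };
}

console.log(routeModel("zai/glm-4.7")); // { backend: 'zai', modelId: 'glm-4.7' }
```

Throwing a named error keeps the failure controlled, so a caller could map it to the environment class rather than treating it as an app problem.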
Common variables:
| Variable | Purpose |
|---|---|
| AI_MODEL | Default model when --model is not passed. |
| AI_LLM_TIMEOUT_MS | Timeout for each LLM request. |
| ZHIPU_API_KEY | API key for zhipu/ and zai/ models. |
| ZHIPU_BASE_URL | Optional override for the Zhipu provider base URL. |
Codex CLI provider variables:
| Variable | Purpose |
|---|---|
| CODEX_CLI_PATH | Optional path to the Codex CLI executable. |
| CODEX_CLI_REASONING_EFFORT | Reasoning effort passed to the provider. |
| CODEX_CLI_APPROVAL_MODE | Approval mode passed to the provider. |
| CODEX_CLI_SANDBOX_MODE | Sandbox mode passed to the provider. |
| CODEX_CLI_VERBOSE | Enables provider verbosity when supported. |
When a step has a high-level candidate, the runner observes the app first and then asks the LLM to resolve the candidate into a provider-neutral executable candidate.
For the full candidate model and guidance on when to use each level, see CANDIDATES.md.

    app observation
    + plan context
    + step goal
    + operation
    + candidate
    + provider capabilities
    -> resolved candidate

Example input candidate:

    semantic:
      instruction: Find the project name field.

Possible resolved candidate:

    exact:
      stableId: project-name-input

or:

    structural:
      role: textbox
      name: Project name

The resolved candidate is then passed to the provider. The provider decides whether it can handle it and performs the operation.
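
As an illustration of when resolution is needed, the sketch below checks whether a candidate uses one of the high-level kinds named in this document (semantic, task, intent, visual). The `Candidate` shape and `needsLlmResolution` are hypothetical and only mirror the YAML examples. The real candidate model is described in CANDIDATES.md.

```typescript
// Hypothetical types: the keys mirror the YAML examples in this document.
type CandidateKind =
  | "exact"
  | "structural"
  | "semantic"
  | "task"
  | "intent"
  | "visual";

// A candidate is modelled here as an object keyed by its kind.
type Candidate = Partial<Record<CandidateKind, Record<string, unknown>>>;

// Kinds this document describes as requiring LLM resolution.
const HIGH_LEVEL: readonly CandidateKind[] = ["semantic", "task", "intent", "visual"];

// True when the candidate must be resolved before a provider can execute it.
function needsLlmResolution(candidate: Candidate): boolean {
  return HIGH_LEVEL.some((kind) => candidate[kind] !== undefined);
}

const input: Candidate = { semantic: { instruction: "Find the project name field." } };
const resolved: Candidate = { exact: { stableId: "project-name-input" } };

console.log(needsLlmResolution(input));    // true
console.log(needsLlmResolution(resolved)); // false
```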
Semantic and visual assertions can be judged by the LLM when a provider cannot deterministically prove them.

    success:
      any:
        - semantic:
            intent: The project was created successfully.

The runner provides the current observation and asks for a structured judgement:

- passed
- confidence
- message
- optional evidence notes

A low-confidence judgement is treated as a failed verification rather than a silent pass.
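
The low-confidence rule can be sketched as a threshold check applied before the judgement is trusted. The threshold value of 0.7, the `evidence` field name, and the `applyJudgement` function are assumptions for illustration. The document does not state the actual cutoff.

```typescript
// Structured judgement, using the field names listed above.
// The `evidence` field name is an assumption for the optional evidence notes.
interface Judgement {
  passed: boolean;
  confidence: number; // assumed to be in the range 0..1
  message: string;
  evidence?: string[];
}

// Assumed cutoff for illustration; the real value is not documented here.
const MIN_CONFIDENCE = 0.7;

// Returns true only when the LLM both passed the assertion and was
// confident enough. A confident "fail" and any low-confidence result
// are both treated as a failed verification.
function applyJudgement(judgement: Judgement, minConfidence = MIN_CONFIDENCE): boolean {
  if (judgement.confidence < minConfidence) {
    return false; // never a silent pass
  }
  return judgement.passed;
}

console.log(applyJudgement({ passed: true, confidence: 0.95, message: "Success banner visible." })); // true
console.log(applyJudgement({ passed: true, confidence: 0.4, message: "Possibly created." }));        // false
```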
The runner emits one failure class:
| Class | Meaning |
|---|---|
| app_bug | The app did not meet a user-visible acceptance criterion. |
| plan_unclear | The plan is ambiguous or underspecified. |
| environment | Provider, app, browser, terminal, desktop session, or LLM backend failed. |
If the LLM backend itself fails, the runner records that as an environment problem. For example, an expired account, missing API key, unsupported model prefix, or request timeout should not be reported as an app bug.
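
One way to picture this rule is a short-circuit: when the failure came from the LLM backend, the class is fixed to environment and the LLM is not asked to classify. The sketch below assumes hypothetical names (`StepFailure`, `LlmBackendFailure`, `classifyFailure`). The backend failure kinds are taken from the examples in the paragraph above.

```typescript
type FailureClass = "app_bug" | "plan_unclear" | "environment";

// Hypothetical kinds of LLM backend failure, taken from the examples above.
type LlmBackendFailure =
  | "expired_account"
  | "missing_api_key"
  | "unsupported_prefix"
  | "request_timeout";

interface StepFailure {
  message: string;
  // Set only when the LLM backend itself failed during the step.
  llmBackendFailure?: LlmBackendFailure;
}

// `askLlm` stands in for the LLM classification call. It is only invoked
// when the backend is not the thing that failed.
function classifyFailure(failure: StepFailure, askLlm: () => FailureClass): FailureClass {
  if (failure.llmBackendFailure !== undefined) {
    return "environment";
  }
  return askLlm();
}

const timedOut: StepFailure = { message: "LLM request timed out", llmBackendFailure: "request_timeout" };
console.log(classifyFailure(timedOut, () => "app_bug")); // environment
```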
Every run writes:

    .agent-check/runs/<runId>/llm-trace.json

This file is the easiest way to confirm that the LLM was actually used.
Event types include:
| Event | Meaning |
|---|---|
| candidate_resolution | A high-level candidate was resolved or failed to resolve. |
| assertion_judgement | A semantic or visual assertion was judged. |
| failure_classification | A failed step was classified. |
Statuses include:
| Status | Meaning |
|---|---|
| used | The LLM returned a structured result. |
| unavailable | The LLM was needed but unavailable or failed. |
| ignored | The LLM result was not accepted, usually because confidence was too low. |
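
To confirm LLM usage from a trace, the events can be tallied by status once the file has been parsed. This is a sketch only. `TraceEvent` models just the fields discussed in this document, and `countByStatus` and `llmWasUsed` are hypothetical helpers that operate on an already-parsed array.

```typescript
type TraceEventType = "candidate_resolution" | "assertion_judgement" | "failure_classification";
type TraceStatus = "used" | "unavailable" | "ignored";

// Minimal view of a trace entry; real entries may carry more fields.
interface TraceEvent {
  event: TraceEventType;
  status: TraceStatus;
  stepId?: string;
  model?: string;
}

// Counts how many events ended in each status.
function countByStatus(events: readonly TraceEvent[]): Record<TraceStatus, number> {
  const counts: Record<TraceStatus, number> = { used: 0, unavailable: 0, ignored: 0 };
  for (const entry of events) {
    counts[entry.status] += 1;
  }
  return counts;
}

// True when at least one LLM result was returned and accepted.
function llmWasUsed(events: readonly TraceEvent[]): boolean {
  return events.some((entry) => entry.status === "used");
}

const sample: TraceEvent[] = [
  { event: "candidate_resolution", status: "used", stepId: "fill_project_name" },
  { event: "assertion_judgement", status: "ignored" },
];
console.log(countByStatus(sample)); // { used: 1, unavailable: 0, ignored: 1 }
console.log(llmWasUsed(sample));    // true
```

In practice the array would come from parsing llm-trace.json for the run in question.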
Example excerpt:

    [
      {
        "event": "candidate_resolution",
        "status": "used",
        "stepId": "fill_project_name",
        "model": "zhipu/glm-4.7",
        "resolvedCandidate": {
          "exact": {
            "stableId": "project-name-input"
          }
        },
        "confidence": 1,
        "reason": "The observation contains a stable project-name input."
      }
    ]

Relevant files:
| File | Role |
|---|---|
| src/llm/llmClient.ts | Provider-neutral LLM interface and structured result schemas. |
| src/llm/vercelAiLlmClient.ts | Vercel AI SDK backed implementation. |
| src/core/runner.ts | Decides when to call the LLM and records trace events. |
| src/cli/index.ts | Loads .env, reads model config, and wires the client into runs. |
The Vercel AI SDK is the abstraction layer. OpenAI, Codex CLI, Zhipu, Z.AI, or future providers are runtime model choices, not testing plan concepts.