-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Summary
Timeout control is currently awkwardly split across:
- target config (
targets.yaml->timeout_seconds) - CLI flag (
agentv eval --agent-timeout) - post-hoc latency/duration evaluators (
latency,execution-metrics.max_duration_ms)
This makes it hard to express evaluation intent at the test level.
Problem
For eval authors, the natural place to say:
- "this test may take up to 1 hour"
- "this test should be terminated after 10 minutes"
- "this test may complete after 10 minutes, but should be scored down if it exceeds 2 minutes"
is the eval YAML test itself, not the target definition.
Today:
timeout_secondson targets applies provider-wide and mixes transport/runtime concerns with model identity--agent-timeoutis run-level, not test-levellatency/execution-metrics.max_duration_msare evaluation-only; they do not terminate the run
Request
Add first-class per-test execution timeout support in eval YAML, and optionally a separate hard-stop latency threshold.
Example shape:
tests:
- id: slow-refactor
timeout_seconds: 3600 # allow up to 1 hour for this test
evaluators:
- type: execution-metrics
max_duration_ms: 300000 # soft scoring threshold (5 min)Or, if you want to separate soft vs hard explicitly:
tests:
- id: slow-refactor
timeout_seconds: 3600 # hard execution cap
max_execution_ms: 600000 # optional hard-stop alias / explicit name
evaluators:
- type: latency
threshold: 120000 # soft scoring thresholdWhy this helps
- Keeps target definitions focused on provider identity/config rather than eval policy.
- Lets different tests in the same suite have different execution budgets.
- Avoids duplicating targets just to vary timeout.
- Distinguishes clearly between:
- hard termination policy
- soft latency scoring
Additional suggestion
If there is concern about naming overlap, I think the conceptual split should be:
- hard stop: terminate execution if threshold exceeded
- soft latency judge: score based on observed duration after completion
That would make the existing latency/duration evaluators complementary instead of overloaded.
Current state observed
Relevant existing pieces:
- target-level timeout parsing in
packages/core/src/evaluation/providers/targets.ts - CLI/global timeout via
agentv eval --agent-timeout - post-hoc duration scoring in:
packages/core/src/evaluation/evaluators/latency.tspackages/core/src/evaluation/evaluators/execution-metrics.ts
So this seems like a missing YAML execution-policy layer rather than missing timing primitives.
Happy to help sketch schema/API direction if useful.