
Feature request: support per-test execution timeouts and optional hard-stop latency thresholds #518

@christso


Summary

Timeout control is currently split awkwardly across three places:

  • target config (targets.yaml -> timeout_seconds)
  • CLI flag (agentv eval --agent-timeout)
  • post-hoc latency/duration evaluators (latency, execution-metrics.max_duration_ms)

This makes it hard to express evaluation intent at the test level.

Problem

For eval authors, the natural place to say:

  • "this test may take up to 1 hour"
  • "this test should be terminated after 10 minutes"
  • "this test may complete after 10 minutes, but should be scored down if it exceeds 2 minutes"

is the eval YAML test itself, not the target definition.

Today:

  • timeout_seconds on targets applies provider-wide and mixes transport/runtime concerns with model identity
  • --agent-timeout is run-level, not test-level
  • latency / execution-metrics.max_duration_ms are evaluation-only; they do not terminate the run
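
For comparison, the hard-stop behavior the last bullet says is missing could look something like this. This is a hypothetical sketch, not the project's real API: `run` stands in for whatever executes an agent turn, and the error message is illustrative.

```typescript
// Hypothetical sketch: enforce a per-test hard stop around an agent call.
// Unlike the latency/execution-metrics evaluators, this actually terminates
// (or at least abandons) the run when the budget is exceeded.
async function runWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  // Reject when the abort fires, so even a runner that ignores the
  // signal cannot keep the test alive past the hard cap.
  const hardStop = new Promise<never>((_, reject) => {
    controller.signal.addEventListener("abort", () =>
      reject(new Error(`test exceeded hard timeout of ${timeoutMs} ms`)),
    );
  });
  try {
    return await Promise.race([run(controller.signal), hardStop]);
  } finally {
    clearTimeout(timer);
  }
}
```

A well-behaved runner would also honor the `AbortSignal` to cancel in-flight work, rather than merely being abandoned by the race.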

Request

Add first-class per-test execution timeout support in eval YAML, and optionally a separate hard-stop latency threshold.

Example shape:

```yaml
tests:
  - id: slow-refactor
    timeout_seconds: 3600   # allow up to 1 hour for this test
    evaluators:
      - type: execution-metrics
        max_duration_ms: 300000   # soft scoring threshold (5 min)
```

Or, if you want to separate soft vs hard explicitly:

```yaml
tests:
  - id: slow-refactor
    timeout_seconds: 3600          # hard execution cap
    max_execution_ms: 600000       # optional hard-stop alias / explicit name
    evaluators:
      - type: latency
        threshold: 120000          # soft scoring threshold
```
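
With three possible sources for a timeout, the obvious precedence is most-specific-wins: test-level overrides the CLI flag, which overrides the target default. A sketch, with field names that are assumptions rather than the real schema:

```typescript
// Hypothetical precedence sketch: test > CLI > target. None of these
// field names are confirmed against the actual agentv config schema.
interface TimeoutSources {
  testTimeoutSeconds?: number;     // from the test entry in eval YAML (proposed)
  cliAgentTimeoutSeconds?: number; // from `agentv eval --agent-timeout`
  targetTimeoutSeconds?: number;   // from targets.yaml timeout_seconds
}

function resolveTimeoutMs(src: TimeoutSources): number | undefined {
  const seconds =
    src.testTimeoutSeconds ??
    src.cliAgentTimeoutSeconds ??
    src.targetTimeoutSeconds;
  return seconds === undefined ? undefined : seconds * 1000;
}
```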

Why this helps

  1. Keeps target definitions focused on provider identity/config rather than eval policy.
  2. Lets different tests in the same suite have different execution budgets.
  3. Avoids duplicating targets just to vary timeout.
  4. Distinguishes clearly between:
    • hard termination policy
    • soft latency scoring

Additional suggestion

If there is concern about naming overlap, I think the conceptual split should be:

  • hard stop: terminate execution if threshold exceeded
  • soft latency judge: score based on observed duration after completion

That would make the existing latency/duration evaluators complementary instead of overloaded.
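
To make the soft half of that split concrete, one possible scoring curve: full credit up to the soft threshold, decaying linearly to zero at the hard cap. The curve shape is an illustration, not the existing `latency` evaluator's behavior.

```typescript
// Hypothetical soft-latency judge: 1.0 up to softMs, linear decay to 0
// at hardMs. Runs that hit hardMs would have been terminated anyway.
function scoreLatency(durationMs: number, softMs: number, hardMs: number): number {
  if (durationMs <= softMs) return 1;
  if (durationMs >= hardMs) return 0;
  return (hardMs - durationMs) / (hardMs - softMs);
}
```

E.g. with a 2-minute soft threshold and a 10-minute hard cap, a 6-minute run scores 0.5.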

Current state observed

Relevant existing pieces:

  • target-level timeout parsing in packages/core/src/evaluation/providers/targets.ts
  • CLI/global timeout via agentv eval --agent-timeout
  • post-hoc duration scoring in:
    • packages/core/src/evaluation/evaluators/latency.ts
    • packages/core/src/evaluation/evaluators/execution-metrics.ts

So this seems like a missing YAML execution-policy layer rather than missing timing primitives.

Happy to help sketch schema/API direction if useful.
