
Feature request: support per-test execution timeouts and optional hard-stop latency thresholds #518

@christso


Summary

Timeout control is currently split awkwardly across three places:

  • target config (targets.yaml -> timeout_seconds)
  • CLI flag (agentv eval --agent-timeout)
  • post-hoc latency/duration evaluators (latency, execution-metrics.max_duration_ms)

This makes it hard to express evaluation intent at the test level.

Problem

For eval authors, the natural place to say:

  • "this test may take up to 1 hour"
  • "this test should be terminated after 10 minutes"
  • "this test may complete after 10 minutes, but should be scored down if it exceeds 2 minutes"

is the eval YAML test itself, not the target definition.

Today:

  • timeout_seconds on targets applies provider-wide and mixes transport/runtime concerns with model identity
  • --agent-timeout is run-level, not test-level
  • latency / execution-metrics.max_duration_ms are evaluation-only; they do not terminate the run
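
For comparison, the hard-stop behavior the last bullet says is missing could look something like this. This is a hypothetical sketch, not the project's real API: `run` stands in for whatever executes an agent turn, and the error message is illustrative.

```typescript
// Hypothetical sketch: enforce a per-test hard stop around an agent call.
// Unlike the latency/execution-metrics evaluators, this actually terminates
// (or at least abandons) the run when the budget is exceeded.
async function runWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  // Reject when the abort fires, so even a runner that ignores the
  // signal cannot keep the test alive past the hard cap.
  const hardStop = new Promise<never>((_, reject) => {
    controller.signal.addEventListener("abort", () =>
      reject(new Error(`test exceeded hard timeout of ${timeoutMs} ms`)),
    );
  });
  try {
    return await Promise.race([run(controller.signal), hardStop]);
  } finally {
    clearTimeout(timer);
  }
}
```

A well-behaved runner would also honor the `AbortSignal` to cancel in-flight work, rather than merely being abandoned by the race.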

Request

Add first-class per-test execution timeout support in eval YAML, and optionally a separate hard-stop latency threshold.

Example shape:

```yaml
tests:
  - id: slow-refactor
    timeout_seconds: 3600   # allow up to 1 hour for this test
    evaluators:
      - type: execution-metrics
        max_duration_ms: 300000   # soft scoring threshold (5 min)
```

Or, if you want to separate soft vs hard explicitly:

```yaml
tests:
  - id: slow-refactor
    timeout_seconds: 3600          # hard execution cap
    max_execution_ms: 600000       # optional hard-stop alias / explicit name
    evaluators:
      - type: latency
        threshold: 120000          # soft scoring threshold
```
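
With three possible sources for a timeout, the obvious precedence is most-specific-wins: test-level overrides the CLI flag, which overrides the target default. A sketch, with field names that are assumptions rather than the real schema:

```typescript
// Hypothetical precedence sketch: test > CLI > target. None of these
// field names are confirmed against the actual agentv config schema.
interface TimeoutSources {
  testTimeoutSeconds?: number;     // from the test entry in eval YAML (proposed)
  cliAgentTimeoutSeconds?: number; // from `agentv eval --agent-timeout`
  targetTimeoutSeconds?: number;   // from targets.yaml timeout_seconds
}

function resolveTimeoutMs(src: TimeoutSources): number | undefined {
  const seconds =
    src.testTimeoutSeconds ??
    src.cliAgentTimeoutSeconds ??
    src.targetTimeoutSeconds;
  return seconds === undefined ? undefined : seconds * 1000;
}
```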

Why this helps

  1. Keeps target definitions focused on provider identity/config rather than eval policy.
  2. Lets different tests in the same suite have different execution budgets.
  3. Avoids duplicating targets just to vary timeout.
  4. Distinguishes clearly between:
    • hard termination policy
    • soft latency scoring

Additional suggestion

If there is concern about naming overlap, I think the conceptual split should be:

  • hard stop: terminate execution if threshold exceeded
  • soft latency judge: score based on observed duration after completion

That would make the existing latency/duration evaluators complementary instead of overloaded.
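
To make the soft half of that split concrete, one possible scoring curve: full credit up to the soft threshold, decaying linearly to zero at the hard cap. The curve shape is an illustration, not the existing `latency` evaluator's behavior.

```typescript
// Hypothetical soft-latency judge: 1.0 up to softMs, linear decay to 0
// at hardMs. Runs that hit hardMs would have been terminated anyway.
function scoreLatency(durationMs: number, softMs: number, hardMs: number): number {
  if (durationMs <= softMs) return 1;
  if (durationMs >= hardMs) return 0;
  return (hardMs - durationMs) / (hardMs - softMs);
}
```

E.g. with a 2-minute soft threshold and a 10-minute hard cap, a 6-minute run scores 0.5.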

Current state observed

Relevant existing pieces:

  • target-level timeout parsing in packages/core/src/evaluation/providers/targets.ts
  • CLI/global timeout via agentv eval --agent-timeout
  • post-hoc duration scoring in:
    • packages/core/src/evaluation/evaluators/latency.ts
    • packages/core/src/evaluation/evaluators/execution-metrics.ts

So this seems like a missing YAML execution-policy layer rather than missing timing primitives.

Happy to help sketch schema/API direction if useful.
