diff --git a/features/remote-evals/params/README.md b/features/remote-evals/params/README.md new file mode 100644 index 0000000..8c41b8a --- /dev/null +++ b/features/remote-evals/params/README.md @@ -0,0 +1,81 @@ +# Remote Eval Parameters: Overview + +## What Are Eval Parameters? + +**Eval parameters** let users configure evaluator behavior from the Braintrust Playground without changing code. Developers declare named parameters in their evaluator -- anything that affects how the eval runs: a model name, a similarity threshold, a feature flag, a service URL, a max output length, etc. The Playground renders these as UI controls (sliders, text inputs, etc.) and passes the user's chosen values to the evaluator when running. + +This makes it easy to compare how a system behaves under different configurations -- for example, running the same test cases with `temperature: 0.2` vs `temperature: 0.9`, or against a staging vs. production endpoint -- without deploying new code. + +## How It Works + +``` + Braintrust Playground Developer's Machine ++----------------------+ +-------------------------+ +| | | Dev Server | +| GET /list | --------------> | | +| | <-------------- | "food-classifier": | +| Render UI controls: | parameters: | parameters: | +| model: [gpt-4 v] | { model: ..., | model: "gpt-4" | +| temp: [0.7 ---] | temp: ... } | temperature: 0.7 | +| | | | +| User changes model | | | +| to "gpt-4o", clicks | POST /eval | | +| "Run" | --------------> | parameters: | +| | parameters: | { model: "gpt-4o", | +| | { model: | temperature: 0.7 } | +| | "gpt-4o" } | | +| Results stream back | <-------------- | task receives params | ++----------------------+ +-------------------------+ +``` + +1. **Declaration**: The developer declares named parameters in the evaluator definition. Each parameter has a name, optional type, default value, and description. +2. **Discovery**: When the Playground fetches `GET /list`, the dev server includes parameter definitions in the response. 
The Playground renders appropriate UI controls for each parameter. +3. **Delivery**: When the user clicks "Run", the Playground sends the current parameter values in the `POST /eval` request body under the `"parameters"` key. +4. **Merging**: The dev server merges request values with evaluator defaults (request overrides defaults). This means parameters not changed by the user still have their default values. +5. **Forwarding**: The merged parameters are forwarded to the task function and all scorer functions as they run. + +## Key Concepts + +**Parameter definition** -- A declaration in the evaluator specifying a parameter's name, default value, and optional metadata (type, description). Defined once in code; used to populate UI controls. + +**Parameter values** -- The runtime values the Playground sends per-run. These override any defaults defined in the evaluator. + +**Backward compatibility** -- Tasks and scorers that do not declare they want parameters must continue to work unchanged. The SDK is responsible for filtering parameters out of function calls to functions that don't expect them. + +## Example + +```pseudocode +# Define an evaluator with parameters +evaluator = Evaluator( + task = (input, parameters) => MyModel.classify(input, model: parameters["model"]), + scorers = [ + Scorer("exact_match", (expected, output) => output == expected ? 1.0 : 0.0) + ], + parameters = { + "model": { type: "model", default: "gpt-4", description: "Model to use" }, + "temperature": { type: "data", default: 0.7, description: "Sampling temperature" } + } +) +``` + +The Playground renders a model picker and temperature input. When the user selects "gpt-4o" and clicks "Run", the task receives `parameters = {"model": "gpt-4o", "temperature": 0.7}` (temperature keeps its default since the user didn't change it). + +## Parameters vs. Input + +**Input** is per-case data — each test case has its own `input` value (e.g., `"apple"`, `"carrot"`). 
It varies case-by-case and represents *what* is being evaluated. + +**Parameters** are per-run configuration — the same values apply to every test case in the run. They represent *how* the evaluator behaves. + +The typical workflow: run the same dataset (same inputs) with different parameter values to compare configurations. For example, run `model: "gpt-4"` and `model: "gpt-4o"` against identical test cases, then compare scores side-by-side in the Playground. + +## Further Reading + +| Document | Purpose | +|----------|---------| +| [design.md](design.md) | End-to-end flow, component roles, and design decisions | +| [contracts.md](contracts.md) | Wire protocol, data types, and API schemas | +| [validation.md](validation.md) | Test scenarios and expected behaviors | + +### Related Specs + +- [Remote Eval Dev Server](../server/README.md) -- The broader remote eval feature this builds on diff --git a/features/remote-evals/params/contracts.md b/features/remote-evals/params/contracts.md new file mode 100644 index 0000000..0769d1c --- /dev/null +++ b/features/remote-evals/params/contracts.md @@ -0,0 +1,176 @@ +# Remote Eval Parameters: Contracts + +## SDK + +### Evaluator + +#### `parameters` + +A map from parameter name to parameter spec, declared in the evaluator definition. This is the source of truth for what parameters exist and what their defaults are. 
+ +```pseudocode +evaluator.parameters = { + "model": { type: "model", default: "gpt-4", description: "Model to use" }, + "temperature": { type: "data", default: 0.7, description: "Sampling temperature" }, + "max_length": { type: "data", default: 100, description: "Max output length" } +} +``` + +Each parameter spec: + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `default` | `any` | No | Value used when the `POST /eval` request does not include this parameter | +| `description` | `string` | No | Human-readable description shown in the Playground UI | +| `type` | `string` | No | Type hint — `"data"` (default), `"model"`, or `"prompt"`. See `parameter` entry under `GET /list` Response Format. | + +#### `task` + +A callable that optionally declares a `parameters` argument. When declared, it receives the merged parameter map (request values overlaid on evaluator defaults) as a plain string-keyed object. + +Tasks that do not declare `parameters` must continue to work unchanged — the SDK must not pass `parameters` to functions that don't accept it. + +**Side effect**: the merged `parameters` map is passed to the task function on every test case invocation during a `POST /eval` run. + +#### `scorers` + +Local scorer functions follow the same contract as `task` with respect to parameters — they optionally declare `parameters` and receive the same merged map if they do. The SDK must not pass `parameters` to scorers that don't declare it. + +Remote scorers (sent by the Playground in the `POST /eval` request) also receive the merged parameters via the SDK's remote scorer invocation mechanism. + +**Side effect**: the merged `parameters` map is passed to every scorer function (local and remote) on every test case invocation during a `POST /eval` run. + +### Dev Server + +#### `GET /list` + +##### Request Format + +No body. Accepts both `GET` and `POST`. 
+ +``` +GET /list +Authorization: Bearer <token> +X-Bt-Org-Name: <org name> +``` + +##### Response Format + +``` +HTTP 200 OK +Content-Type: application/json +``` + +Body: a JSON object keyed by evaluator name. For each evaluator, the `parameters` field contains a `parameters` object serialized from the evaluator's `parameters` definition, or `null` if the evaluator defines no parameters. + +```json +{ + "food-classifier": { + "scores": [{ "name": "exact_match" }], + "parameters": { + "type": "braintrust.staticParameters", + "schema": { + "model": { + "type": "model", + "schema": { "type": "string" }, + "default": "gpt-4", + "description": "Model to use" + }, + "temperature": { + "type": "data", + "schema": { "type": "number" }, + "default": 0.7, + "description": "Sampling temperature" + } + }, + "source": null + } + }, + "text-summarizer": { + "scores": [], + "parameters": null + } +} +``` + +**`parameters` object:** + +| Field | Type | Description | +|-------|------|-------------| +| `type` | `string` | Always `"braintrust.staticParameters"` for inline (code-defined) parameters | +| `schema` | `Record<string, parameter>` | Map of parameter name to definition (see the `parameter` entry below) | +| `source` | `null` | Always `null` for static parameters. Non-null values reference remotely-stored parameter definitions — out of scope for baseline. | + +When the evaluator defines no parameters, set `"parameters": null` or omit the field. + +> **Note for existing SDK implementors**: Prior to the introduction of the container format, some SDKs returned the `schema` map directly (i.e. a bare `Record<string, parameter>` map) rather than wrapping it in a `parameters` object with `type` and `source` fields. The container was introduced to distinguish static (inline) parameters from dynamic (remotely-stored) ones. If updating an existing SDK, check whether it predates this format and update accordingly.
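Producing this container from code-defined specs can be sketched as follows. This is an illustrative Python sketch, not taken from any SDK: the function name and the JSON-type inference helper are assumptions.

```python
# Sketch: build the "braintrust.staticParameters" container from
# code-defined parameter specs. Names here are illustrative.

# Best-effort mapping from a default value's Python type to a JSON Schema type.
JSON_TYPES = {str: "string", bool: "boolean", int: "number", float: "number",
              list: "array", dict: "object"}

def serialize_parameters(specs):
    """specs: {name: {"type": ..., "default": ..., "description": ...}} or empty."""
    if not specs:
        return None  # evaluator defines no parameters
    schema = {}
    for name, spec in specs.items():
        entry = {"type": spec.get("type", "data")}
        if "default" in spec:
            entry["default"] = spec["default"]
            json_type = JSON_TYPES.get(type(spec["default"]))
            if json_type is not None:
                entry["schema"] = {"type": json_type}  # omitted when unknown
        if "description" in spec:
            entry["description"] = spec["description"]
        schema[str(name)] = entry
    return {"type": "braintrust.staticParameters", "schema": schema, "source": None}
```

Inferring the `schema` fragment from the default's runtime type, as above, is optional; an SDK may instead let the evaluator declare the value schema explicitly.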
+ +**`parameter` entry** (each value in `schema`): + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `type` | `string` | Yes | `"data"` for generic values; `"model"` for a model picker; `"prompt"` for a prompt editor. For a baseline implementation, `"data"` is sufficient. | +| `schema` | `object` | No | JSON Schema fragment describing the value shape. Set `type` to `"string"`, `"number"`, `"boolean"`, `"object"`, or `"array"` to match the parameter's value type. Used by the Playground to render appropriate input controls. Omit if the type is unknown or mixed. | +| `default` | `any` | No | Default value. Should match the type described by `schema`. | +| `description` | `string` | No | Human-readable description shown in the Playground UI. | + +**Serialization**: each entry in `evaluator.parameters` maps to a `parameter` entry in the `schema` object. The parameter name becomes the key; the spec fields (`default`, `description`, `type`) are preserved as-is. + +##### Error Responses + +| Status | Condition | +|--------|-----------| +| `401 Unauthorized` | Missing or invalid auth token | + +#### `POST /eval` + +##### Request Format + +``` +POST /eval +Content-Type: application/json +Authorization: Bearer <token> +X-Bt-Org-Name: <org name> +``` + +The `parameters` field in the request body carries the user's chosen values from the Playground UI: + +```json +{ + "name": "food-classifier", + "data": { ... }, + "parameters": { + "model": "gpt-4o", + "temperature": 0.9 + } +} +``` + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `parameters` | `Record<string, any>` | No | Parameter values chosen by the user. Keys match the evaluator's parameter names. Absent, `null`, and `{}` all mean no overrides were provided. | + +See the [Dev Server specification](../server/specification.md) for the full `POST /eval` request schema (all fields beyond `parameters`). + +##### Response Format + +An SSE stream.
The `parameters` field has no effect on the response format — progress, summary, and done events have the same structure as without parameters. + +See the [Dev Server specification](../server/specification.md) for the full SSE event schema. + +**Side effect**: the merged parameters (request values overlaid on evaluator defaults) are forwarded to the task and all scorers on every test case invocation. Output values in the SSE stream reflect whatever the task produced using those parameters. + +##### Error Responses + +| Status | Condition | +|--------|-----------| +| `400 Bad Request` | `parameters` field is present but not a JSON object | +| `401 Unauthorized` | Missing or invalid auth token | +| `404 Not Found` | No evaluator registered with the given `name` | + +--- + +## References + +- [Braintrust: Remote evals guide](https://www.braintrust.dev/docs/evaluate/remote-evals) +- [Dev Server specification](../server/specification.md) — full `POST /eval` and `GET /list` schemas diff --git a/features/remote-evals/params/design.md b/features/remote-evals/params/design.md new file mode 100644 index 0000000..2222276 --- /dev/null +++ b/features/remote-evals/params/design.md @@ -0,0 +1,159 @@ +# Remote Eval Parameters: Design + +## Overview + +Eval parameters are **declared** in the user's application (in code), **discovered** by the Braintrust UI (via the application's `GET /list`), then finally **executed** by the application (via `POST /eval`).
+ +``` +Phase 1: Declaration (code) + Developer writes evaluator with parameter specs + | + v +Phase 2: Discovery (GET /list) + Playground fetches evaluator metadata + Dev server serializes parameter specs to JSON schema + Playground renders UI controls + | + v +Phase 3: Execution (POST /eval) + Playground sends user-chosen values in request body + Dev server merges values with evaluator defaults + Dev server runs the eval with merged values +``` + +## Components + +### Evaluator + +*In the SDK...* + +An evaluator defines the template for an eval. It also defines parameters, their defaults, and their descriptions. This is the single source of truth -- the UI reads from it and the runtime uses it to execute evals. + +During `GET /list`, the dev server reads parameter specs from each evaluator and serializes them for the Playground. + +During `POST /eval`, the dev server reads the same specs to extract default values for any parameters the user didn't override. + +### Dev Server + +#### `GET /list` + +Serializes each evaluator's parameter specs into the format the Playground expects. See [contracts.md](contracts.md) for the exact schema. + +The serialization must preserve: + +- Parameter names (keys) +- Types (e.g., `"data"` for generic values) +- Default values +- Descriptions + +#### `POST /eval` + +Responsible for merging request parameters with evaluator defaults and forwarding the merged result to the task and scorers. + +##### Merging parameters + +The merge is straightforward: + +1. Collect default values from the evaluator's parameter specs (one per named parameter) +2. Overlay the request's `"parameters"` object on top (request values override defaults) +3. Forward the merged result + +``` +defaults = { "model": "gpt-4", "temperature": 0.7 } +request = { "model": "gpt-4o" } +merged = { "model": "gpt-4o", "temperature": 0.7 } +``` + +Parameters not present in the request keep their defaults. 
Parameters sent by the request but not declared in the evaluator are passed through as-is (do not reject unknown keys). + +If neither the request nor the evaluator defines a value for a parameter, omit it from the merged result. Do not pass `null` or `undefined` for undeclared parameters. + +##### Executing an eval + +The merged parameters must be passed to both the task function and all scorer functions (both local scorers defined in the evaluator and remote scorers sent by the Playground). + +**Backward compatibility is critical.** Tasks and scorers that do not declare they want parameters must not break. How this compatibility is implemented is language-specific (e.g., Ruby uses a `KeywordFilter` to filter out parameters; Python and JavaScript pass parameters via a `hooks` object that callers can ignore). + +##### When no parameters are defined + +If the evaluator defines no parameters and the request body contains no `"parameters"` field, the task and scorers receive no parameters (or an empty value, depending on language conventions). Callers that don't declare `parameters` in their signature are unaffected. + +If the evaluator defines parameters but the request body omits the `"parameters"` field, apply evaluator defaults only (treat the request as if it sent `{}`). + +### Tasks + +The task function may optionally accept a `parameters` argument. If it does, it receives the merged parameter values as a plain key-value mapping with string keys. If it does not, it runs unchanged. + +### Scorers + +Same as task functions. Both local scorers and remote scorers can optionally access parameters. The same merged parameter values are passed to all scorers. + +## Design Decisions + +### String keys + +Parameters arrive as JSON from the Playground, so they naturally have string keys. 
Evaluator definitions may use language-idiomatic key types (e.g., symbol keys in Ruby), but the merged result forwarded to task/scorer functions must always use string keys, consistent with the JSON origin. + +### Requests override defaults + +When both the request and the evaluator define a value for the same parameter, the request wins. This allows the Playground to be the authoritative source of runtime configuration without requiring the evaluator to hardcode values. + +### No type coercion required for basic implementation + +For simple parameters (strings, numbers, booleans), the value from the request can be used as-is without additional type validation. More advanced implementations may choose to validate or coerce types using the parameter's declared type, but this is not required for baseline correctness. + +### Prompt parameters are a special case + +The Playground defines a `"prompt"` parameter type that carries a full prompt definition (messages, model settings, etc.) rather than a scalar value. When a prompt parameter is sent, its value is a structured JSON object conforming to the prompt data schema. + +SDKs that support prompt parameters may deserialize this JSON into a higher-level `Prompt` object that provides template rendering. SDKs that don't yet support this type can pass the raw JSON object through as-is. Tasks that access a prompt parameter must be aware of the form it takes. + +For a baseline implementation, supporting `"data"` type (generic values) is sufficient. `"prompt"` type support can be added later. + +### Evaluator parameters are optional + +Evaluators that don't define parameters continue to work as before. The `parameters` field in both `GET /list` and `POST /eval` is optional. An evaluator without parameters responds to `GET /list` with `"parameters": null` (or omits the field). 
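Taken together, the merging, string-key, and backward-compatibility rules above can be sketched in Python. The function names are illustrative, and signature inspection is only one possible compatibility mechanism (the SDKs mentioned earlier each use their own):

```python
import inspect

def merge_parameters(specs, request_params):
    """Overlay request values on evaluator defaults; string keys throughout."""
    merged = {}
    for name, spec in (specs or {}).items():
        if "default" in spec:                    # params with no default are
            merged[str(name)] = spec["default"]  # omitted, never passed as null
    for name, value in (request_params or {}).items():
        merged[str(name)] = value  # request wins; unknown keys pass through
    return merged

def call_with_parameters(fn, args, parameters):
    """Pass `parameters` only to callables that declare it."""
    if "parameters" in inspect.signature(fn).parameters:
        return fn(*args, parameters=parameters)
    return fn(*args)
```

A task written as `lambda input: ...` is invoked unchanged, while `lambda input, parameters: ...` receives the merged map, which matches the compatibility requirement above.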
+ +## Process Flow + +``` +Playground Dev Server Evaluator + | | | + | GET /list | | + |----------------------------->| | + | | read parameter specs | + | |<--------------------------| + | | | + | 200 { "eval-name": { | | + | parameters: { ... } } } | | + |<-----------------------------| | + | | | + | Render UI controls | | + | User adjusts params | | + | | | + | POST /eval | | + | { parameters: { | | + | "model": "gpt-4o" } } | | + |----------------------------->| | + | | read defaults from specs | + | |<--------------------------| + | | | + | | merge: request + defaults | + | | = { "model": "gpt-4o", | + | | "temp": 0.7 } | + | | | + | | run task with merged params + | |-------------------------->| + | | | task(input, parameters) + | | | + | | run scorers with merged params + | |-------------------------->| + | | | scorer(input, expected, output, parameters) + | | | + | SSE: progress events | | + |<-----------------------------| | + | SSE: summary | | + |<-----------------------------| | + | SSE: done | | + |<-----------------------------| | +``` diff --git a/features/remote-evals/params/validation.md b/features/remote-evals/params/validation.md new file mode 100644 index 0000000..790ddb3 --- /dev/null +++ b/features/remote-evals/params/validation.md @@ -0,0 +1,251 @@ +# Remote Eval Parameters: Validation + +This document describes the scenarios and behaviors that an implementation must support to be considered correct. Each scenario includes the conditions, inputs, and expected outcomes. + +--- + +## 1. Parameter Declaration and Discovery + +### 1.1 Evaluator with Parameters Appears in `/list` + +**Purpose**: Confirm the Playground receives parameter metadata when fetching available evaluators. + +**Conditions**: Dev server running with at least one evaluator that declares parameters. 
+ +**Input**: `GET /list` + +**Expected**: +- Response is `200 OK` +- Response body is a JSON object +- The evaluator's entry includes a `"parameters"` field with `"type": "braintrust.staticParameters"` +- Each declared parameter appears in the `"schema"` subfield with its `"default"` and `"description"` preserved +- Parameter names match what was declared in the evaluator definition + +### 1.2 Evaluator Without Parameters Returns Null + +**Purpose**: Confirm evaluators that define no parameters don't break the `/list` response. + +**Conditions**: Dev server running with at least one evaluator that declares no parameters. + +**Input**: `GET /list` + +**Expected**: +- Response is `200 OK` +- The evaluator's entry has `"parameters": null` or omits the field entirely + +### 1.3 Mixed Evaluators in Same `/list` Response + +**Purpose**: Confirm that parameter-enabled and parameter-free evaluators can coexist in the same response. + +**Conditions**: Dev server with multiple evaluators, some with parameters and some without. + +**Input**: `GET /list` + +**Expected**: +- All evaluators appear in the response +- Parameter-enabled evaluators have valid `parameters` containers (`"type": "braintrust.staticParameters"`) +- Parameter-free evaluators have `null` or omitted `parameters` + +--- + +## 2. Parameter Merging + +### 2.1 Request Values Override Evaluator Defaults + +**Purpose**: Confirm user-supplied values take precedence over code-defined defaults. + +**Conditions**: Evaluator declares a parameter with a known default value. + +**Input**: `POST /eval` with `parameters: { "<name>": "<override>" }` + +**Expected**: Task function receives `"<name>": "<override>"` (not the default). + +### 2.2 Evaluator Defaults Fill Missing Request Parameters + +**Purpose**: Confirm that parameters not sent in the request still reach the task with their default values. + +**Conditions**: Evaluator declares parameters A and B with defaults. Request only sends a value for A.
+ +**Input**: `POST /eval` with `parameters: { "A": "override" }` + +**Expected**: Task receives `{ "A": "override", "B": <B's default> }`. + +### 2.3 All Defaults Applied When Request Omits `parameters` + +**Purpose**: Confirm that omitting `parameters` in the request body still results in defaults being forwarded. + +**Conditions**: Evaluator declares parameters with defaults. + +**Input**: `POST /eval` with no `parameters` field in the body (or `"parameters": null`). + +**Expected**: Task receives all evaluator defaults. + +### 2.4 Empty `parameters` Object Treated Same as Absent + +**Purpose**: `{}` is equivalent to omitting the field -- both mean "no overrides." + +**Conditions**: Evaluator declares parameters with defaults. + +**Input**: `POST /eval` with `"parameters": {}`. + +**Expected**: Task receives all evaluator defaults (same as scenario 2.3). + +### 2.5 Unknown Parameters Passed Through + +**Purpose**: Parameters not declared in the evaluator are forwarded without error. + +**Conditions**: Evaluator declares parameter A. Request sends parameter A and unknown parameter B. + +**Input**: `POST /eval` with `"parameters": { "A": "x", "B": "y" }`. + +**Expected**: Task receives `{ "A": "x", "B": "y" }`. No error is returned. + +--- + +## 3. Task and Scorer Access + +### 3.1 Task That Declares `parameters` Receives Merged Values + +**Purpose**: Verify end-to-end delivery to the task function. + +**Conditions**: Evaluator declares parameter `"suffix"` with default `""`. Task declares and uses `parameters`. + +**Input**: `POST /eval` with `"parameters": { "suffix": "!" }`, one test case with `input: "hello"`. + +**Expected**: Task output is `"hello!"` (input + suffix from parameters). + +### 3.2 Task Without `parameters` Declaration Is Unaffected + +**Purpose**: Backward compatibility -- existing tasks must not break. + +**Conditions**: Evaluator declares parameters. Task is an existing function that does not declare a `parameters` argument.
+ +**Input**: `POST /eval` with `"parameters": { "suffix": "!" }`. + +**Expected**: Task runs successfully. Task does not receive `parameters` and is not affected by it. + +### 3.3 Scorer That Declares `parameters` Receives Merged Values + +**Purpose**: Verify scorers can access parameters, not just tasks. + +**Conditions**: Evaluator has a scorer that declares `parameters`. Parameter `"threshold"` is sent with value `0.8`. + +**Input**: `POST /eval` with `"parameters": { "threshold": 0.8 }`. + +**Expected**: Scorer receives `parameters` containing `"threshold": 0.8` and can use it in scoring logic. + +### 3.4 Scorer Without `parameters` Declaration Is Unaffected + +**Purpose**: Backward compatibility for scorers. + +**Conditions**: Evaluator declares parameters. Scorer is written without a `parameters` argument. + +**Input**: `POST /eval` with parameters. + +**Expected**: Scorer runs successfully and is not affected. + +### 3.5 All Scorers Receive Same Parameters + +**Purpose**: Confirm consistency when multiple scorers are registered. + +**Conditions**: Evaluator has two local scorers -- one that declares `parameters`, one that doesn't. + +**Input**: `POST /eval` with parameters. + +**Expected**: The scorer with `parameters` declared receives the merged values. The scorer without it runs unchanged. Both produce valid scores. + +### 3.6 Remote Scorers Also Receive Parameters + +**Purpose**: Confirm Playground-provided remote scorers are also passed parameters. + +**Conditions**: Evaluator is run with both local scorers and remote scorers (sent via `"scores"` in the request). + +**Input**: `POST /eval` with parameters and `"scores"` containing remote scorer references. + +**Expected**: Remote scorer functions also receive the merged parameters (as their SDK/invocation mechanism allows). + +--- + +## 4. No-Parameter Cases + +### 4.1 Evaluator Without Parameters Runs Normally + +**Purpose**: Evaluators that never defined parameters continue to work. 
+ +**Conditions**: Evaluator declares no parameters. Request body has no `"parameters"` field. + +**Input**: Standard `POST /eval` request. + +**Expected**: Eval completes successfully. Task and scorers are called without parameters. + +### 4.2 Task Receives Empty Map When No Parameters Defined + +**Purpose**: Tasks that do declare `parameters` get a safe, empty value when the evaluator has no parameter definitions and none were sent. + +**Conditions**: Evaluator has no parameter definitions. Task declares `parameters`. + +**Input**: `POST /eval` with no `"parameters"` field. + +**Expected**: Task receives an empty map (e.g., `{}`), not `null`. Task runs without error. + +--- + +## 5. Correctness of SSE Output + +### 5.1 Progress Events Reflect Task Output Using Parameters + +**Purpose**: Confirm that the task output visible in the SSE stream reflects parameter usage. + +**Conditions**: Task uses a parameter value to modify its output. + +**Input**: `POST /eval` with `"parameters": { "suffix": "!" }`, one case with `input: "hello"`. + +**Expected**: SSE `progress` event contains `"data"` field with the task's output incorporating the parameter (e.g., `"hello!"`). + +### 5.2 Summary Event Is Unaffected by Parameters + +**Purpose**: The SSE `summary` event format doesn't change when parameters are in use. + +**Conditions**: Evaluator with parameters. + +**Input**: `POST /eval` with parameters. + +**Expected**: SSE `summary` event has the same structure as without parameters: `{ "scores": {...}, "experiment_name": ..., "experiment_id": ..., "project_id": ... }`. + +--- + +## 6. Error Handling + +### 6.1 Invalid `parameters` Type Returns 400 + +**Purpose**: Guard against malformed requests. + +**Conditions**: Request sends `parameters` as a non-object (e.g., a string or array). + +**Input**: `POST /eval` with `"parameters": "not-an-object"`. + +**Expected**: `400 Bad Request` with a descriptive error message. 
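The body-handling behavior exercised by scenarios 2.3, 2.4, and 6.1 can be sketched in Python. This is a minimal sketch with hypothetical names: `BadRequest` stands in for whatever maps to an HTTP 400 in the host framework.

```python
# Sketch of POST /eval request-body handling for the `parameters` field.

class BadRequest(Exception):
    """Hypothetical marker; the dev server would turn this into a 400."""

def extract_parameters(body):
    """Return the override map from a POST /eval body, or raise on bad types."""
    params = body.get("parameters")
    if params is None:
        return {}  # absent or null: no overrides, defaults apply (2.3)
    if not isinstance(params, dict):
        raise BadRequest("'parameters' must be a JSON object")  # 6.1
    return params  # an empty object likewise means no overrides (2.4)
```

Note that only the field's JSON type is checked; per the merging rules, unknown keys inside a valid object are passed through rather than rejected.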
+ +### 6.2 Parameters Don't Cause Evaluator Lookup to Fail + +**Purpose**: The presence of `parameters` in the request must not interfere with evaluator name resolution. + +**Conditions**: Request includes valid `parameters` for an evaluator that does not define any. + +**Input**: `POST /eval` targeting a no-parameter evaluator, with `"parameters": { "foo": "bar" }`. + +**Expected**: Eval runs normally. No error from unexpected parameters. + +--- + +## 7. Parallel Execution + +### 7.1 Parameters Are Consistent Across Parallel Cases + +**Purpose**: When an eval runs with parallelism > 1, all cases receive the same parameter values. + +**Conditions**: Eval configured with `parallelism: N > 1`. Multiple test cases. + +**Input**: `POST /eval` with parameters and multiple test cases. + +**Expected**: Each test case's task invocation receives the same merged parameter map. Output reflects consistent parameter usage across all cases.