`features/remote-evals/params/README.md`

# Remote Eval Parameters: Overview

## What Are Eval Parameters?

**Eval parameters** let users configure evaluator behavior from the Braintrust Playground without changing code. Developers declare named parameters in their evaluator -- anything that affects how the eval runs: a model name, a similarity threshold, a feature flag, a service URL, a max output length, etc. The Playground renders these as UI controls (sliders, text inputs, etc.) and passes the user's chosen values to the evaluator when running.

This makes it easy to compare how a system behaves under different configurations -- for example, running the same test cases with `temperature: 0.2` vs `temperature: 0.9`, or against a staging vs. production endpoint -- without deploying new code.

## How It Works

```
Braintrust Playground Developer's Machine
+----------------------+ +-------------------------+
| | | Dev Server |
| GET /list | --------------> | |
| | <-------------- | "food-classifier": |
| Render UI controls: | parameters: | parameters: |
| model: [gpt-4 v] | { model: ..., | model: "gpt-4" |
| temp: [0.7 ---] | temp: ... } | temperature: 0.7 |
| | | |
| User changes model | | |
| to "gpt-4o", clicks | POST /eval | |
| "Run" | --------------> | parameters: |
| | parameters: | { model: "gpt-4o", |
| | { model: | temperature: 0.7 } |
| | "gpt-4o" } | |
| Results stream back | <-------------- | task receives params |
+----------------------+ +-------------------------+
```

1. **Declaration**: The developer declares named parameters in the evaluator definition. Each parameter has a name, optional type, default value, and description.
2. **Discovery**: When the Playground fetches `GET /list`, the dev server includes parameter definitions in the response. The Playground renders appropriate UI controls for each parameter.
3. **Delivery**: When the user clicks "Run", the Playground sends the current parameter values in the `POST /eval` request body under the `"parameters"` key.
4. **Merging**: The dev server merges request values with evaluator defaults (request overrides defaults). This means parameters not changed by the user still have their default values.
5. **Forwarding**: The merged parameters are forwarded to the task function and all scorer functions as they run.
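
The merge in step 4 can be sketched as a shallow overlay of request values on declared defaults (a sketch, not the SDK implementation; `merge_parameters` is a hypothetical helper):

```python
def merge_parameters(specs, request_params):
    """Overlay request values on evaluator defaults (request wins).

    `specs` maps parameter name -> spec dict with an optional "default";
    `request_params` is the "parameters" object from POST /eval (may be None).
    """
    merged = {
        name: spec["default"]
        for name, spec in specs.items()
        if "default" in spec
    }
    merged.update(request_params or {})
    return merged

specs = {
    "model": {"type": "model", "default": "gpt-4"},
    "temperature": {"type": "data", "default": 0.7},
}
# User only overrode the model; temperature keeps its default.
merge_parameters(specs, {"model": "gpt-4o"})
# -> {"model": "gpt-4o", "temperature": 0.7}
```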

## Key Concepts

**Parameter definition** -- A declaration in the evaluator specifying a parameter's name, default value, and optional metadata (type, description). Defined once in code; used to populate UI controls.

**Parameter values** -- The runtime values the Playground sends per-run. These override any defaults defined in the evaluator.

**Backward compatibility** -- Tasks and scorers that do not declare a `parameters` argument must continue to work unchanged. The SDK is responsible for omitting `parameters` when calling functions that don't accept it.
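
One way an SDK might implement this filtering (a sketch using Python's `inspect` module; not the actual SDK code, and the helper name is hypothetical):

```python
import inspect

def call_with_optional_parameters(fn, args, parameters):
    """Pass `parameters` only if `fn` declares a slot for it."""
    sig = inspect.signature(fn)
    accepts = "parameters" in sig.parameters or any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        for p in sig.parameters.values()
    )
    if accepts:
        return fn(*args, parameters=parameters)
    return fn(*args)

def legacy_task(input):           # predates parameters; keeps working
    return input.upper()

def new_task(input, parameters):  # opts in to parameters
    return f"{input}:{parameters['model']}"

call_with_optional_parameters(legacy_task, ["apple"], {"model": "gpt-4o"})  # "APPLE"
call_with_optional_parameters(new_task, ["apple"], {"model": "gpt-4o"})     # "apple:gpt-4o"
```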

## Example

```pseudocode
# Define an evaluator with parameters
evaluator = Evaluator(
task = (input, parameters) => MyModel.classify(input, model: parameters["model"]),
scorers = [
Scorer("exact_match", (expected, output) => output == expected ? 1.0 : 0.0)
],
parameters = {
"model": { type: "model", default: "gpt-4", description: "Model to use" },
"temperature": { type: "data", default: 0.7, description: "Sampling temperature" }
}
)
```

The Playground renders a model picker and temperature input. When the user selects "gpt-4o" and clicks "Run", the task receives `parameters = {"model": "gpt-4o", "temperature": 0.7}` (temperature keeps its default since the user didn't change it).

## Parameters vs. Input

**Input** is per-case data — each test case has its own `input` value (e.g., `"apple"`, `"carrot"`). It varies case-by-case and represents *what* is being evaluated.

**Parameters** are per-run configuration — the same values apply to every test case in the run. They represent *how* the evaluator behaves.

The typical workflow: run the same dataset (same inputs) with different parameter values to compare configurations. For example, run `model: "gpt-4"` and `model: "gpt-4o"` against identical test cases, then compare scores side-by-side in the Playground.
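
A toy sketch of that workflow (the `run_eval` helper and the task are illustrative only, not the SDK API):

```python
def run_eval(task, dataset, parameters):
    """Run one configuration: the same parameters apply to every test case."""
    return [task(case["input"], parameters) for case in dataset]

dataset = [{"input": "apple"}, {"input": "carrot"}]  # same inputs for both runs

def task(input, parameters):
    return f"{parameters['model']}:{input}"

# Two runs over identical cases, differing only in parameters.
run_a = run_eval(task, dataset, {"model": "gpt-4"})   # ["gpt-4:apple", "gpt-4:carrot"]
run_b = run_eval(task, dataset, {"model": "gpt-4o"})  # ["gpt-4o:apple", "gpt-4o:carrot"]
```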

## Further Reading

| Document | Purpose |
|----------|---------|
| [design.md](design.md) | End-to-end flow, component roles, and design decisions |
| [contracts.md](contracts.md) | Wire protocol, data types, and API schemas |
| [validation.md](validation.md) | Test scenarios and expected behaviors |

### Related Specs

- [Remote Eval Dev Server](../server/README.md) -- The broader remote eval feature this builds on

---

`features/remote-evals/params/contracts.md`

# Remote Eval Parameters: Contracts

## SDK

### Evaluator

#### `parameters`

A map from parameter name to parameter spec, declared in the evaluator definition. This is the source of truth for what parameters exist and what their defaults are.

```pseudocode
evaluator.parameters = {
"model": { type: "model", default: "gpt-4", description: "Model to use" },
"temperature": { type: "data", default: 0.7, description: "Sampling temperature" },
"max_length": { type: "data", default: 100, description: "Max output length" }
}
```

Each parameter spec:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `default` | `any` | No | Value used when the `POST /eval` request does not include this parameter |
| `description` | `string` | No | Human-readable description shown in the Playground UI |
| `type` | `string` | No | Type hint — `"data"` (default), `"model"`, or `"prompt"`. See `parameter` entry under `GET /list` Response Format. |

#### `task`

A callable that optionally declares a `parameters` argument. When declared, it receives the merged parameter map (request values overlaid on evaluator defaults) as a plain string-keyed object.

Tasks that do not declare `parameters` must continue to work unchanged — the SDK must not pass `parameters` to functions that don't accept it.

**Side effect**: the merged `parameters` map is passed to the task function on every test case invocation during a `POST /eval` run.

#### `scorers`

Local scorer functions follow the same contract as `task` with respect to parameters — they optionally declare `parameters` and receive the same merged map if they do. The SDK must not pass `parameters` to scorers that don't declare it.

Remote scorers (sent by the Playground in the `POST /eval` request) also receive the merged parameters via the SDK's remote scorer invocation mechanism.

**Side effect**: the merged `parameters` map is passed to every scorer function (local and remote) on every test case invocation during a `POST /eval` run.

### Dev Server

#### `GET /list`

##### Request Format

No body. Accepts both `GET` and `POST`.

```
GET /list
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

##### Response Format

```
HTTP 200 OK
Content-Type: application/json
```

Body: a JSON object keyed by evaluator name. For each evaluator, the `parameters` field contains a container object (described below) serialized from the evaluator's `parameters` declaration, or `null` if the evaluator defines no parameters.

```json
{
"food-classifier": {
"scores": [{ "name": "exact_match" }],
"parameters": {
"type": "braintrust.staticParameters",
"schema": {
"model": {
"type": "data",
"schema": { "type": "string" },
"default": "gpt-4",
"description": "Model to use"
},
"temperature": {
"type": "data",
"schema": { "type": "number" },
"default": 0.7,
"description": "Sampling temperature"
}
},
"source": null
}
},
"text-summarizer": {
"scores": [],
"parameters": null
}
}
```

**`parameters` object:**

| Field | Type | Description |
|-------|------|-------------|
| `type` | `string` | Always `"braintrust.staticParameters"` for inline (code-defined) parameters |
| `schema` | `Record<string, parameter>` | Map of parameter name to definition |
| `source` | `null` | Always `null` for static parameters. Non-null values reference remotely-stored parameter definitions (delivered under the `"braintrust.parameters"` container type instead) — out of scope for baseline. |

When the evaluator defines no parameters, set `"parameters": null` or omit the field.

> **Note for existing SDK implementors**: Prior to the introduction of the container format, some SDKs returned the `schema` map directly (i.e. `Record<string, parameter>`) rather than wrapping it in a `parameters` object with `type` and `source` fields. The container was introduced to distinguish static (inline) parameters from dynamic (remotely-stored) ones. If updating an existing SDK, check whether it predates this format and update accordingly.
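
A client that must accept both shapes could normalize them as follows (a sketch; the heuristic assumes no legacy schema has a top-level parameter literally named `"type"` whose value is that exact string):

```python
def normalize_parameters(raw):
    """Accept both the container format and the legacy bare-schema format.

    Returns a dict in the container shape, or None if no parameters.
    """
    if raw is None:
        return None
    if raw.get("type") == "braintrust.staticParameters":
        return {
            "type": raw["type"],
            "schema": raw.get("schema", {}),
            "source": raw.get("source"),
        }
    # Legacy pre-container SDKs returned the schema map directly.
    return {"type": "braintrust.staticParameters", "schema": raw, "source": None}
```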

**`parameter` entry** (each value in `schema`):

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | `string` | Yes | `"data"` for generic values; `"model"` for a model picker; `"prompt"` for a prompt editor. For a baseline implementation, `"data"` is sufficient. |
| `schema` | `object` | No | JSON Schema fragment describing the value shape. Set `type` to `"string"`, `"number"`, `"boolean"`, `"object"`, or `"array"` to match the parameter's value type. Used by the Playground to render appropriate input controls. Omit if the type is unknown or mixed. |
| `default` | `any` | No | Default value. Should match the type described by `schema`. |
| `description` | `string` | No | Human-readable description shown in the Playground UI. |

**Serialization**: each entry in `evaluator.parameters` maps to a `parameter` entry in the `schema` object. The parameter name becomes the key; the spec fields (`default`, `description`, `type`) are preserved as-is.
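
One way a dev server might perform this serialization (a sketch; inferring the per-parameter JSON Schema fragment from the default's runtime type is an assumption, and the fragment is omitted when there is no default):

```python
JSON_TYPES = {str: "string", bool: "boolean", int: "number", float: "number",
              dict: "object", list: "array"}

def serialize_parameters(specs):
    """Wrap `evaluator.parameters` into the GET /list container format."""
    if not specs:
        return None
    schema = {}
    for name, spec in specs.items():
        entry = {"type": spec.get("type", "data")}
        if "default" in spec:
            entry["default"] = spec["default"]
            # Assumption: infer the JSON Schema fragment from the default's type.
            json_type = JSON_TYPES.get(type(spec["default"]))
            if json_type:
                entry["schema"] = {"type": json_type}
        if "description" in spec:
            entry["description"] = spec["description"]
        schema[name] = entry
    return {"type": "braintrust.staticParameters", "schema": schema, "source": None}
```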

##### Error Responses

| Status | Condition |
|--------|-----------|
| `401 Unauthorized` | Missing or invalid auth token |

#### `POST /eval`

##### Request Format

```
POST /eval
Content-Type: application/json
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

The `parameters` field in the request body carries the user's chosen values from the Playground UI:

```json
{
"name": "food-classifier",
"data": { ... },
"parameters": {
"model": "gpt-4o",
"temperature": 0.9
}
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `parameters` | `Record<string, unknown>` | No | Parameter values chosen by the user. Keys match the evaluator's parameter names. Absent, `null`, and `{}` all mean no overrides were provided. |
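
A dev server can normalize the field per the table above (a sketch; the helper name is hypothetical):

```python
def extract_parameter_overrides(body):
    """Normalize the optional "parameters" field of a POST /eval body.

    Absent, null, and {} all mean "no overrides"; any other non-object
    value is rejected (mapped to 400 Bad Request by the caller).
    """
    raw = body.get("parameters")
    if raw is None:
        return {}
    if not isinstance(raw, dict):
        raise ValueError("'parameters' must be a JSON object")  # -> 400
    return raw
```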

See the [Dev Server specification](../server/specification.md) for the full `POST /eval` request schema (all fields beyond `parameters`).

##### Response Format

An SSE stream. The `parameters` field has no effect on the response format — progress, summary, and done events have the same structure as in runs without parameters.

See the [Dev Server specification](../server/specification.md) for the full SSE event schema.

**Side effect**: the merged parameters (request values overlaid on evaluator defaults) are forwarded to the task and all scorers on every test case invocation. Output values in the SSE stream reflect whatever the task produced using those parameters.

##### Error Responses

| Status | Condition |
|--------|-----------|
| `400 Bad Request` | `parameters` field is present but not a JSON object |
| `401 Unauthorized` | Missing or invalid auth token |
| `404 Not Found` | No evaluator registered with the given `name` |

---

## References

- [Braintrust: Remote evals guide](https://www.braintrust.dev/docs/evaluate/remote-evals)
- [Dev Server specification](../server/specification.md) — full `POST /eval` and `GET /list` schemas