feat: Remote evals params #7

# Remote Eval Parameters: Overview

## What Are Eval Parameters?

**Eval parameters** let users configure evaluator behavior from the Braintrust Playground without changing code. Developers declare named parameters in their evaluator -- anything that affects how the eval runs: a model name, a similarity threshold, a feature flag, a service URL, a max output length, and so on. The Playground renders these as UI controls (sliders, text inputs, etc.) and passes the user's chosen values to the evaluator when running.

This makes it easy to compare how a system behaves under different configurations -- for example, running the same test cases with `temperature: 0.2` vs. `temperature: 0.9`, or against a staging vs. production endpoint -- without deploying new code.

## How It Works

```
Braintrust Playground                        Developer's Machine
+----------------------+                     +-------------------------+
|                      |                     | Dev Server              |
|  GET /list           |  ---------------->  |                         |
|                      |  <----------------  |  "food-classifier":     |
|  Render UI controls: |    parameters:      |    parameters:          |
|   model: [gpt-4 v]   |    { model: ...,    |      model: "gpt-4"     |
|   temp:  [0.7 ---]   |      temp: ... }    |      temperature: 0.7   |
|                      |                     |                         |
|  User changes model  |                     |                         |
|  to "gpt-4o", clicks |    POST /eval       |                         |
|  "Run"               |  ---------------->  |  parameters:            |
|                      |    parameters:      |  { model: "gpt-4o",     |
|                      |    { model:         |    temperature: 0.7 }   |
|                      |      "gpt-4o" }     |                         |
|  Results stream back |  <----------------  |  task receives params   |
+----------------------+                     +-------------------------+
```

1. **Declaration**: The developer declares named parameters in the evaluator definition. Each parameter has a name, optional type, default value, and description.
2. **Discovery**: When the Playground fetches `GET /list`, the dev server includes parameter definitions in the response. The Playground renders appropriate UI controls for each parameter.
3. **Delivery**: When the user clicks "Run", the Playground sends the current parameter values in the `POST /eval` request body under the `"parameters"` key.
4. **Merging**: The dev server merges request values with evaluator defaults (request values override defaults), so parameters the user did not change keep their default values.
5. **Forwarding**: The merged parameters are forwarded to the task function and all scorer functions as they run.
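
The merge in step 4 can be sketched in a few lines (a minimal illustration, not the SDK's actual implementation; the function name is hypothetical):

```python
def merge_parameters(defaults, overrides):
    """Overlay request-supplied values on evaluator defaults.

    `defaults` comes from the evaluator's parameter specs;
    `overrides` is the `parameters` object from the POST /eval body.
    """
    merged = dict(defaults)          # start from declared defaults
    merged.update(overrides or {})   # request values win; None means "no overrides"
    return merged

# Example: the user only changed `model` in the Playground UI.
merged = merge_parameters(
    {"model": "gpt-4", "temperature": 0.7},
    {"model": "gpt-4o"},
)
# merged == {"model": "gpt-4o", "temperature": 0.7}
```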

## Key Concepts

**Parameter definition** -- A declaration in the evaluator specifying a parameter's name, default value, and optional metadata (type, description). Defined once in code; used to populate UI controls.

**Parameter values** -- The runtime values the Playground sends per run. These override any defaults defined in the evaluator.

**Backward compatibility** -- Tasks and scorers that do not declare parameters must continue to work unchanged. The SDK is responsible for filtering parameters out of calls to functions that don't expect them.

## Example

```pseudocode
# Define an evaluator with parameters
evaluator = Evaluator(
  task = (input, parameters) => MyModel.classify(input, model: parameters["model"]),
  scorers = [
    Scorer("exact_match", (expected, output) => output == expected ? 1.0 : 0.0)
  ],
  parameters = {
    "model": { type: "model", default: "gpt-4", description: "Model to use" },
    "temperature": { type: "data", default: 0.7, description: "Sampling temperature" }
  }
)
```

The Playground renders a model picker and a temperature input. When the user selects "gpt-4o" and clicks "Run", the task receives `parameters = {"model": "gpt-4o", "temperature": 0.7}` (temperature keeps its default since the user didn't change it).

## Parameters vs. Input

**Input** is per-case data -- each test case has its own `input` value (e.g., `"apple"`, `"carrot"`). It varies case by case and represents *what* is being evaluated.

**Parameters** are per-run configuration -- the same values apply to every test case in the run. They represent *how* the evaluator behaves.

The typical workflow: run the same dataset (same inputs) with different parameter values to compare configurations. For example, run `model: "gpt-4"` and `model: "gpt-4o"` against identical test cases, then compare scores side by side in the Playground.
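
That comparison workflow looks roughly like this (a sketch only; `run_eval` is a hypothetical stand-in with a fake prediction rule, not a real runner -- a real run would invoke the task and scorers for each case):

```python
def run_eval(dataset, parameters):
    """Hypothetical runner: returns the average score for one parameter config."""
    # Fake prediction rule purely for illustration: only "gpt-4o" gets cases right.
    predict = lambda case: case["expected"] if parameters["model"] == "gpt-4o" else "unknown"
    scores = [1.0 if predict(c) == c["expected"] else 0.0 for c in dataset]
    return sum(scores) / len(scores)

# Same inputs for every run -- only the parameters change.
dataset = [
    {"input": "apple", "expected": "fruit"},
    {"input": "carrot", "expected": "vegetable"},
]

# Two parameter configurations over identical test cases; compare average scores.
results = {
    model: run_eval(dataset, {"model": model, "temperature": 0.7})
    for model in ["gpt-4", "gpt-4o"]
}
# results == {"gpt-4": 0.0, "gpt-4o": 1.0}
```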

## Further Reading

| Document | Purpose |
|----------|---------|
| [design.md](design.md) | End-to-end flow, component roles, and design decisions |
| [contracts.md](contracts.md) | Wire protocol, data types, and API schemas |
| [validation.md](validation.md) | Test scenarios and expected behaviors |

### Related Specs

- [Remote Eval Dev Server](../server/README.md) -- The broader remote eval feature this builds on

---

# Remote Eval Parameters: Contracts

## SDK

### Evaluator

#### `parameters`

A map from parameter name to parameter spec, declared in the evaluator definition. This is the source of truth for which parameters exist and what their defaults are.

```pseudocode
evaluator.parameters = {
  "model": { type: "model", default: "gpt-4", description: "Model to use" },
  "temperature": { type: "data", default: 0.7, description: "Sampling temperature" },
  "max_length": { type: "data", default: 100, description: "Max output length" }
}
```

Each parameter spec:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `default` | `any` | No | Value used when the `POST /eval` request does not include this parameter |
| `description` | `string` | No | Human-readable description shown in the Playground UI |
| `type` | `string` | No | Type hint -- `"data"` (default), `"model"`, or `"prompt"`. See the `parameter` entry under the `GET /list` Response Format. |

#### `task`

A callable that optionally declares a `parameters` argument. When declared, it receives the merged parameter map (request values overlaid on evaluator defaults) as a plain string-keyed object.

Tasks that do not declare `parameters` must continue to work unchanged -- the SDK must not pass `parameters` to functions that don't accept it.
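
One way an SDK can honor this contract (a sketch in Python; the helper name is hypothetical, not part of any real SDK) is to inspect the callable's signature before invoking it:

```python
import inspect

def call_with_optional_parameters(fn, *args, parameters=None):
    """Invoke `fn`, passing `parameters` only if its signature accepts it.

    Keeps parameter-unaware tasks and scorers working unchanged.
    """
    sig = inspect.signature(fn)
    accepts = "parameters" in sig.parameters or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()
    )
    if accepts and parameters is not None:
        return fn(*args, parameters=parameters)
    return fn(*args)

# A legacy task that knows nothing about parameters:
legacy_task = lambda input: input.upper()
# A parameter-aware task:
new_task = lambda input, parameters: f"{input}:{parameters['model']}"

call_with_optional_parameters(legacy_task, "apple")                               # "APPLE"
call_with_optional_parameters(new_task, "apple", parameters={"model": "gpt-4o"})  # "apple:gpt-4o"
```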

**Side effect**: the merged `parameters` map is passed to the task function on every test case invocation during a `POST /eval` run.

#### `scorers`

Local scorer functions follow the same contract as `task` with respect to parameters -- they optionally declare `parameters` and receive the same merged map if they do. The SDK must not pass `parameters` to scorers that don't declare it.

Remote scorers (sent by the Playground in the `POST /eval` request) also receive the merged parameters via the SDK's remote scorer invocation mechanism.

**Side effect**: the merged `parameters` map is passed to every scorer function (local and remote) on every test case invocation during a `POST /eval` run.

### Dev Server

#### `GET /list`

##### Request Format

No body. Accepts both `GET` and `POST`.

```
GET /list
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

##### Response Format

```
HTTP 200 OK
Content-Type: application/json
```

Body: a JSON object keyed by evaluator name. For each evaluator, the `parameters` field contains a `parameters` object serialized from the evaluator's `parameters` definition, or `null` if the evaluator defines no parameters.

```json
{
  "food-classifier": {
    "scores": [{ "name": "exact_match" }],
    "parameters": {
      "type": "braintrust.staticParameters",
      "schema": {
        "model": {
          "type": "data",
          "schema": { "type": "string" },
          "default": "gpt-4",
          "description": "Model to use"
        },
        "temperature": {
          "type": "data",
          "schema": { "type": "number" },
          "default": 0.7,
          "description": "Sampling temperature"
        }
      },
      "source": null
    }
  },
  "text-summarizer": {
    "scores": [],
    "parameters": null
  }
}
```

**`parameters` object:**

| Field | Type | Description |
|-------|------|-------------|
| `type` | `string` | Always `"braintrust.staticParameters"` for inline (code-defined) parameters |
| `schema` | `Record<string, parameter>` | Map of parameter name to definition |
| `source` | `null` | Always `null` for static parameters. Non-null values reference remotely-stored parameter definitions -- out of scope for baseline. |

When the evaluator defines no parameters, set `"parameters": null` or omit the field.

> **Note for existing SDK implementors**: Prior to the introduction of the container format, some SDKs returned the `schema` map directly (i.e. `Record<string, parameter>`) rather than wrapping it in a `parameters` object with `type` and `source` fields. The container was introduced to distinguish static (inline) parameters from dynamic (remotely-stored) ones. If updating an existing SDK, check whether it predates this format and update accordingly.
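
A consumer that must tolerate both formats can normalize them on read (a sketch; the heuristic assumes a legacy bare map never contains a top-level `"type": "braintrust.staticParameters"` entry):

```python
def normalize_parameters(raw):
    """Wrap a legacy bare schema map in the container format.

    New format:    {"type": "braintrust.staticParameters", "schema": {...}, "source": null}
    Legacy format: {...}  (the schema map itself, keyed by parameter name)
    """
    if raw is None:
        return None  # evaluator defines no parameters
    if raw.get("type") == "braintrust.staticParameters":
        return raw  # already the container format
    # Assume legacy: the value *is* the schema map.
    return {"type": "braintrust.staticParameters", "schema": raw, "source": None}

legacy = {"model": {"type": "data", "default": "gpt-4"}}
normalize_parameters(legacy)["schema"]["model"]["default"]  # "gpt-4"
```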

**`parameter` entry** (each value in `schema`):

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | `string` | Yes | `"data"` for generic values; `"model"` for a model picker; `"prompt"` for a prompt editor. For a baseline implementation, `"data"` is sufficient. |
| `schema` | `object` | No | JSON Schema fragment describing the value shape. Set `type` to `"string"`, `"number"`, `"boolean"`, `"object"`, or `"array"` to match the parameter's value type. Used by the Playground to render appropriate input controls. Omit if the type is unknown or mixed. |
| `default` | `any` | No | Default value. Should match the type described by `schema`. |
| `description` | `string` | No | Human-readable description shown in the Playground UI. |

**Serialization**: each entry in `evaluator.parameters` maps to a `parameter` entry in the `schema` object. The parameter name becomes the key; the spec fields (`default`, `description`, `type`) are preserved as-is.
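
Under those rules, building the `GET /list` container from the specs might look like this (a sketch, not the SDK's actual serializer; here the JSON-Schema fragment is inferred from the default's runtime type, which the table above leaves optional):

```python
def infer_json_schema(value):
    """Map a default value's runtime type to a JSON Schema fragment."""
    mapping = {str: "string", bool: "boolean", int: "number", float: "number",
               dict: "object", list: "array"}
    for py_type, json_type in mapping.items():  # bool checked before int on purpose
        if isinstance(value, py_type):
            return {"type": json_type}
    return None  # unknown or mixed type: omit the schema field

def serialize_parameters(specs):
    """Build the `parameters` container object for the GET /list response."""
    schema = {}
    for name, spec in specs.items():
        entry = {"type": spec.get("type", "data")}  # spec `type` preserved as-is
        if "default" in spec:
            inferred = infer_json_schema(spec["default"])
            if inferred is not None:
                entry["schema"] = inferred
            entry["default"] = spec["default"]
        if "description" in spec:
            entry["description"] = spec["description"]
        schema[name] = entry
    return {"type": "braintrust.staticParameters", "schema": schema, "source": None}

out = serialize_parameters({
    "model": {"type": "model", "default": "gpt-4", "description": "Model to use"},
    "temperature": {"default": 0.7},
})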

##### Error Responses

| Status | Condition |
|--------|-----------|
| `401 Unauthorized` | Missing or invalid auth token |

#### `POST /eval`

##### Request Format

```
POST /eval
Content-Type: application/json
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

The `parameters` field in the request body carries the user's chosen values from the Playground UI:

```json
{
  "name": "food-classifier",
  "data": { ... },
  "parameters": {
    "model": "gpt-4o",
    "temperature": 0.9
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `parameters` | `Record<string, unknown>` | No | Parameter values chosen by the user. Keys match the evaluator's parameter names. Absent, `null`, and `{}` all mean no overrides were provided. |
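
A handler can fold the absent/`null`/`{}` cases together and reject non-object values (a sketch of the validation rules above; the function name is hypothetical):

```python
def extract_parameter_overrides(body):
    """Pull parameter overrides from a POST /eval request body.

    Absent, null, and {} are all treated as "no overrides".
    Any other non-object value is a client error (400 Bad Request).
    """
    raw = body.get("parameters")
    if raw is None:
        return {}
    if not isinstance(raw, dict):
        raise ValueError("400 Bad Request: 'parameters' must be a JSON object")
    return raw

extract_parameter_overrides({"name": "food-classifier"})          # {}
extract_parameter_overrides({"parameters": None})                 # {}
extract_parameter_overrides({"parameters": {"model": "gpt-4o"}})  # {"model": "gpt-4o"}
```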

See the [Dev Server specification](../server/specification.md) for the full `POST /eval` request schema (all fields beyond `parameters`).

##### Response Format

An SSE stream. The `parameters` field has no effect on the response format -- progress, summary, and done events have the same structure as without parameters.

See the [Dev Server specification](../server/specification.md) for the full SSE event schema.

**Side effect**: the merged parameters (request values overlaid on evaluator defaults) are forwarded to the task and all scorers on every test case invocation. Output values in the SSE stream reflect whatever the task produced using those parameters.

##### Error Responses

| Status | Condition |
|--------|-----------|
| `400 Bad Request` | `parameters` field is present but not a JSON object |
| `401 Unauthorized` | Missing or invalid auth token |
| `404 Not Found` | No evaluator registered with the given `name` |

---

## References

- [Braintrust: Remote evals guide](https://www.braintrust.dev/docs/evaluate/remote-evals)
- [Dev Server specification](../server/specification.md) -- full `POST /eval` and `GET /list` schemas