Summary
The AgentV TypeScript SDK (@agentv/core, @agentv/eval) should be refactored to offer a developer experience on par with Braintrust's TypeScript SDK. Braintrust nails the "single-file, declarative eval" pattern that makes writing evals feel as natural as writing tests.
What Braintrust gets right
1. Single declarative Eval() call — one file, zero ceremony
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("my-eval", {
  data: () => [
    { input: "What is 2+2?", expected: "4" },
  ],
  task: async (input) => {
    return await callMyLLM(input);
  },
  scores: [Factuality],
});
```

That's a complete, runnable eval. One file. Two imports. Data, task, and scoring all co-located. Run with `bunx braintrust eval my-eval.ts`.
2. Scorers are just functions
```typescript
scores: [
  async ({ input, output, expected }) => ({
    name: "CriteriaMatch",
    score: output.includes("4") ? 1.0 : 0.0,
    metadata: { rationale: "Checked for correct answer" },
  }),
],
```

No `defineCodeJudge()`, no `defineAssertion()`, no stdin/stdout protocol, no separate files. Just a function that returns `{ name, score }`.
3. Pre-built scorers via autoevals
```typescript
import { Factuality, Levenshtein } from "autoevals";

scores: [Levenshtein] // one-liner
```

Battle-tested scorers, both heuristic (Levenshtein) and LLM-as-judge (Factuality), are available out of the box.
4. First-class metadata at every level
Data records, eval config, and scorer results all support metadata — enabling filtering, grouping, and rich debugging in the UI.
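For illustration, a data record carrying metadata might look like the following sketch (the `difficulty` and `region` keys are invented for this example, not part of any documented schema):

```typescript
// Illustrative data record: metadata rides alongside input/expected,
// so runs can later be filtered or grouped on these fields in the UI.
const record = {
  input: "What is 2+2?",
  expected: "4",
  metadata: { difficulty: "easy", region: "EU" }, // hypothetical keys
};
```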
Where AgentV's TypeScript SDK falls short
| Concern | Braintrust | AgentV |
|---|---|---|
| Minimal eval | 1 file, ~15 lines | YAML + separate `.ts` files for judges/assertions |
| Define a scorer | Inline function: `({ input, output, expected }) => score` | `defineCodeJudge()` or `defineAssertion()` in a separate file under `.agentv/` |
| Pre-built scorers | `autoevals` package (Factuality, Levenshtein, etc.) | None; must write from scratch or use LLM-judge YAML config |
| Task function | `async (input) => output`; use any SDK, any provider | Tied to target provider config in YAML |
| Run command | `bunx braintrust eval file.ts` | `agentv eval file.yaml` (YAML-first, TS is an escape hatch) |
| Co-location | Data + task + scores in one file | Spread across YAML spec, assertion files, judge files, config files |
Proposed changes
A. Support a single-file Eval() pattern
```typescript
import { Eval } from "@agentv/core";

Eval("customs-rules", {
  data: () => [
    { input: "BE", expected: "Should ask for missing rule codes" },
  ],
  task: async (input) => { /* call your agent */ },
  scores: [
    async ({ input, output, expected }) => ({
      name: "CriteriaMatch",
      score: 0.8,
    }),
  ],
});
```

This would be a TypeScript-first entry point alongside the existing YAML flow, not a replacement for it.
B. Allow inline scorer functions
Scorers should be expressible as plain functions without requiring defineCodeJudge / defineAssertion wrappers or separate files. The current helpers are fine for advanced cases but shouldn't be the only path.
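One possible shape for that contract is sketched below. These are proposed types, not the current SDK; `ScorerArgs`, `ScorerResult`, and `Scorer` are hypothetical names:

```typescript
// Hypothetical types for the proposed inline-scorer contract.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

interface ScorerResult {
  name: string;
  score: number; // 0.0 to 1.0
  metadata?: Record<string, unknown>;
}

type Scorer = (args: ScorerArgs) => ScorerResult | Promise<ScorerResult>;

// With that type, an inline scorer is just a function literal:
const criteriaMatch: Scorer = ({ output }) => ({
  name: "CriteriaMatch",
  score: output.includes("4") ? 1.0 : 0.0,
});
```

The wrapper helpers could then be reframed as conveniences that produce a `Scorer`, rather than the only entry point.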
C. Ship a pre-built scorers package
An `@agentv/scorers` package (or scorers built into `@agentv/eval`) with common patterns:

- `Contains` / `ExactMatch`: deterministic checks
- `Factuality` / `Relevance`: LLM-as-judge wrappers
- `JSONMatch`: structural comparison
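A minimal sketch of what the two deterministic scorers might look like (names and the result shape are assumptions matching the inline-scorer proposal above, not a published API):

```typescript
// Hypothetical result shape shared by all scorers in the package.
interface ScorerResult {
  name: string;
  score: number;
  metadata?: Record<string, unknown>;
}

// Exact string equality after trimming surrounding whitespace.
const ExactMatch = ({ output, expected }: { output: string; expected?: string }): ScorerResult => ({
  name: "ExactMatch",
  score: output.trim() === (expected ?? "").trim() ? 1.0 : 0.0,
});

// Substring containment: does the output mention the expected answer?
const Contains = ({ output, expected }: { output: string; expected?: string }): ScorerResult => ({
  name: "Contains",
  score: expected !== undefined && output.includes(expected) ? 1.0 : 0.0,
});
```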
D. Make the task function provider-agnostic
The task function should accept any async function, not require a target provider config. Let users bring their own OpenAI/Anthropic/OpenRouter client.
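Concretely, the runner would only need tasks to satisfy a minimal async signature. A sketch under that assumption (`Task` is a hypothetical name, and `myClient` is a stand-in for any real provider SDK):

```typescript
// Hypothetical task contract: any async string -> string function qualifies.
type Task = (input: string) => Promise<string>;

// Stand-in for a user-supplied client (OpenAI, Anthropic, OpenRouter, ...).
// A real client would make a network call here; this stub just echoes.
const myClient = {
  async complete(prompt: string): Promise<string> {
    return `echo: ${prompt}`;
  },
};

// The user wires their own client into the task; the runner never needs
// a provider config, only the function.
const task: Task = async (input) => myClient.complete(input);
```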
E. First-class metadata
Support metadata on data records and scorer results for filtering and analysis.
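As one illustration of the payoff, metadata on data records makes dataset slicing a plain array operation before a run (a sketch; the `country` and `tier` keys are invented for this example):

```typescript
// Hypothetical dataset where each record carries filterable metadata.
const dataset = [
  { input: "BE", expected: "Should ask for missing rule codes", metadata: { country: "BE", tier: "smoke" } },
  { input: "FR", expected: "Should resolve rule codes directly", metadata: { country: "FR", tier: "full" } },
];

// Run only the quick smoke-test slice.
const smokeOnly = dataset.filter((r) => r.metadata.tier === "smoke");
```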
Goal
A developer should be able to go from zero to a working eval in one TypeScript file, under 20 lines, with the same ease as writing a Jest test. The YAML-first approach is great for non-developers, but the TypeScript SDK should be equally first-class for engineers.