Refactor TypeScript SDK for better DX #537

@christso

Summary

The AgentV TypeScript SDK (@agentv/core, @agentv/eval) should be refactored to offer a developer experience on par with Braintrust's TypeScript SDK. Braintrust nails the "single-file, declarative eval" pattern that makes writing evals feel as natural as writing tests.

What Braintrust gets right

1. Single declarative Eval() call — one file, zero ceremony

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("my-eval", {
  data: () => [
    { input: "What is 2+2?", expected: "4" },
  ],
  task: async (input) => {
    return await callMyLLM(input);
  },
  scores: [Factuality],
});

That's a complete eval in one file: data, task, and scoring are all co-located, and the only imports are braintrust and autoevals (callMyLLM stands in for your own model call). Run it with bunx braintrust eval my-eval.ts.

2. Scorers are just functions

scores: [
  async ({ input, output, expected }) => ({
    name: "CriteriaMatch",
    score: output.includes("4") ? 1.0 : 0.0,
    metadata: { rationale: "Checked for correct answer" },
  }),
],

No defineCodeJudge(), no defineAssertion(), no stdin/stdout protocol, no separate files. Just a function that returns { name, score }.

3. Pre-built scorers via autoevals

import { Factuality, Levenshtein } from "autoevals";
scores: [Levenshtein]  // one-liner

Battle-tested LLM-as-judge scorers available out of the box.

4. First-class metadata at every level

Data records, eval config, and scorer results all support metadata — enabling filtering, grouping, and rich debugging in the UI.


Where AgentV's TypeScript SDK falls short

| Concern | Braintrust | AgentV |
| --- | --- | --- |
| Minimal eval | 1 file, ~15 lines | YAML + separate .ts files for judges/assertions |
| Define a scorer | Inline function ({ input, output, expected }) => score | defineCodeJudge() or defineAssertion() in a separate file under .agentv/ |
| Pre-built scorers | autoevals package (Factuality, Levenshtein, etc.) | None; must write from scratch or use LLM judge YAML config |
| Task function | async (input) => output, using any SDK and any provider | Tied to target provider config in YAML |
| Run command | bunx braintrust eval file.ts | agentv eval file.yaml (YAML-first, TS is an escape hatch) |
| Co-location | Data + task + scores in one file | Spread across YAML spec, assertion files, judge files, config files |

Proposed changes

A. Support a single-file Eval() pattern

import { Eval } from "@agentv/core";

Eval("customs-rules", {
  data: () => [
    { input: "BE", expected: "Should ask for missing rule codes" },
  ],
  task: async (input) => { /* call your agent */ },
  scores: [
    async ({ input, output, expected }) => ({
      name: "CriteriaMatch",
      score: 0.8,
    }),
  ],
});

This would be a TypeScript-first entry point alongside the existing YAML flow — not replacing it.
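
To make the proposal concrete, here is a minimal sketch of the loop such an entry point might run internally. Everything in it is an assumption for illustration: runEval, EvalSpec, and the result shape are hypothetical names, not the existing @agentv/core API, and the real implementation would also need concurrency, reporting, and CLI discovery.

```typescript
// Hypothetical sketch only: none of these names exist in @agentv/core today.
type EvalCase<I, E> = { input: I; expected?: E; metadata?: Record<string, unknown> };
type Score = { name: string; score: number; metadata?: Record<string, unknown> };
type Scorer<I, O, E> = (args: {
  input: I;
  output: O;
  expected: E | undefined;
}) => Score | Promise<Score>;

interface EvalSpec<I, O, E> {
  data: () => EvalCase<I, E>[] | Promise<EvalCase<I, E>[]>;
  task: (input: I) => O | Promise<O>;
  scores: Scorer<I, O, E>[];
}

interface EvalRow<I, O, E> {
  input: I;
  output: O;
  expected: E | undefined;
  scores: Score[];
}

// Runs every case through the task, then through every scorer, sequentially.
async function runEval<I, O, E>(
  name: string,
  spec: EvalSpec<I, O, E>,
): Promise<{ name: string; results: EvalRow<I, O, E>[] }> {
  const cases = await spec.data();
  const results: EvalRow<I, O, E>[] = [];
  for (const c of cases) {
    const output = await spec.task(c.input);
    const scores = await Promise.all(
      spec.scores.map((scorer) =>
        scorer({ input: c.input, output, expected: c.expected }),
      ),
    );
    results.push({ input: c.input, output, expected: c.expected, scores });
  }
  return { name, results };
}

// Usage with a stubbed task standing in for a real agent call.
runEval<string, string, string>("customs-rules", {
  data: () => [{ input: "BE", expected: "rule codes" }],
  task: async (input) => `Please provide the missing rule codes for ${input}.`,
  scores: [
    ({ output, expected }) => ({
      name: "Contains",
      score: expected !== undefined && output.includes(expected) ? 1 : 0,
    }),
  ],
}).then((report) => console.log(JSON.stringify(report, null, 2)));
```

Running cases sequentially keeps the sketch easy to follow; a real runner would likely want bounded parallelism.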

B. Allow inline scorer functions

Scorers should be expressible as plain functions without requiring defineCodeJudge / defineAssertion wrappers or separate files. The current helpers are fine for advanced cases but shouldn't be the only path.
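
One way this could work, sketched under assumptions: the runner accepts a plain function and normalizes whatever it returns. Accepting a bare number or boolean goes slightly beyond the { name, score } shape shown earlier and is an assumption of this sketch, as are the names InlineScorer, normalizeScore, and applyScorer.

```typescript
// Hypothetical sketch: a scorer may return a bare number, a boolean,
// or a full result object. None of these names exist in AgentV today.
type ScoreResult = { name: string; score: number; metadata?: Record<string, unknown> };
type ScorerArgs = { input: string; output: string; expected?: string };
type InlineScorer = (args: ScorerArgs) => number | boolean | ScoreResult;

// Converts any supported return value into a validated ScoreResult.
function normalizeScore(
  raw: number | boolean | ScoreResult,
  fallbackName: string,
): ScoreResult {
  const result: ScoreResult =
    typeof raw === "object"
      ? raw
      : { name: fallbackName, score: typeof raw === "boolean" ? (raw ? 1 : 0) : raw };
  if (!Number.isFinite(result.score) || result.score < 0 || result.score > 1) {
    throw new RangeError(
      `Scorer "${result.name}" returned ${result.score}; expected a number in [0, 1]`,
    );
  }
  return result;
}

// Uses the function's own name when it has one, otherwise a positional name.
function applyScorer(scorer: InlineScorer, args: ScorerArgs, index: number): ScoreResult {
  return normalizeScore(scorer(args), scorer.name || `scorer_${index}`);
}

// A named scorer keeps its name; an anonymous one falls back to its position.
const mentionsFour: InlineScorer = ({ output }) => output.includes("4");
const sample: ScorerArgs = { input: "What is 2+2?", output: "The answer is 4." };
console.log(applyScorer(mentionsFour, sample, 0));
console.log(applyScorer(({ output }) => output.length > 0, sample, 1));
```

The existing defineCodeJudge / defineAssertion helpers could then be thin wrappers that produce the same normalized shape.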

C. Ship a pre-built scorers package

An @agentv/scorers package (or scorers built into @agentv/eval) covering common patterns:

  • Contains / ExactMatch — deterministic checks
  • Factuality / Relevance — LLM-as-judge wrappers
  • JSONMatch — structural comparison
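
The deterministic scorers in this list are small enough to sketch. The version below is illustrative only: the ScorerArgs shape, the trimming and case-folding choices, and the error metadata are assumptions, not a spec. The LLM-as-judge scorers (Factuality, Relevance) are left out because they depend on a model client.

```typescript
// Hypothetical sketch of deterministic scorers; shapes and names are assumptions.
type ScoreResult = { name: string; score: number; metadata?: Record<string, unknown> };
type ScorerArgs = { input: string; output: string; expected?: string };

// 1 when output equals expected after trimming surrounding whitespace.
const ExactMatch = ({ output, expected }: ScorerArgs): ScoreResult => ({
  name: "ExactMatch",
  score: expected !== undefined && output.trim() === expected.trim() ? 1 : 0,
});

// 1 when output contains expected, ignoring case.
const Contains = ({ output, expected }: ScorerArgs): ScoreResult => ({
  name: "Contains",
  score:
    expected !== undefined && output.toLowerCase().includes(expected.toLowerCase())
      ? 1
      : 0,
});

// Structural equality that ignores object key order but respects array order.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}

// 1 when both strings parse as JSON and the parsed values are structurally equal.
const JSONMatch = ({ output, expected }: ScorerArgs): ScoreResult => {
  if (expected === undefined) {
    return { name: "JSONMatch", score: 0, metadata: { error: "no expected value" } };
  }
  try {
    const equal = deepEqual(JSON.parse(output), JSON.parse(expected));
    return { name: "JSONMatch", score: equal ? 1 : 0 };
  } catch (err) {
    return { name: "JSONMatch", score: 0, metadata: { error: String(err) } };
  }
};

console.log(
  JSONMatch({ input: "", output: '{"a":1,"b":[1,2]}', expected: '{"b":[1,2],"a":1}' }),
);
```

Because each scorer is a plain function with the same signature as an inline scorer, they could be mixed freely in a scores array.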

D. Make the task function provider-agnostic

The task field should accept any async function rather than requiring a target provider config. Let users bring their own OpenAI/Anthropic/OpenRouter client.
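
A sketch of what provider-agnostic could mean in practice: the runner depends only on a function type and never imports a provider SDK. The Task type, runTask, and the ChatClient interface below are hypothetical, and fakeClient is a stand-in for whichever real client a user brings.

```typescript
// Hypothetical sketch: the runner only knows about this function type.
type Task<I, O> = (input: I) => Promise<O>;

type TaskOutcome<O> = { ok: true; output: O } | { ok: false; error: string };

// Invokes the task and captures failures so one bad case does not abort the run.
async function runTask<I, O>(task: Task<I, O>, input: I): Promise<TaskOutcome<O>> {
  try {
    return { ok: true, output: await task(input) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Stand-in for any client (OpenAI, Anthropic, OpenRouter, an in-house agent).
// A real client would make a network call here.
interface ChatClient {
  complete(prompt: string): Promise<string>;
}
const fakeClient: ChatClient = {
  complete: async (prompt) => `echo: ${prompt}`,
};

// The task is just a closure over whatever client the user chose.
const echoTask: Task<string, string> = (input) => fakeClient.complete(input);

runTask(echoTask, "BE").then((outcome) => console.log(outcome));
```

Swapping providers then means changing only the closure, with no change to the eval definition or any YAML target config.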

E. First-class metadata

Support metadata on data records and scorer results for filtering and analysis.
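
As an illustration of the filtering and analysis this would enable, the sketch below slices scored rows by a metadata key. The ScoredRow shape and both helper names are assumptions made for the example, not part of any existing AgentV output format.

```typescript
// Hypothetical sketch: rows carry free-form metadata copied from the data record.
type Metadata = Record<string, unknown>;
interface ScoredRow {
  input: string;
  score: number;
  metadata?: Metadata;
}

// Keeps only rows whose metadata[key] is strictly equal to value.
function filterByMetadata(rows: ScoredRow[], key: string, value: unknown): ScoredRow[] {
  return rows.filter((row) => row.metadata?.[key] === value);
}

// Groups rows by metadata[key] and returns the mean score per group.
// Rows without the key are grouped under "(none)".
function meanScoreBy(rows: ScoredRow[], key: string): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const row of rows) {
    const group = String(row.metadata?.[key] ?? "(none)");
    const entry = sums.get(group) ?? { total: 0, count: 0 };
    entry.total += row.score;
    entry.count += 1;
    sums.set(group, entry);
  }
  const means = new Map<string, number>();
  sums.forEach(({ total, count }, group) => means.set(group, total / count));
  return means;
}

const rows: ScoredRow[] = [
  { input: "BE", score: 1, metadata: { category: "missing-codes" } },
  { input: "NL", score: 0, metadata: { category: "missing-codes" } },
  { input: "DE", score: 0.5, metadata: { category: "ambiguous" } },
  { input: "FR", score: 1 },
];

console.log(filterByMetadata(rows, "category", "ambiguous").length); // 1
console.log(meanScoreBy(rows, "category").get("missing-codes")); // 0.5
```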


Goal

A developer should be able to go from zero to a working eval in one TypeScript file, under 20 lines, with the same ease as writing a Jest test. The YAML-first approach is great for non-developers, but the TypeScript SDK should be equally first-class for engineers.
