Summary
The AgentV TypeScript SDK (@agentv/core, @agentv/eval) should be refactored to offer a developer experience on par with Braintrust's TypeScript SDK. Braintrust nails the "single-file, declarative eval" pattern that makes writing evals feel as natural as writing tests.
What Braintrust gets right
1. Single declarative Eval() call — one file, zero ceremony
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("my-eval", {
  data: () => [
    { input: "What is 2+2?", expected: "4" },
  ],
  task: async (input) => {
    return await callMyLLM(input);
  },
  scores: [Factuality],
});
```

That's a complete, runnable eval. One file. Two imports. Data, task, and scoring all co-located. Run with `bunx braintrust eval my-eval.ts`.
2. Scorers are just functions
```typescript
scores: [
  async ({ input, output, expected }) => ({
    name: "CriteriaMatch",
    score: output.includes("4") ? 1.0 : 0.0,
    metadata: { rationale: "Checked for correct answer" },
  }),
],
```

No `defineCodeJudge()`, no `defineAssertion()`, no stdin/stdout protocol, no separate files. Just a function that returns `{ name, score }`.
3. Pre-built scorers via autoevals
```typescript
import { Factuality, Levenshtein } from "autoevals";

scores: [Levenshtein] // one-liner
```

Battle-tested scorers, both heuristic (Levenshtein) and LLM-as-judge (Factuality), are available out of the box.
4. First-class metadata at every level
Data records, eval config, and scorer results all support metadata — enabling filtering, grouping, and rich debugging in the UI.
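For illustration, a data record carrying metadata might look like the following sketch (the `difficulty` and `region` keys are invented for this example, not part of any documented schema):

```typescript
// Illustrative data record: metadata rides alongside input/expected,
// so runs can later be filtered or grouped on these fields in the UI.
const record = {
  input: "What is 2+2?",
  expected: "4",
  metadata: { difficulty: "easy", region: "EU" }, // hypothetical keys
};
```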
Where AgentV's TypeScript SDK falls short
| Concern | Braintrust | AgentV |
|---|---|---|
| Minimal eval | 1 file, ~15 lines | YAML + separate `.ts` files for judges/assertions |
| Define a scorer | Inline function: `({ input, output, expected }) => score` | `defineCodeJudge()` or `defineAssertion()` in a separate file under `.agentv/` |
| Pre-built scorers | `autoevals` package (Factuality, Levenshtein, etc.) | None; must write from scratch or use LLM-judge YAML config |
| Task function | `async (input) => output`; use any SDK, any provider | Tied to target provider config in YAML |
| Run command | `bunx braintrust eval file.ts` | `agentv eval file.yaml` (YAML-first, TS is an escape hatch) |
| Co-location | Data + task + scores in one file | Spread across YAML spec, assertion files, judge files, config files |
Proposed changes
A. Support a single-file Eval() pattern
```typescript
import { Eval } from "@agentv/core";

Eval("customs-rules", {
  data: () => [
    { input: "BE", expected: "Should ask for missing rule codes" },
  ],
  task: async (input) => { /* call your agent */ },
  scores: [
    async ({ input, output, expected }) => ({
      name: "CriteriaMatch",
      score: 0.8,
    }),
  ],
});
```

This would be a TypeScript-first entry point alongside the existing YAML flow, not a replacement for it.
B. Allow inline scorer functions
Scorers should be expressible as plain functions without requiring defineCodeJudge / defineAssertion wrappers or separate files. The current helpers are fine for advanced cases but shouldn't be the only path.
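One possible shape for that contract is sketched below. These are proposed types, not the current SDK; `ScorerArgs`, `ScorerResult`, and `Scorer` are hypothetical names:

```typescript
// Hypothetical types for the proposed inline-scorer contract.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

interface ScorerResult {
  name: string;
  score: number; // 0.0 to 1.0
  metadata?: Record<string, unknown>;
}

type Scorer = (args: ScorerArgs) => ScorerResult | Promise<ScorerResult>;

// With that type, an inline scorer is just a function literal:
const criteriaMatch: Scorer = ({ output }) => ({
  name: "CriteriaMatch",
  score: output.includes("4") ? 1.0 : 0.0,
});
```

The wrapper helpers could then be reframed as conveniences that produce a `Scorer`, rather than the only entry point.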
C. Ship a pre-built scorers package
An `@agentv/scorers` package (or scorers built into `@agentv/eval`) with common patterns:

- `Contains` / `ExactMatch`: deterministic checks
- `Factuality` / `Relevance`: LLM-as-judge wrappers
- `JSONMatch`: structural comparison
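A minimal sketch of what the two deterministic scorers might look like (names and the result shape are assumptions matching the inline-scorer proposal above, not a published API):

```typescript
// Hypothetical result shape shared by all scorers in the package.
interface ScorerResult {
  name: string;
  score: number;
  metadata?: Record<string, unknown>;
}

// Exact string equality after trimming surrounding whitespace.
const ExactMatch = ({ output, expected }: { output: string; expected?: string }): ScorerResult => ({
  name: "ExactMatch",
  score: output.trim() === (expected ?? "").trim() ? 1.0 : 0.0,
});

// Substring containment: does the output mention the expected answer?
const Contains = ({ output, expected }: { output: string; expected?: string }): ScorerResult => ({
  name: "Contains",
  score: expected !== undefined && output.includes(expected) ? 1.0 : 0.0,
});
```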
D. Make the task function provider-agnostic
The task function should accept any async function, not require a target provider config. Let users bring their own OpenAI/Anthropic/OpenRouter client.
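Concretely, the runner would only need tasks to satisfy a minimal async signature. A sketch under that assumption (`Task` is a hypothetical name, and `myClient` is a stand-in for any real provider SDK):

```typescript
// Hypothetical task contract: any async string -> string function qualifies.
type Task = (input: string) => Promise<string>;

// Stand-in for a user-supplied client (OpenAI, Anthropic, OpenRouter, ...).
// A real client would make a network call here; this stub just echoes.
const myClient = {
  async complete(prompt: string): Promise<string> {
    return `echo: ${prompt}`;
  },
};

// The user wires their own client into the task; the runner never needs
// a provider config, only the function.
const task: Task = async (input) => myClient.complete(input);
```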
E. First-class metadata
Support metadata on data records and scorer results for filtering and analysis.
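As one illustration of the payoff, metadata on data records makes dataset slicing a plain array operation before a run (a sketch; the `country` and `tier` keys are invented for this example):

```typescript
// Hypothetical dataset where each record carries filterable metadata.
const dataset = [
  { input: "BE", expected: "Should ask for missing rule codes", metadata: { country: "BE", tier: "smoke" } },
  { input: "FR", expected: "Should resolve rule codes directly", metadata: { country: "FR", tier: "full" } },
];

// Run only the quick smoke-test slice.
const smokeOnly = dataset.filter((r) => r.metadata.tier === "smoke");
```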
Goal
A developer should be able to go from zero to a working eval in one TypeScript file, under 20 lines, with the same ease as writing a Jest test. The YAML-first approach is great for non-developers, but the TypeScript SDK should be equally first-class for engineers.