This guide explains how to run evaluations on multiple inputs at once using the batch evaluation system.
- Quick Start
- Key Concepts
- Input Formats
- Basic Usage
- Export Options
- Advanced Features
- Configuration Reference
- Examples
The batch evaluation system allows you to process hundreds or thousands of evaluations efficiently with:
- Controlled concurrency - Limit parallel evaluations
- Rate limiting - Respect API quotas
- Progress tracking - Monitor progress in real-time
- Export formats - CSV and JSON
- Result streaming - Process results as they complete via callback
```typescript
import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

// 1. Create your evaluator
const qualityEvaluator = new Evaluator({
  name: "quality-check",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate the quality of this text from 1-10.",
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// 2. Create the batch evaluator
const batchEvaluator = new BatchEvaluator({
  evaluators: [qualityEvaluator],
  concurrency: 5,
});

// 3. Run evaluations from CSV/JSON
const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
  format: "csv",
});

// 4. Export results
await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
});

console.log(`Processed ${result.totalRows} rows`);
console.log(`Success rate: ${((result.successfulRows / result.totalRows) * 100).toFixed(1)}%`);
```

It's important to understand the difference between two types of prompts:
Evaluation Prompt (in Evaluator config)
- Defines how to evaluate the content
- Tells the AI what criteria to use
- Same for all rows in the batch
- Example: "Rate the translation quality from 1-10"
Generation Prompt (in input data)
- Describes what prompt generated the content
- Optional metadata for context
- Can be different for each row
- Example: "Translate 'Hello' to French"
```typescript
// Evaluation prompt - defines the evaluation criteria (same for all rows)
const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: `Evaluate the translation quality.
Consider accuracy, fluency, and naturalness.
Rate from 1-10.`,
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// Input data - contains the content to evaluate
const inputs = [
  {
    candidateText: "Bonjour",
    referenceText: "Hello",
  },
  {
    candidateText: "Guten Tag",
    referenceText: "Good day",
  },
];

// To include the generation prompt, add it to BatchEvaluatorConfig.defaultInput:
const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  defaultInput: {
    prompt: "Translate the source text", // Applied to all rows
  },
});
```

In most cases, you only need `candidateText` (and optionally `referenceText`) in your input data. If the generation prompt is needed for context, specify it in `defaultInput`.
If all your rows share the same generation prompt (or other common fields), you can specify it once in the BatchEvaluatorConfig instead of repeating it in every row:
```typescript
// All rows were generated with the same prompt
const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 5,
  defaultInput: {
    prompt: "Translate the following text to French", // Applied to all rows
    contentType: "translation",
  },
});
```

The input CSV then only needs the varying content:

```csv
candidateText,referenceText
"Bonjour","Hello"
"Merci","Thank you"
```

The `defaultInput` fields are merged with each row's data. If a row specifies its own value for a field, it overrides the default.
Your input data contains the content to be evaluated, not the evaluation criteria:
- `candidateText` (required) - The text output you want to evaluate
- `referenceText` (optional) - Ground truth or expected output for comparison
- `id` (optional) - Unique identifier for tracking
Important: The generation prompt (what prompt was used to generate the content) should be specified in BatchEvaluatorConfig.defaultInput.prompt, not in your input file. This avoids repetition when all rows share the same generation prompt.
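Because `candidateText` is the only required field, a cheap pre-flight check can catch malformed rows before any API calls are spent. A sketch, assuming plain row objects (`findInvalidRows` is a hypothetical helper, not an eval-kit export):

```typescript
// Hypothetical pre-flight check: returns the 0-based indices of rows
// that are missing a non-empty candidateText. Not part of eval-kit.
interface InputRow {
  candidateText?: unknown;
  referenceText?: unknown;
  id?: unknown;
}

function findInvalidRows(rows: InputRow[]): number[] {
  const bad: number[] = [];
  rows.forEach((row, index) => {
    if (typeof row.candidateText !== "string" || row.candidateText.trim() === "") {
      bad.push(index);
    }
  });
  return bad;
}
```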
Minimal example (just the content to evaluate):
```csv
candidateText
"The cat sits on the mat"
"Bonjour le monde"
"The weather is nice today"
```

With reference text for comparison:

```csv
candidateText,referenceText
"The cat sits on the mat","The cat is sitting on the mat"
"Bonjour le monde","Hello world"
```

With additional metadata:

```csv
id,candidateText,referenceText,sourceText,language
1,"Bonjour le monde","Hello world","Hello world","French"
2,"Guten Tag","Good day","Good day","German"
```

Minimal example:
```json
[
  {
    "candidateText": "The cat sits on the mat"
  },
  {
    "candidateText": "Bonjour le monde"
  }
]
```

With reference text:

```json
[
  {
    "candidateText": "The cat sits on the mat",
    "referenceText": "The cat is sitting on the mat"
  },
  {
    "candidateText": "Bonjour le monde",
    "referenceText": "Hello world"
  }
]
```

With additional metadata:

```json
[
  {
    "id": "1",
    "candidateText": "Bonjour le monde",
    "referenceText": "Hello world",
    "sourceText": "Hello world",
    "language": "French"
  },
  {
    "id": "2",
    "candidateText": "Guten Tag",
    "referenceText": "Good day",
    "sourceText": "Good day",
    "language": "German"
  }
]
```

If your data is nested, use `arrayPath`:
```json
{
  "metadata": { "version": "1.0" },
  "data": {
    "evaluations": [
      { "candidateText": "Hello", "prompt": "Translate" }
    ]
  }
}
```

```typescript
await batchEvaluator.evaluate({
  filePath: "./data.json",
  format: "json",
  jsonOptions: {
    arrayPath: "data.evaluations",
  },
});
```

Map your column names to standard fields:

```typescript
await batchEvaluator.evaluate({
  filePath: "./custom.csv",
  format: "csv",
  fieldMapping: {
    candidateText: "output_text", // Map the "output_text" column to candidateText
    referenceText: "expected",    // Map the "expected" column to referenceText
    sourceText: "input",          // Map the "input" column to sourceText
  },
});
```

Define how to evaluate the content. This evaluation prompt is the same for all rows:
```typescript
import { anthropic } from "@ai-sdk/anthropic";
import { Evaluator } from "eval-kit";

const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  // This evaluation prompt defines the criteria (same for all rows)
  evaluationPrompt: `Evaluate the translation quality.
Candidate: {{candidateText}}
{{#if referenceText}}Reference: {{referenceText}}{{/if}}
{{#if prompt}}Context: {{prompt}}{{/if}}
Consider accuracy, fluency, and naturalness. Rate from 1-10.`,
  scoreConfig: {
    type: "numeric",
    min: 1,
    max: 10,
    float: true,
  },
});
```

Note: you can use `{{prompt}}` in your evaluation prompt to include the generation prompt as context, but it's optional.
```typescript
import { BatchEvaluator } from "eval-kit";

const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 5, // Run 5 evaluations at a time
});

const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
  format: "csv",
});

console.log(`Total: ${result.totalRows}`);
console.log(`Successful: ${result.successfulRows}`);
console.log(`Failed: ${result.failedRows}`);
console.log(`Average Score: ${result.summary.averageScores[evaluator.name]}`);

// Individual results
for (const row of result.results) {
  console.log(`Row ${row.rowId}:`);
  console.log(`  Score: ${row.results[0].score}`);
  console.log(`  Feedback: ${row.results[0].feedback}`);
}

// Summary statistics
console.log(`Average processing time: ${result.summary.averageProcessingTime}ms`);
console.log(`Total tokens used: ${result.summary.totalTokensUsed}`);
console.log(`Error rate: ${(result.summary.errorRate * 100).toFixed(1)}%`);
```

```typescript
await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
  csvOptions: {
    flattenResults: true, // Flatten evaluator results into columns
    includeHeaders: true, // Include a header row
    delimiter: ",",       // CSV delimiter
  },
});
```

Output CSV columns:
```csv
rowId,rowIndex,candidateText,evaluatorName,score,feedback,success,executionTime
1,0,"Hello world","translation-quality",8.5,"Good translation",true,1234
```

```typescript
await batchEvaluator.export({
  format: "json",
  destination: "./results.json",
  jsonOptions: {
    pretty: true,          // Pretty-print JSON
    includeMetadata: true, // Include summary metadata
  },
});
```

Output JSON structure:
```json
{
  "metadata": {
    "exportedAt": "2025-01-01T12:00:00Z",
    "totalResults": 100,
    "successfulResults": 95,
    "failedResults": 5
  },
  "results": [
    {
      "rowId": "1",
      "rowIndex": 0,
      "input": { "candidateText": "..." },
      "results": [
        {
          "evaluatorName": "translation-quality",
          "score": 8.5,
          "feedback": "Good translation",
          "success": true
        }
      ]
    }
  ]
}
```

Export only specific results:
```typescript
await batchEvaluator.export({
  format: "csv",
  destination: "./failed-results.csv",
  // Only export failed evaluations
  filterCondition: (result) => result.error !== undefined,
});
```

Export only specific fields:

```typescript
await batchEvaluator.export({
  format: "json",
  destination: "./scores-only.json",
  // Only include these fields
  includeFields: ["rowId", "rowIndex", "results"],
});
```

Monitor progress in real-time using the `onProgress` callback:
```typescript
import { BatchEvaluator, type ProgressEvent } from "eval-kit";

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onProgress: (event: ProgressEvent) => {
    const timestamp = new Date().toISOString();
    const pct = event.percentComplete.toFixed(1);
    const eta = event.estimatedTimeRemaining
      ? `ETA: ${Math.round(event.estimatedTimeRemaining / 1000)}s`
      : "";
    console.log(
      `[${timestamp}] ${event.type.toUpperCase()} - ${event.processedRows}/${event.totalRows} (${pct}%) ${eta}`
    );
  },
  progressInterval: 2000, // Emit progress every 2 seconds
});
```

| Property | Type | Description |
|---|---|---|
| `type` | `ProgressEventType` | Event type: `started`, `progress`, `completed`, `error`, `retry`, `paused`, `resumed` |
| `timestamp` | `string` | ISO timestamp of the event |
| `totalRows` | `number` | Total number of rows to process |
| `processedRows` | `number` | Number of rows processed so far |
| `successfulRows` | `number` | Number of successful evaluations |
| `failedRows` | `number` | Number of failed evaluations |
| `currentRow` | `number?` | Index of the row currently being processed |
| `percentComplete` | `number` | Percentage complete (0-100) |
| `estimatedTimeRemaining` | `number?` | Estimated milliseconds remaining |
| `averageProcessingTime` | `number?` | Average milliseconds per row |
| `currentError` | `string?` | Error message if type is `error` |
| `retryCount` | `number?` | Retry attempt number if type is `retry` |
| `estimatedCostUSD` | `number?` | Running total estimated cost in USD |
| `estimatedTokensRemaining` | `number?` | Estimated tokens remaining to process |
```
[2025-12-02T17:45:23.456Z] STARTED - 0/5236 (0.0%)
[2025-12-02T17:45:24.789Z] PROGRESS - 50/5236 (1.0%) ETA: 520s
[2025-12-02T17:45:26.123Z] PROGRESS - 100/5236 (1.9%) ETA: 510s
...
[2025-12-02T17:54:12.456Z] COMPLETED - 5236/5236 (100.0%)
```
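For reference, an ETA like the one logged above can be derived from elapsed time and row counts. This is one plausible formula, not necessarily the library's exact implementation:

```typescript
// Illustrative ETA estimate: average elapsed time per processed row,
// multiplied by the rows still outstanding. The library may weight
// recent rows differently or account for concurrency.
function estimateTimeRemaining(
  processedRows: number,
  totalRows: number,
  elapsedMs: number
): number | undefined {
  if (processedRows === 0) return undefined; // no data yet
  const msPerRow = elapsedMs / processedRows;
  return (totalRows - processedRows) * msPerRow;
}

// 50 rows in 5000ms -> 100ms/row -> 5186 rows left -> 518600ms (~519s)
```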
For more detailed progress tracking with cost estimation:
```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onProgress: (event) => {
    console.log(`Progress: ${event.processedRows}/${event.totalRows}`);
    console.log(`  Success: ${event.successfulRows}, Failed: ${event.failedRows}`);
    console.log(`  ${event.percentComplete.toFixed(1)}% complete`);
    if (event.estimatedTimeRemaining) {
      const minutes = Math.ceil(event.estimatedTimeRemaining / 60000);
      console.log(`  ETA: ~${minutes} minutes`);
    }
    if (event.estimatedCostUSD) {
      console.log(`  Est. cost: $${event.estimatedCostUSD.toFixed(4)}`);
    }
  },
  progressInterval: 2000, // Emit progress every 2 seconds
});
```

Process each result as it completes using the `onResult` callback. This is useful for real-time logging, database writes, or custom integrations:
```typescript
import { appendFileSync } from "fs";

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  // Handle each result as it completes
  onResult: async (result) => {
    // Log to console
    console.log(`Processed row ${result.rowId}: score ${result.results[0]?.score}`);
    // Append to file incrementally
    appendFileSync("./results.jsonl", JSON.stringify(result) + "\n");
    // Or send to a database, webhook, etc.
    await saveToDatabase(result);
  },
});

await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
});
```

Why use `onResult`:
- Process results in real-time without waiting for the full batch
- Write to files incrementally for fault tolerance
- Send results to external systems as they complete
- Custom logging or alerting on specific conditions
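If you append results as JSONL (one JSON object per line, as in the example above), reading them back later is straightforward. A minimal sketch:

```typescript
// Parse a JSONL string back into objects.
// Blank lines are skipped, so a trailing newline is harmless.
function parseJsonl<T = unknown>(text: string): T[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as T);
}
```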
Respect API quotas:
```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  rateLimit: {
    maxRequestsPerMinute: 50, // At most 50 requests per minute
    maxRequestsPerHour: 1000, // At most 1000 requests per hour
  },
});
```

The system automatically pauses when a limit is reached.
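To illustrate what pausing means, here is a sketch of a sliding-window check that computes how long to wait before the next request. It shows the general idea only, not eval-kit's internals:

```typescript
// Sliding-window rate limit sketch: if `maxPerMinute` requests already
// happened in the last 60 seconds, wait until the oldest of them falls
// out of the window. Illustrative only; not eval-kit's implementation.
function msUntilNextRequest(
  requestTimestamps: number[], // ms timestamps of past requests, ascending
  maxPerMinute: number,
  now: number
): number {
  const windowStart = now - 60_000;
  const recent = requestTimestamps.filter((t) => t > windowStart);
  if (recent.length < maxPerMinute) return 0; // under the limit: go now
  // Wait until the oldest in-window request leaves the 60s window.
  return recent[0] + 60_000 - now;
}
```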
Handle transient errors automatically:
```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  retryConfig: {
    maxRetries: 3,            // Retry up to 3 times
    retryDelay: 1000,         // Wait 1 second between retries
    exponentialBackoff: true, // Use exponential backoff (1s, 2s, 4s)
    retryOnErrors: [          // Only retry these specific errors
      "rate limit",
      "timeout",
      "ECONNRESET",
    ],
  },
});
```

Use `startIndex` to skip rows that were already processed. This works well with the `onResult` callback to write results incrementally:
```typescript
import { appendFileSync, readFileSync, existsSync } from "fs";

// Helper to find where we left off
function getLastProcessedIndex(outputFile: string): number {
  if (!existsSync(outputFile)) return 0;
  const lines = readFileSync(outputFile, "utf-8").trim().split("\n");
  // An empty file trims to "", which splits to [""] - treat as no progress
  if (lines.length === 0 || lines[0] === "") return 0;
  const lastLine = JSON.parse(lines[lines.length - 1]);
  return lastLine.rowIndex + 1;
}

const outputFile = "./results.jsonl";
const startIndex = getLastProcessedIndex(outputFile);
console.log(`Resuming from row ${startIndex}`);

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onResult: (result) => {
    appendFileSync(outputFile, JSON.stringify(result) + "\n");
  },
});

await batchEvaluator.evaluate({
  filePath: "./inputs.json",
  startIndex, // Skip rows we've already processed
});
```

| Option | Location | Description |
|---|---|---|
| `startIndex` | `BatchInputConfig` | Skip rows before this index (0-based) |
| `appendToExisting` | `BatchExportConfig` | Append to an existing file when using `export()` |
Run multiple evaluators on each input:
```typescript
const qualityEvaluator = new Evaluator({ name: "quality", /* ... */ });
const tonalityEvaluator = new Evaluator({ name: "tonality", /* ... */ });
const accuracyEvaluator = new Evaluator({ name: "accuracy", /* ... */ });

const batchEvaluator = new BatchEvaluator({
  evaluators: [qualityEvaluator, tonalityEvaluator, accuracyEvaluator],
  concurrency: 3,
  evaluatorExecutionMode: "parallel", // Run evaluators in parallel (default)
  // or "sequential" to run one after another
});

const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
});

// Results include all evaluators
console.log(result.summary.averageScores);
// { quality: 8.2, tonality: 7.5, accuracy: 9.1 }
```

Process each result with custom logic:
```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onResult: async (result) => {
    // Custom processing for each result
    const score = result.results[0]?.score;
    // Log to a database
    await db.insert({ rowId: result.rowId, score });
    // Send an alert for low scores
    if (typeof score === "number" && score < 5) {
      await sendAlert(`Low score detected: ${result.rowId}`);
    }
    // Update a dashboard
    await updateDashboard(result);
  },
});
```

Set an evaluation timeout:
```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  timeout: 60000,     // Fail if an evaluation takes longer than 60 seconds
  stopOnError: false, // Continue even if some evaluations fail
});
```

```typescript
interface BatchEvaluatorConfig {
  // Required
  evaluators: Evaluator[];

  // Default input values applied to all rows (optional)
  defaultInput?: {
    prompt?: string;        // Default generation prompt
    referenceText?: string; // Default reference text
    sourceText?: string;    // Default source text
    contentType?: string;   // Default content type
    language?: string;      // Default language
    [key: string]: unknown; // Additional default fields
  };

  // Concurrency control (optional)
  concurrency?: number; // Default: 5
  evaluatorExecutionMode?: "parallel" | "sequential"; // Default: "parallel"
  rateLimit?: {
    maxRequestsPerMinute?: number;
    maxRequestsPerHour?: number;
  };

  // Error handling (optional)
  retryConfig?: {
    maxRetries?: number;          // Default: 3
    retryDelay?: number;          // Default: 1000ms
    exponentialBackoff?: boolean; // Default: true
    retryOnErrors?: string[];     // Specific error messages to retry
  };
  stopOnError?: boolean; // Default: false
  timeout?: number;      // Per-evaluation timeout in ms

  // Progress tracking (optional)
  onProgress?: (event: ProgressEvent) => void | Promise<void>;
  progressInterval?: number; // Default: 1000ms

  // Result streaming (optional)
  onResult?: (result: BatchEvaluationResult) => void | Promise<void>;
}
```

```typescript
interface BatchInputConfig {
  filePath: string;
  format?: "csv" | "json" | "auto"; // Default: "auto" (detect from extension)

  // Resume support
  startIndex?: number; // Skip rows before this index (0-based)

  // CSV options
  csvOptions?: {
    delimiter?: string;        // Default: ","
    quote?: string;            // Default: '"'
    escape?: string;           // Default: '"'
    headers?: boolean;         // Default: true
    skipEmptyLines?: boolean;  // Default: true
    encoding?: BufferEncoding; // Default: "utf-8"
  };

  // JSON options
  jsonOptions?: {
    arrayPath?: string;        // JSONPath to the array (e.g., "data.items")
    encoding?: BufferEncoding; // Default: "utf-8"
  };

  // Field mapping
  fieldMapping?: {
    candidateText: string; // Required
    referenceText?: string;
    sourceText?: string;
    contentType?: string;
    language?: string;
    id?: string;
  };
}
```

```typescript
interface BatchExportConfig {
  format: "csv" | "json";
  destination: string; // File path

  // Resume support
  appendToExisting?: boolean; // Append to an existing file instead of overwriting (default: false)

  // CSV options
  csvOptions?: {
    delimiter?: string;
    includeHeaders?: boolean;
    flattenResults?: boolean;
  };

  // JSON options
  jsonOptions?: {
    pretty?: boolean;
    includeMetadata?: boolean;
  };

  // Filtering
  includeFields?: string[];
  excludeFields?: string[];
  filterCondition?: (result: BatchEvaluationResult) => boolean;
}
```

Most commonly, you'll be evaluating multiple outputs that were all generated with the same prompt. In that case, the generation prompt is not needed in your input data:
Input file: `llm-outputs.csv`

```csv
id,candidateText
1,"The quick brown fox jumps over the lazy dog."
2,"To be or not to be, that is the question."
3,"In a galaxy far, far away..."
```

Evaluation code:
```typescript
import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

// Define how to evaluate (same criteria for all)
const grammarEvaluator = new Evaluator({
  name: "grammar-check",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: `Evaluate the grammar and writing quality.
Text: {{candidateText}}
Rate from 1-10 considering:
- Grammar correctness
- Sentence structure
- Clarity
Provide a score and brief feedback.`,
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// Run the batch evaluation
const batchEvaluator = new BatchEvaluator({
  evaluators: [grammarEvaluator],
  concurrency: 5,
});

const result = await batchEvaluator.evaluate({
  filePath: "./llm-outputs.csv",
  format: "csv",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./grammar-scores.csv",
});

console.log(`Evaluated ${result.totalRows} outputs`);
```

Result: `grammar-scores.csv`
```csv
rowId,candidateText,evaluatorName,score,feedback
1,"The quick brown fox...","grammar-check",9,"Excellent grammar and clarity"
2,"To be or not to be...","grammar-check",10,"Perfect sentence structure"
3,"In a galaxy far...","grammar-check",8,"Good structure, slightly informal"
```

```typescript
import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate translation quality 1-10.",
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 10,
  onProgress: (e) => console.log(`${e.percentComplete.toFixed(1)}%`),
});

const result = await batchEvaluator.evaluate({
  filePath: "./translations.csv",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
});
```

```typescript
const moderationEvaluator = new Evaluator({
  name: "content-moderation",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Is this content safe? Respond: SAFE or UNSAFE",
  scoreConfig: { type: "categorical", categories: ["SAFE", "UNSAFE"] },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [moderationEvaluator],
  concurrency: 20,
  // Alert on unsafe content as results come in
  onResult: async (result) => {
    if (result.results[0]?.score === "UNSAFE") {
      await sendAlert(`Unsafe content: ${result.rowId}`);
    }
  },
});

const result = await batchEvaluator.evaluate({
  filePath: "./user-content.json",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./moderation-results.csv",
});
```

```typescript
const relevanceEvaluator = new Evaluator({
  name: "relevance",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate relevance 1-5",
  scoreConfig: { type: "numeric", min: 1, max: 5 },
});

const qualityEvaluator = new Evaluator({
  name: "quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate quality 1-5",
  scoreConfig: { type: "numeric", min: 1, max: 5 },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [relevanceEvaluator, qualityEvaluator],
  concurrency: 5,
  evaluatorExecutionMode: "parallel",
});

const result = await batchEvaluator.evaluate({
  filePath: "./search-results.json",
  jsonOptions: {
    arrayPath: "results.items",
  },
});

// Export both scores for analysis
await batchEvaluator.export({
  format: "json",
  destination: "./analysis.json",
  jsonOptions: { pretty: true },
});

console.log("Average Relevance:", result.summary.averageScores.relevance);
console.log("Average Quality:", result.summary.averageScores.quality);
```

```typescript
const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 50, // High concurrency
  rateLimit: {
    maxRequestsPerMinute: 500,
    maxRequestsPerHour: 10000,
  },
  retryConfig: {
    maxRetries: 5,
    exponentialBackoff: true,
  },
  onProgress: (e) => {
    console.log(`Processed: ${e.processedRows}/${e.totalRows}`);
    if (e.estimatedTimeRemaining) {
      console.log(`ETA: ${Math.ceil(e.estimatedTimeRemaining / 60000)} min`);
    }
  },
});

const result = await batchEvaluator.evaluate({
  filePath: "./large-dataset.csv",
});

console.log(`Processed ${result.totalRows} items in ${result.durationMs / 1000}s`);
```

- Start with low concurrency (5-10) and increase based on API limits
- Use onResult callback for large batches to write results incrementally
- Enable progress tracking to monitor long-running batches
- Set appropriate rate limits based on your API quotas
- Use retry configuration to handle transient errors
- Test with small batches before running large-scale evaluations
Out of memory:
- Use `onResult` to write results to disk instead of keeping everything in memory
- Reduce concurrency
- Process in smaller batches

Hitting API rate limits:
- Reduce concurrency
- Add rate limiting configuration
- Increase the retry delay

Evaluations timing out:
- Increase the timeout configuration
- Reduce concurrency
- Check API latency

Resumed runs give inconsistent results:
- Ensure the same evaluators and configuration are used
- Verify the input file hasn't changed
- Check that `startIndex` matches your last processed row + 1