
Batch Evaluation Guide

This guide explains how to run evaluations on multiple inputs at once using the batch evaluation system.

Table of Contents

  • Quick Start
  • Key Concepts
  • Input Formats
  • Basic Usage
  • Export Options
  • Advanced Features
  • Configuration Reference
  • Examples
  • Best Practices
  • Troubleshooting

Quick Start

The batch evaluation system allows you to process hundreds or thousands of evaluations efficiently with:

  • Controlled concurrency - Limit parallel evaluations
  • Rate limiting - Respect API quotas
  • Progress tracking - Monitor progress in real-time
  • Export formats - CSV and JSON
  • Result streaming - Process results as they complete via callback

5-Minute Example

import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

// 1. Create your evaluator
const qualityEvaluator = new Evaluator({
  name: "quality-check",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate the quality of this text from 1-10.",
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// 2. Create batch evaluator
const batchEvaluator = new BatchEvaluator({
  evaluators: [qualityEvaluator],
  concurrency: 5,
});

// 3. Run evaluations from CSV/JSON
const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
  format: "csv",
});

// 4. Export results
await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
});

console.log(`Processed ${result.totalRows} rows`);
console.log(`Success rate: ${(result.successfulRows / result.totalRows * 100).toFixed(1)}%`);

Key Concepts

Evaluation Prompt vs Generation Prompt

It's important to understand the difference between two types of prompts:

Evaluation Prompt (in Evaluator config)

  • Defines how to evaluate the content
  • Tells the AI what criteria to use
  • Same for all rows in the batch
  • Example: "Rate the translation quality from 1-10"

Generation Prompt (in input data)

  • Describes what prompt generated the content
  • Optional metadata for context
  • Can be different for each row
  • Example: "Translate 'Hello' to French"

// Evaluation prompt - defines the evaluation criteria (same for all)
const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: `Evaluate the translation quality.
Consider accuracy, fluency, and naturalness.
Rate from 1-10.`,
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// Input data - contains the content to evaluate
const inputs = [
  {
    candidateText: "Bonjour",
    referenceText: "Hello",
  },
  {
    candidateText: "Guten Tag",
    referenceText: "Good day",
  },
];

// To include the generation prompt, add it to BatchEvaluatorConfig.defaultInput:
const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  defaultInput: {
    prompt: "Translate the source text",  // Applied to all rows
  },
});

In most cases, you only need the candidateText (and optionally referenceText) in your input data. The generation prompt should be specified in defaultInput if needed for context.

Using Default Input for Common Values

If all your rows share the same generation prompt (or other common fields), you can specify it once in the BatchEvaluatorConfig instead of repeating it in every row:

// All rows were generated with the same prompt
const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 5,
  defaultInput: {
    prompt: "Translate the following text to French",  // Applied to all rows
    contentType: "translation",
  },
});

// Input CSV only needs the varying content
// candidateText,referenceText
// "Bonjour","Hello"
// "Merci","Thank you"

The defaultInput fields are merged with each row's data. If a row specifies its own value for a field, it overrides the default.
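The merge behaves like an object spread, with the row's fields taking precedence. A minimal sketch (mergeRow is a hypothetical helper for illustration, not part of the library API):

```typescript
// Hypothetical illustration of the merge order: defaults are applied first,
// then the row's own fields, so a row value overrides the default for the
// same key while missing fields fall back to the default.
type RowInput = Record<string, unknown>;

function mergeRow(defaultInput: RowInput, row: RowInput): RowInput {
  return { ...defaultInput, ...row };
}

const merged = mergeRow(
  { prompt: "Translate the following text to French", contentType: "translation" },
  { candidateText: "Bonjour", contentType: "greeting" },
);
// merged.contentType comes from the row; merged.prompt from the default
```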


Input Formats

Understanding Input Fields

Your input data contains the content to be evaluated, not the evaluation criteria:

  • candidateText (required) - The text output you want to evaluate
  • referenceText (optional) - Ground truth or expected output for comparison
  • id (optional) - Unique identifier for tracking

Important: The generation prompt (what prompt was used to generate the content) should be specified in BatchEvaluatorConfig.defaultInput.prompt, not in your input file. This avoids repetition when all rows share the same generation prompt.

CSV Files

Minimal example (just the content to evaluate):

candidateText
"The cat sits on the mat"
"Bonjour le monde"
"The weather is nice today"

With reference text for comparison:

candidateText,referenceText
"The cat sits on the mat","The cat is sitting on the mat"
"Bonjour le monde","Hello world"

With additional metadata:

id,candidateText,referenceText,sourceText,language
1,"Bonjour le monde","Hello world","Hello world","French"
2,"Guten Tag","Good day","Good day","German"

JSON Files

Minimal example:

[
  {
    "candidateText": "The cat sits on the mat"
  },
  {
    "candidateText": "Bonjour le monde"
  }
]

With reference text:

[
  {
    "candidateText": "The cat sits on the mat",
    "referenceText": "The cat is sitting on the mat"
  },
  {
    "candidateText": "Bonjour le monde",
    "referenceText": "Hello world"
  }
]

With additional metadata:

[
  {
    "id": "1",
    "candidateText": "Bonjour le monde",
    "referenceText": "Hello world",
    "sourceText": "Hello world",
    "language": "French"
  },
  {
    "id": "2",
    "candidateText": "Guten Tag",
    "referenceText": "Good day",
    "sourceText": "Good day",
    "language": "German"
  }
]

Nested JSON

If your data is nested, use arrayPath:

{
  "metadata": { "version": "1.0" },
  "data": {
    "evaluations": [
      { "candidateText": "Hello", "prompt": "Translate" }
    ]
  }
}

await batchEvaluator.evaluate({
  filePath: "./data.json",
  format: "json",
  jsonOptions: {
    arrayPath: "data.evaluations"
  }
});
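Conceptually, a dotted arrayPath walks the JSON document one key at a time until it reaches an array. A sketch of that resolution (an assumption about the mechanism, not the library's actual parser):

```typescript
// Illustrative resolver for a dotted path like "data.evaluations":
// walk each key in turn, then require the final value to be an array.
function resolveArrayPath(document: unknown, path: string): unknown[] {
  const value = path
    .split(".")
    .reduce<any>((node, key) => (node == null ? undefined : node[key]), document);
  if (!Array.isArray(value)) {
    throw new Error(`arrayPath "${path}" did not resolve to an array`);
  }
  return value;
}

const sampleDoc = {
  metadata: { version: "1.0" },
  data: { evaluations: [{ candidateText: "Hello", prompt: "Translate" }] },
};
const rows = resolveArrayPath(sampleDoc, "data.evaluations");
// rows contains the single evaluation object from data.evaluations
```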

Custom Field Mapping

Map your column names to standard fields:

await batchEvaluator.evaluate({
  filePath: "./custom.csv",
  format: "csv",
  fieldMapping: {
    candidateText: "output_text",  // Map "output_text" column to candidateText
    referenceText: "expected",      // Map "expected" column to referenceText
    sourceText: "input",            // Map "input" column to sourceText
  }
});
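The effect of a field mapping is a simple column rename. A sketch of what happens to each row (applyFieldMapping is a hypothetical helper shown only to clarify the config; the library performs this remapping internally):

```typescript
// Hypothetical sketch: copy each mapped source column into the standard
// field name, skipping columns the row does not have.
function applyFieldMapping(
  row: Record<string, unknown>,
  mapping: Record<string, string>,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [standardField, sourceColumn] of Object.entries(mapping)) {
    if (sourceColumn in row) out[standardField] = row[sourceColumn];
  }
  return out;
}

const mapped = applyFieldMapping(
  { output_text: "Bonjour", expected: "Hello", input: "Hello" },
  { candidateText: "output_text", referenceText: "expected", sourceText: "input" },
);
// mapped has candidateText, referenceText, and sourceText populated
```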

Basic Usage

1. Create Your Evaluator

Define how to evaluate the content. This evaluation prompt is the same for all rows:

import { anthropic } from "@ai-sdk/anthropic";
import { Evaluator } from "eval-kit";

const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  // This evaluation prompt defines the criteria (same for all rows)
  evaluationPrompt: `Evaluate the translation quality.

Candidate: {{candidateText}}
{{#if referenceText}}Reference: {{referenceText}}{{/if}}
{{#if prompt}}Context: {{prompt}}{{/if}}

Consider accuracy, fluency, and naturalness. Rate from 1-10.`,
  scoreConfig: {
    type: "numeric",
    min: 1,
    max: 10,
    float: true,
  },
});

Note: You can use {{prompt}} in your evaluation prompt to include the generation prompt as context, but it's optional.

2. Create Batch Evaluator

import { BatchEvaluator } from "eval-kit";

const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 5,  // Run 5 evaluations at a time
});

3. Run Batch Evaluation

const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
  format: "csv",
});

console.log(`Total: ${result.totalRows}`);
console.log(`Successful: ${result.successfulRows}`);
console.log(`Failed: ${result.failedRows}`);
console.log(`Average Score: ${result.summary.averageScores[evaluator.name]}`);

4. Access Results

// Individual results
for (const row of result.results) {
  console.log(`Row ${row.rowId}:`);
  console.log(`  Score: ${row.results[0].score}`);
  console.log(`  Feedback: ${row.results[0].feedback}`);
}

// Summary statistics
console.log(`Average processing time: ${result.summary.averageProcessingTime}ms`);
console.log(`Total tokens used: ${result.summary.totalTokensUsed}`);
console.log(`Error rate: ${(result.summary.errorRate * 100).toFixed(1)}%`);

Export Options

Export to CSV

await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
  csvOptions: {
    flattenResults: true,     // Flatten evaluator results into columns
    includeHeaders: true,      // Include header row
    delimiter: ",",            // CSV delimiter
  },
});

Output CSV columns:

rowId,rowIndex,candidateText,evaluatorName,score,feedback,success,executionTime
1,0,"Hello world","translation-quality",8.5,"Good translation",true,1234

Export to JSON

await batchEvaluator.export({
  format: "json",
  destination: "./results.json",
  jsonOptions: {
    pretty: true,              // Pretty-print JSON
    includeMetadata: true,     // Include summary metadata
  },
});

Output JSON structure:

{
  "metadata": {
    "exportedAt": "2025-01-01T12:00:00Z",
    "totalResults": 100,
    "successfulResults": 95,
    "failedResults": 5
  },
  "results": [
    {
      "rowId": "1",
      "rowIndex": 0,
      "input": { "candidateText": "..." },
      "results": [
        {
          "evaluatorName": "translation-quality",
          "score": 8.5,
          "feedback": "Good translation",
          "success": true
        }
      ]
    }
  ]
}
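Once exported, the JSON structure above is easy to post-process. A self-contained sketch that averages successful numeric scores for one evaluator (averageScore is an illustrative helper, not a library function; file I/O is omitted):

```typescript
// Average the successful numeric scores for a named evaluator from the
// exported "results" array shown above.
interface ExportedResult {
  results: { evaluatorName: string; score: number | string; success: boolean }[];
}

function averageScore(results: ExportedResult[], evaluatorName: string): number {
  const scores = results
    .flatMap((r) => r.results)
    .filter(
      (e) =>
        e.evaluatorName === evaluatorName && e.success && typeof e.score === "number",
    )
    .map((e) => e.score as number);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// Usage: averageScore(JSON.parse(exportedJson).results, "translation-quality")
```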

Filter Results

Export only specific results:

await batchEvaluator.export({
  format: "csv",
  destination: "./failed-results.csv",
  // Only export failed evaluations
  filterCondition: (result) => result.error !== undefined,
});

Export only specific fields:

await batchEvaluator.export({
  format: "json",
  destination: "./scores-only.json",
  // Only include these fields
  includeFields: ["rowId", "rowIndex", "results"],
});

Advanced Features

Progress Tracking

Monitor progress in real-time using the onProgress callback:

import { BatchEvaluator, type ProgressEvent } from "eval-kit";

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onProgress: (event: ProgressEvent) => {
    const timestamp = new Date().toISOString();
    const pct = event.percentComplete.toFixed(1);
    const eta = event.estimatedTimeRemaining
      ? `ETA: ${Math.round(event.estimatedTimeRemaining / 1000)}s`
      : "";

    console.log(
      `[${timestamp}] ${event.type.toUpperCase()} - ${event.processedRows}/${event.totalRows} (${pct}%) ${eta}`
    );
  },
  progressInterval: 2000,  // Emit progress every 2 seconds
});

ProgressEvent Properties

Property Type Description
type ProgressEventType Event type: started, progress, completed, error, retry, paused, resumed
timestamp string ISO timestamp of the event
totalRows number Total number of rows to process
processedRows number Number of rows processed so far
successfulRows number Number of successful evaluations
failedRows number Number of failed evaluations
currentRow number? Index of the row currently being processed
percentComplete number Percentage complete (0-100)
estimatedTimeRemaining number? Estimated milliseconds remaining
averageProcessingTime number? Average milliseconds per row
currentError string? Error message if type is error
retryCount number? Retry attempt number if type is retry
estimatedCostUSD number? Running total estimated cost in USD
estimatedTokensRemaining number? Estimated tokens remaining to process
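To clarify how the timing fields relate, here is one way an ETA could be derived from the event properties above (an assumption for illustration; the library computes estimatedTimeRemaining itself):

```typescript
// Derive a rough ETA from rows remaining times average time per row.
// Returns undefined when no timing data is available yet.
function etaMs(event: {
  totalRows: number;
  processedRows: number;
  averageProcessingTime?: number;
}): number | undefined {
  if (!event.averageProcessingTime || event.processedRows === 0) return undefined;
  return (event.totalRows - event.processedRows) * event.averageProcessingTime;
}
```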

Example Output

[2025-12-02T17:45:23.456Z] STARTED - 0/5236 (0.0%)
[2025-12-02T17:45:24.789Z] PROGRESS - 50/5236 (1.0%) ETA: 520s
[2025-12-02T17:45:26.123Z] PROGRESS - 100/5236 (1.9%) ETA: 510s
...
[2025-12-02T17:54:12.456Z] COMPLETED - 5236/5236 (100.0%)

Advanced Progress Tracking

For more detailed progress tracking with cost estimation:

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onProgress: (event) => {
    console.log(`Progress: ${event.processedRows}/${event.totalRows}`);
    console.log(`  Success: ${event.successfulRows}, Failed: ${event.failedRows}`);
    console.log(`  ${event.percentComplete.toFixed(1)}% complete`);

    if (event.estimatedTimeRemaining) {
      const minutes = Math.ceil(event.estimatedTimeRemaining / 60000);
      console.log(`  ETA: ~${minutes} minutes`);
    }

    if (event.estimatedCostUSD) {
      console.log(`  Est. cost: $${event.estimatedCostUSD.toFixed(4)}`);
    }
  },
  progressInterval: 2000,  // Emit progress every 2 seconds
});

Result Streaming with onResult

Process each result as it completes using the onResult callback. This is useful for real-time logging, database writes, or custom integrations:

import { appendFileSync } from "fs";

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  // Handle each result as it completes
  onResult: async (result) => {
    // Log to console
    console.log(`Processed row ${result.rowId}: score ${result.results[0]?.score}`);

    // Append to file incrementally
    appendFileSync("./results.jsonl", JSON.stringify(result) + "\n");

    // Or send to database, webhook, etc.
    await saveToDatabase(result);
  },
});

await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
});

Why use onResult:

  • Process results in real-time without waiting for the full batch
  • Write to files incrementally for fault tolerance
  • Send results to external systems as they complete
  • Custom logging or alerting on specific conditions

Rate Limiting

Respect API quotas:

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  rateLimit: {
    maxRequestsPerMinute: 50,   // Max 50 requests per minute
    maxRequestsPerHour: 1000,   // Max 1000 requests per hour
  },
});

The system will automatically pause when limits are reached.
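A sliding-window limiter is one way such pausing can work, sketched below. This is an assumption about the mechanism for illustration; the library's internal implementation may differ:

```typescript
// Track request timestamps inside a rolling window; when the window is
// full, report how long to wait until the oldest request expires.
class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(private maxRequests: number, private windowMs: number) {}

  // Milliseconds to wait before the next request is allowed (0 = go now).
  delayUntilNextSlot(now: number): number {
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length < this.maxRequests) return 0;
    return this.timestamps[0] + this.windowMs - now;
  }

  record(now: number): void {
    this.timestamps.push(now);
  }
}
```

A runner could await delayUntilNextSlot before dispatching each row, which is what produces the automatic pause when a limit is hit.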

Retry Configuration

Handle transient errors automatically:

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  retryConfig: {
    maxRetries: 3,              // Retry up to 3 times
    retryDelay: 1000,           // Wait 1 second between retries
    exponentialBackoff: true,   // Use exponential backoff (1s, 2s, 4s)
    retryOnErrors: [            // Only retry these specific errors
      "rate limit",
      "timeout",
      "ECONNRESET",
    ],
  },
});
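The retry options above combine two simple rules, sketched here with hypothetical helpers (not library APIs): exponential backoff multiplies the base delay by 2 per attempt, and retryOnErrors is matched as a substring against the error message:

```typescript
// With exponentialBackoff, attempt n waits retryDelay * 2^n ms (1s, 2s, 4s, ...).
function backoffDelay(retryDelay: number, attempt: number, exponential: boolean): number {
  return exponential ? retryDelay * 2 ** attempt : retryDelay;
}

// Retry only when the error message contains one of the configured patterns
// (assumed here to be case-insensitive).
function shouldRetry(error: Error, retryOnErrors: string[]): boolean {
  return retryOnErrors.some((needle) =>
    error.message.toLowerCase().includes(needle.toLowerCase()),
  );
}
```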

Resuming Interrupted Batches

Use startIndex to skip rows that were already processed. This works well with the onResult callback to write results incrementally:

import { appendFileSync, readFileSync, existsSync } from "fs";

// Helper to find where we left off
function getLastProcessedIndex(outputFile: string): number {
  if (!existsSync(outputFile)) return 0;
  const content = readFileSync(outputFile, "utf-8").trim();
  if (content === "") return 0;
  // With concurrency > 1, results can complete out of order, so take the
  // highest rowIndex seen rather than the last line's.
  const indices = content
    .split("\n")
    .map((line) => JSON.parse(line).rowIndex as number);
  return Math.max(...indices) + 1;
}

const outputFile = "./results.jsonl";
const startIndex = getLastProcessedIndex(outputFile);

console.log(`Resuming from row ${startIndex}`);

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onResult: (result) => {
    appendFileSync(outputFile, JSON.stringify(result) + "\n");
  },
});

await batchEvaluator.evaluate({
  filePath: "./inputs.json",
  startIndex,  // Skip rows we've already processed
});

Option Location Description
startIndex BatchInputConfig Skip rows before this index (0-based)
appendToExisting BatchExportConfig Append to existing file when using export()

Multiple Evaluators

Run multiple evaluators on each input:

const qualityEvaluator = new Evaluator({ name: "quality", /* ... */ });
const tonalityEvaluator = new Evaluator({ name: "tonality", /* ... */ });
const accuracyEvaluator = new Evaluator({ name: "accuracy", /* ... */ });

const batchEvaluator = new BatchEvaluator({
  evaluators: [qualityEvaluator, tonalityEvaluator, accuracyEvaluator],
  concurrency: 3,
  evaluatorExecutionMode: "parallel",  // Run evaluators in parallel (default)
  // or "sequential" to run one after another
});

const result = await batchEvaluator.evaluate({
  filePath: "./inputs.csv",
});

// Results include all evaluators
console.log(result.summary.averageScores);
// { quality: 8.2, tonality: 7.5, accuracy: 9.1 }

Custom Result Processing

Process each result with custom logic:

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  onResult: async (result) => {
    // Custom processing for each result
    const score = result.results[0]?.score;

    // Log to database
    await db.insert({ rowId: result.rowId, score });

    // Send alert for low scores
    if (typeof score === "number" && score < 5) {
      await sendAlert(`Low score detected: ${result.rowId}`);
    }

    // Update dashboard
    await updateDashboard(result);
  },
});

Timeout Configuration

Set evaluation timeout:

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 5,
  timeout: 60000,  // Fail if evaluation takes longer than 60 seconds
  stopOnError: false,  // Continue even if some evaluations fail
});

Configuration Reference

BatchEvaluatorConfig

interface BatchEvaluatorConfig {
  // Required
  evaluators: Evaluator[];

  // Default input values applied to all rows (optional)
  defaultInput?: {
    prompt?: string;         // Default generation prompt
    referenceText?: string;  // Default reference text
    sourceText?: string;     // Default source text
    contentType?: string;    // Default content type
    language?: string;       // Default language
    [key: string]: unknown;  // Additional default fields
  };

  // Concurrency control (optional)
  concurrency?: number;  // Default: 5
  evaluatorExecutionMode?: "parallel" | "sequential";  // Default: "parallel"
  rateLimit?: {
    maxRequestsPerMinute?: number;
    maxRequestsPerHour?: number;
  };

  // Error handling (optional)
  retryConfig?: {
    maxRetries?: number;        // Default: 3
    retryDelay?: number;         // Default: 1000ms
    exponentialBackoff?: boolean; // Default: true
    retryOnErrors?: string[];    // Specific error messages to retry
  };
  stopOnError?: boolean;  // Default: false
  timeout?: number;       // Per-evaluation timeout in ms

  // Progress tracking (optional)
  onProgress?: (event: ProgressEvent) => void | Promise<void>;
  progressInterval?: number;  // Default: 1000ms

  // Result streaming (optional)
  onResult?: (result: BatchEvaluationResult) => void | Promise<void>;
}

BatchInputConfig

interface BatchInputConfig {
  filePath: string;
  format?: "csv" | "json" | "auto";  // Default: "auto" (detect from extension)

  // Resume support
  startIndex?: number;  // Skip rows before this index (0-based)

  // CSV options
  csvOptions?: {
    delimiter?: string;      // Default: ","
    quote?: string;          // Default: '"'
    escape?: string;         // Default: '"'
    headers?: boolean;       // Default: true
    skipEmptyLines?: boolean; // Default: true
    encoding?: BufferEncoding; // Default: "utf-8"
  };

  // JSON options
  jsonOptions?: {
    arrayPath?: string;      // JSONPath to array (e.g., "data.items")
    encoding?: BufferEncoding; // Default: "utf-8"
  };

  // Field mapping
  fieldMapping?: {
    candidateText: string;  // Required
    referenceText?: string;
    sourceText?: string;
    contentType?: string;
    language?: string;
    id?: string;
  };
}

BatchExportConfig

interface BatchExportConfig {
  format: "csv" | "json";
  destination: string;  // File path

  // Resume support
  appendToExisting?: boolean;  // Append to existing file instead of overwriting (default: false)

  // CSV options
  csvOptions?: {
    delimiter?: string;
    includeHeaders?: boolean;
    flattenResults?: boolean;
  };

  // JSON options
  jsonOptions?: {
    pretty?: boolean;
    includeMetadata?: boolean;
  };

  // Filtering
  includeFields?: string[];
  excludeFields?: string[];
  filterCondition?: (result: BatchEvaluationResult) => boolean;
}

Examples

Example 1: Evaluating LLM Outputs (Common Case)

Most commonly, you'll be evaluating multiple outputs that were all generated with the same prompt. In this case, the generation prompt is not needed in your input data:

Input file: llm-outputs.csv

id,candidateText
1,"The quick brown fox jumps over the lazy dog."
2,"To be or not to be, that is the question."
3,"In a galaxy far, far away..."

Evaluation code:

import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

// Define how to evaluate (same criteria for all)
const grammarEvaluator = new Evaluator({
  name: "grammar-check",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: `Evaluate the grammar and writing quality.

Text: {{candidateText}}

Rate from 1-10 considering:
- Grammar correctness
- Sentence structure
- Clarity

Provide a score and brief feedback.`,
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

// Run batch evaluation
const batchEvaluator = new BatchEvaluator({
  evaluators: [grammarEvaluator],
  concurrency: 5,
});

const result = await batchEvaluator.evaluate({
  filePath: "./llm-outputs.csv",
  format: "csv",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./grammar-scores.csv",
});

console.log(`Evaluated ${result.totalRows} outputs`);

Result: grammar-scores.csv

rowId,candidateText,evaluatorName,score,feedback
1,"The quick brown fox...","grammar-check",9,"Excellent grammar and clarity"
2,"To be or not to be...","grammar-check",10,"Perfect sentence structure"
3,"In a galaxy far...","grammar-check",8,"Good structure, slightly informal"

Example 2: Translation Quality Check

import { anthropic } from "@ai-sdk/anthropic";
import { BatchEvaluator, Evaluator } from "eval-kit";

const evaluator = new Evaluator({
  name: "translation-quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate translation quality 1-10.",
  scoreConfig: { type: "numeric", min: 1, max: 10 },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [evaluator],
  concurrency: 10,
  onProgress: (e) => console.log(`${e.percentComplete.toFixed(1)}%`),
});

const result = await batchEvaluator.evaluate({
  filePath: "./translations.csv",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./results.csv",
});

Example 3: Content Moderation with Real-time Alerts

const moderationEvaluator = new Evaluator({
  name: "content-moderation",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Is this content safe? Respond: SAFE or UNSAFE",
  scoreConfig: { type: "categorical", categories: ["SAFE", "UNSAFE"] },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [moderationEvaluator],
  concurrency: 20,
  // Alert on unsafe content as results come in
  onResult: async (result) => {
    if (result.results[0]?.score === "UNSAFE") {
      await sendAlert(`Unsafe content: ${result.rowId}`);
    }
  },
});

const result = await batchEvaluator.evaluate({
  filePath: "./user-content.json",
});

await batchEvaluator.export({
  format: "csv",
  destination: "./moderation-results.csv",
});

Example 4: Multi-Evaluator Pipeline

const relevanceEvaluator = new Evaluator({
  name: "relevance",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate relevance 1-5",
  scoreConfig: { type: "numeric", min: 1, max: 5 },
});

const qualityEvaluator = new Evaluator({
  name: "quality",
  model: anthropic("claude-3-5-haiku-20241022"),
  evaluationPrompt: "Rate quality 1-5",
  scoreConfig: { type: "numeric", min: 1, max: 5 },
});

const batchEvaluator = new BatchEvaluator({
  evaluators: [relevanceEvaluator, qualityEvaluator],
  concurrency: 5,
  evaluatorExecutionMode: "parallel",
});

const result = await batchEvaluator.evaluate({
  filePath: "./search-results.json",
  jsonOptions: {
    arrayPath: "results.items",
  },
});

// Export full per-evaluator results for downstream analysis
await batchEvaluator.export({
  format: "json",
  destination: "./analysis.json",
  jsonOptions: { pretty: true },
});

console.log("Average Relevance:", result.summary.averageScores.relevance);
console.log("Average Quality:", result.summary.averageScores.quality);

Example 5: High-Volume with Rate Limiting

const batchEvaluator = new BatchEvaluator({
  evaluators: [myEvaluator],
  concurrency: 50,  // High concurrency
  rateLimit: {
    maxRequestsPerMinute: 500,
    maxRequestsPerHour: 10000,
  },
  retryConfig: {
    maxRetries: 5,
    exponentialBackoff: true,
  },
  onProgress: (e) => {
    console.log(`Processed: ${e.processedRows}/${e.totalRows}`);
    if (e.estimatedTimeRemaining) {
      console.log(`ETA: ${Math.ceil(e.estimatedTimeRemaining / 60000)} min`);
    }
  },
});

const result = await batchEvaluator.evaluate({
  filePath: "./large-dataset.csv",
});

console.log(`Processed ${result.totalRows} items in ${result.durationMs / 1000}s`);

Best Practices

  1. Start with low concurrency (5-10) and increase based on API limits
  2. Use onResult callback for large batches to write results incrementally
  3. Enable progress tracking to monitor long-running batches
  4. Set appropriate rate limits based on your API quotas
  5. Use retry configuration to handle transient errors
  6. Test with small batches before running large-scale evaluations

Troubleshooting

High memory usage

  • Use onResult to write results to disk instead of keeping all in memory
  • Reduce concurrency
  • Process in smaller batches

Rate limit errors

  • Reduce concurrency
  • Add rate limiting configuration
  • Increase retry delay

Timeout errors

  • Increase timeout configuration
  • Reduce concurrency
  • Check API latency

Resume not working correctly

  • Ensure the same evaluators and configuration are used
  • Verify input file hasn't changed
  • Check that startIndex matches your last processed row + 1