AutoChecklist


Requires Python 3.10+.

AutoChecklist is an open-source library that unifies LLM-based checklist evaluation into composable pipelines, in a pip-installable Python package (autochecklist) with CLI and UI features.

Features

  • Five checklist generator abstractions that organize methods from research by their reasoning strategies for deriving evaluation criteria
  • Composable pipelines, with ten built-in configurations implementing published methods, compatible with a unified scorer that consolidates three scoring strategies from the literature
  • CLI for off-the-shelf evaluation with pre-defined pipelines
  • Multi-provider LLM backend with support for OpenAI, OpenRouter, and vLLM
  • Locally hosted UI allowing for interactive prompt customization, pipeline configuration, and batch evaluation

Installation

pip

uv pip install autochecklist

# Optional: install vLLM for offline inference (needs GPU)
uv pip install "autochecklist[vllm]"

From source:

# Clone the repository
git clone https://github.com/ChicagoHAI/AutoChecklist.git
cd AutoChecklist

# Install dependencies
uv sync


# Set up environment
cp .env.example .env
# Edit .env and add your API key(s):
#   OPENROUTER_API_KEY=sk-or-...   (for OpenRouter, the default provider)
#   OPENAI_API_KEY=sk-...          (for direct OpenAI or corpus-level embeddings)

To use in another project:

uv pip install -e /path/to/AutoChecklist

Concepts

Tip

Some terminology:

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.

Checklist Generator Abstractions

The core of the library is five generator classes, each implementing a distinct approach to producing checklists:

| Level    | Generator            | Approach                 | Analogy                         |
|----------|----------------------|--------------------------|---------------------------------|
| Instance | DirectGenerator      | Prompt → checklist       | Direct inference                |
| Instance | ContrastiveGenerator | Candidates → checklist   | Counterfactual reasoning        |
| Corpus   | InductiveGenerator   | Observations → criteria  | Inductive reasoning (bottom-up) |
| Corpus   | DeductiveGenerator   | Dimensions → criteria    | Deductive reasoning (top-down)  |
| Corpus   | InteractiveGenerator | Eval sessions → criteria | Protocol analysis               |

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
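As a rough illustration of how a template with {input} and {target} placeholders could be filled in, here is a minimal standalone sketch. It does not use the library: the template text and the `render_prompt` helper are hypothetical, and the built-in templates and their loading logic may differ.

```python
import re

# Hypothetical template text. The library's own templates live in .md files.
TEMPLATE = """\
You are evaluating a response.

## Instruction
{input}

## Response
{target}

List yes/no questions that a good response must satisfy.
"""


def render_prompt(template: str, **fields: str) -> str:
    """Fill {name} placeholders in one pass, leaving unknown names untouched."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return fields.get(name, match.group(0))

    return re.sub(r"\{(\w+)\}", substitute, template)


prompt = render_prompt(
    TEMPLATE,
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
print(prompt)
```

Substituting in a single pass means that braces appearing inside the input or target text are not expanded a second time.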

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, OpenRubrics, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

| Config                                   | Description                            |
|------------------------------------------|----------------------------------------|
| mode="batch"                             | All items in one LLM call (efficient)  |
| mode="batch", capture_reasoning=True     | Batch with per-item explanations       |
| mode="item"                              | One item per call                      |
| mode="item", capture_reasoning=True      | One item per call with reasoning       |
| mode="item", primary_metric="weighted"   | Item weights (0-100) for importance    |
| mode="item", use_logprobs=True           | Logprob confidence calibration         |
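To make the metrics behind these modes concrete, the sketch below shows one plausible way to aggregate per-item results. It is standalone Python, not the ChecklistScorer implementation: `ItemResult`, `pass_rate`, `weighted_score`, and `yes_confidence` are hypothetical names, and the exact formulas the library uses are an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class ItemResult:
    question: str
    passed: bool
    weight: float = 100.0  # importance on an assumed 0-100 scale


def pass_rate(items: list[ItemResult]) -> float:
    """Fraction of checklist items answered 'yes'."""
    if not items:
        raise ValueError("checklist is empty")
    return sum(item.passed for item in items) / len(items)


def weighted_score(items: list[ItemResult]) -> float:
    """Share of total weight carried by the items that passed."""
    total = sum(item.weight for item in items)
    if total == 0:
        raise ValueError("total weight is zero")
    return sum(item.weight for item in items if item.passed) / total


def yes_confidence(logprob_yes: float, logprob_no: float) -> float:
    """Normalize the 'yes' and 'no' token logprobs into P(yes)."""
    peak = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    p_yes = math.exp(logprob_yes - peak)
    p_no = math.exp(logprob_no - peak)
    return p_yes / (p_yes + p_no)


items = [
    ItemResult("Does the response have three lines?", True, 100),
    ItemResult("Does it use autumn imagery?", True, 50),
    ItemResult("Does it follow a 5-7-5 syllable pattern?", False, 50),
]
print(f"pass rate: {pass_rate(items):.2f}")            # 0.67
print(f"weighted score: {weighted_score(items):.2f}")  # 0.75
print(f"P(yes): {yes_confidence(math.log(0.6), math.log(0.2)):.2f}")  # 0.75
```

The weighted score lets a few important items dominate the result, whereas the plain pass rate treats every item equally.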

Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
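As an illustration of the embedding-based deduplication idea, the sketch below greedily keeps an item only if it is not too similar to anything already kept. It is a simplified stand-in, not the Deduplicator itself: the toy 2-D vectors, the 0.9 threshold, and the choice to drop later duplicates rather than merge them are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(
    items: list[str], embeddings: list[list[float]], threshold: float = 0.9
) -> list[str]:
    """Keep the first item of each group of near-duplicates."""
    kept: list[int] = []  # indices of items retained so far
    for index, embedding in enumerate(embeddings):
        if all(cosine(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(index)
    return [items[i] for i in kept]


items = [
    "Is the response polite?",
    "Is the tone courteous?",
    "Does it cite sources?",
]
# Toy embeddings: the first two point in nearly the same direction.
embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]

unique_items = deduplicate(items, embeddings)
print(unique_items)
```

With real sentence embeddings the vectors would have hundreds of dimensions, but the comparison logic stays the same.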

Quick Start

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently down...")
print(f"Pass rate: {result.pass_rate:.0%}")

See the Quick Start guide for custom prompts, batch evaluation, and more.
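For several examples at once, a plain loop over single calls is one option. The sketch below relies only on the call shape shown above (`pipe(input=..., target=...)` returning an object with `pass_rate`). The `evaluate_batch` helper and `stub_pipe` are hypothetical, and the stub stands in for a real pipeline so the code runs without API keys. The library may offer its own batch helpers, which this does not try to reproduce.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable


def evaluate_batch(pipe: Callable[..., Any], records: Iterable[dict]) -> list[float]:
    """Call the pipeline once per record and collect each pass rate."""
    return [pipe(input=r["input"], target=r["target"]).pass_rate for r in records]


def stub_pipe(*, input: str, target: str) -> SimpleNamespace:
    """Stand-in pipeline: passes everything if the target is non-empty."""
    return SimpleNamespace(pass_rate=1.0 if target.strip() else 0.0)


records = [
    {"input": "Write a haiku about autumn.", "target": "Leaves fall gently down..."},
    {"input": "Summarize the article.", "target": ""},
]

rates = evaluate_batch(stub_pipe, records)
mean_rate = sum(rates) / len(rates)
print(f"Mean pass rate: {mean_rate:.0%}")  # Mean pass rate: 50%
```

In real use, `stub_pipe` would be replaced by the object returned from `pipeline(...)`, and each call would incur LLM API costs.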

CLI

autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini

See the CLI guide for all commands.
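A data file such as eval_data.jsonl can be produced with a few lines of Python. The sketch below assumes each line is a JSON object with "input" and "target" keys, mirroring the terminology in the Concepts section. The exact field names the CLI expects are an assumption, so check the CLI guide before relying on them.

```python
import json
from pathlib import Path

# Assumed record shape: one JSON object per line with "input" and "target".
records = [
    {"input": "Write a haiku about autumn.", "target": "Leaves fall gently down..."},
    {"input": "Explain recursion in one sentence.", "target": "A function that calls itself."},
]


def write_jsonl(path: str, rows: list[dict]) -> None:
    """Write one JSON object per line, UTF-8 encoded."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("eval_data.jsonl", records)
```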

UI

A web interface for demonstrating autochecklist methods. See ui/README.md for details.

autochecklist ui          # or: cd ui && ./launch_ui.sh
autochecklist ui --dev    # development mode (hot-reload)

The ui subcommand is only available from a source checkout.

Testing

Warning

Integration tests make real LLM API calls that incur costs. The default fast tests (-m 'not integration and not vllm_offline and not openai_api') are free and use no external services. Only run integration tests if you have the required API keys set and are okay with the associated costs.

# Core library fast tests (recommended default — no API calls)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m 'not integration and not vllm_offline and not openai_api'

# Core integration tests (real API calls — requires OPENROUTER_API_KEY)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m integration

# Backend API tests (no API calls needed)
cd ui/backend
uv run pytest -v -rs tests

See tests/README.md and ui/backend/tests/README.md for details.

Automatic Documentation

Preview the API docs locally with MkDocs:

uv run mkdocs serve                       # http://localhost:8000
uv run mkdocs serve -a localhost:7772      # custom port
uv run mkdocs serve -a 0.0.0.0:7772       # expose on network (e.g., http://your-server:7772)

API reference is auto-generated from docstrings via mkdocs-autoapi. New modules are discovered automatically — no manual page creation needed.

Citation

TBA

License

Apache-2.0 (see LICENSE)
