AutoChecklist


AutoChecklist is an open-source library that unifies LLM-based checklist evaluation into composable pipelines. It ships as a pip-installable Python package (autochecklist) with a CLI and a web UI.

Features

  • Five checklist generator abstractions that organize published methods by the reasoning strategy they use to derive evaluation criteria
  • Composable pipelines with eight built-in configurations implementing published methods, all compatible with a unified scorer that consolidates three scoring strategies from the literature
  • CLI for off-the-shelf evaluation with pre-defined pipelines
  • Multi-provider LLM backend with support for OpenAI, OpenRouter, and vLLM
  • Locally hosted UI allowing for interactive prompt customization, pipeline configuration, and batch evaluation

Installation

pip

uv pip install autochecklist

# Optional: install vLLM for offline inference (needs GPU)
uv pip install "autochecklist[vllm]"

From source:

# Clone the repository
git clone https://github.com/ChicagoHAI/AutoChecklist.git
cd AutoChecklist

# Install dependencies
uv sync


# Set up environment
cp .env.example .env
# Edit .env and add your API key(s):
#   OPENROUTER_API_KEY=sk-or-...   (for OpenRouter, the default provider)
#   OPENAI_API_KEY=sk-...          (for direct OpenAI or corpus-level embeddings)

To use in another project:

uv pip install -e /path/to/AutoChecklist

Concepts

Tip

Some terminology:

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.

Checklist Generator Abstractions


The core of the library is five generator classes, each implementing a distinct approach to producing checklists:

| Level | Generator | Approach | Analogy |
| --- | --- | --- | --- |
| Instance | DirectGenerator | Prompt → checklist | Direct inference |
| Instance | ContrastiveGenerator | Candidates → checklist | Counterfactual reasoning |
| Corpus | InductiveGenerator | Observations → criteria | Inductive reasoning (bottom-up) |
| Corpus | DeductiveGenerator | Dimensions → criteria | Deductive reasoning (top-down) |
| Corpus | InteractiveGenerator | Eval sessions → criteria | Protocol analysis |

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

| Config | Description |
| --- | --- |
| mode="batch" | All items in one LLM call (efficient) |
| mode="batch", capture_reasoning=True | Batch with per-item explanations |
| mode="item" | One item per call |
| mode="item", capture_reasoning=True | One item per call with reasoning |
| mode="item", primary_metric="weighted" | Item weights (0-100) for importance |
| mode="item", use_logprobs=True | Logprob confidence calibration |

Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
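To make the Deduplicator's role concrete, here is a conceptual, library-independent sketch of embedding-based merging. This is not AutoChecklist's implementation; embed() is a placeholder for whatever sentence-embedding model you use (e.g., via the OpenAI embeddings API).

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: plug in a real sentence-embedding model here.
    raise NotImplementedError

def deduplicate(items: list[str], threshold: float = 0.9) -> list[str]:
    # Keep an item only if its embedding is not too similar to an already-kept item.
    kept, kept_vecs = [], []
    for item in items:
        v = embed(item)
        v = v / np.linalg.norm(v)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(item)
            kept_vecs.append(v)
    return kept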

Using the Package

Custom Prompts

Write a prompt template and generate a checklist:

from autochecklist import DirectGenerator, ChecklistScorer

gen = DirectGenerator(
    custom_prompt="You are an expert evaluator. Generate yes/no checklist questions to score:\n\n{input}",
    model="openai/gpt-5-mini",
)
checklist = gen.generate(input="Write a haiku about autumn.")

scorer = ChecklistScorer(mode="batch", model="openai/gpt-5-mini")
score = scorer.score(checklist, target="Leaves fall gently down...")
print(f"Pass rate: {score.pass_rate:.0%}")

Scorers accept custom prompts too, and prompts can be loaded from .md files; see Custom Prompts for the full guide (placeholders, custom scorers, registration).
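A minimal sketch of the file-based workflow, reading the template manually for illustration (the path prompts/haiku_checklist.md is hypothetical, and the dedicated loading helpers are covered in the Custom Prompts guide):

from pathlib import Path
from autochecklist import DirectGenerator

# The .md file holds the template text with {input} (and optionally {target}) placeholders.
template = Path("prompts/haiku_checklist.md").read_text()

gen = DirectGenerator(custom_prompt=template, model="openai/gpt-5-mini")
checklist = gen.generate(input="Write a haiku about autumn.")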

Custom Pipelines

Register a custom pipeline (generator + scorer + prompts) as a reusable unit:

from autochecklist import register_custom_pipeline, pipeline

# Register from config
register_custom_pipeline(
    "my_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer="weighted",
)
pipe = pipeline("my_eval", generator_model="openai/gpt-5-mini")

# Or register from an existing pipeline instance
register_custom_pipeline("my_eval_v2", pipe)

# Save/load pipeline configs as JSON
from autochecklist import save_pipeline_config, load_pipeline_config
save_pipeline_config("my_eval", "my_eval.json")
load_pipeline_config("my_eval.json")  # registers and returns the name

Built-in Pipelines

The library includes pipelines implementing methods from research papers. Use them via method_name or the pipeline() shorthand:

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn", target="Leaves fall gently...")
print(f"Pass rate: {result.pass_rate:.0%}")

See Supported Pipelines for the full list of pipelines, paper details, and configuration options.

Batch Evaluation

data = [
    {"input": "Write a haiku", "target": "Leaves fall..."},
    {"input": "Write a limerick", "target": "There once was..."},
]
result = pipe.run_batch(data, show_progress=True)
print(f"Macro pass rate: {result.macro_pass_rate:.0%}")

For pipeline composition, provider configuration, and the full API, see the Pipeline Guide.

Command-Line Interface

Run evaluations directly from the terminal:

# Full evaluation (generate + score)
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini

# Generate checklists only
autochecklist generate --pipeline tick --data inputs.jsonl -o checklists.jsonl \
  --generator-model openai/gpt-4o-mini

# Score with existing checklist
autochecklist score --data eval_data.jsonl --checklist checklist.json \
  -o results.jsonl --scorer-model openai/gpt-4o-mini

# List available pipelines
autochecklist list

API keys can be set via --api-key, environment variables (OPENROUTER_API_KEY), or a .env file. See the CLI Guide for full details.
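Based on the Python batch example above, eval_data.jsonl is expected to hold one JSON object per line; a plausible layout (the exact fields the CLI accepts are spelled out in the CLI Guide) is:

{"input": "Write a haiku about autumn", "target": "Leaves fall gently down..."}
{"input": "Write a limerick about tea", "target": "There once was a cup full of tea..."}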

Examples

Detailed examples with runnable code are included in the repository.

UI

A web interface for demonstrating autochecklist methods. See ui/README.md for details.

Quick Start:

cd ui
./launch_ui.sh
# Frontend: http://localhost:7860
# Backend:  http://localhost:7861

Testing

Warning

Integration tests make real LLM API calls that incur costs. The default fast tests (-m 'not integration and not vllm_offline and not openai_api') are free and use no external services. Only run integration tests if you have the required API keys set and are okay with the associated costs.

# Core library fast tests (recommended default — no API calls)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m 'not integration and not vllm_offline and not openai_api'

# Core integration tests (real API calls — requires OPENROUTER_API_KEY)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m integration

# Backend API tests (no API calls needed)
cd ui/backend
uv run pytest -v -rs tests

See tests/README.md and ui/backend/tests/README.md for details.

Automatic Documentation

Preview the API docs locally with MkDocs:

uv run mkdocs serve                       # http://localhost:8000
uv run mkdocs serve -a localhost:7772      # custom port
uv run mkdocs serve -a 0.0.0.0:7772       # expose on network (e.g., http://your-server:7772)

API reference is auto-generated from docstrings via mkdocs-autoapi. New modules are discovered automatically — no manual page creation needed.

Citation

TBA

License

Apache-2.0 (see LICENSE)
