AutoChecklist is an open-source library that unifies LLM-based checklist evaluation into composable pipelines, in a pip-installable Python package (autochecklist) with CLI and UI features.
- Five checklist generator abstractions that organize methods from research by their reasoning strategies for deriving evaluation criteria
- Composable pipelines with ten built-in configurations implementing published methods, compatible with a unified scorer that consolidates three scoring strategies from the literature
- CLI for off-the-shelf evaluation with pre-defined pipelines
- Multi-provider LLM backend with support for OpenAI, OpenRouter, and vLLM
- Locally hosted UI allowing for interactive prompt customization, pipeline configuration, and batch evaluation
uv pip install autochecklist
# Optional: install vLLM for offline inference (needs GPU)
uv pip install "autochecklist[vllm]"
# Clone the repository
git clone https://github.com/ChicagoHAI/AutoChecklist.git
cd AutoChecklist
# Install dependencies
uv sync
# Set up environment
cp .env.example .env
# Edit .env and add your API key(s):
# OPENROUTER_API_KEY=sk-or-... (for OpenRouter, the default provider)
# OPENAI_API_KEY=sk-... (for direct OpenAI or corpus-level embeddings)
To use in another project:
uv pip install -e /path/to/AutoChecklist
Tip
Some terminology:
- input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
- target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
- reference: An optional gold-standard response used by some methods to improve checklist generation.
The core of the library is 5 generator classes, each implementing a distinct approach to producing checklists:
| Level | Generator | Approach | Analogy |
|---|---|---|---|
| Instance | DirectGenerator | Prompt → checklist | Direct inference |
| Instance | ContrastiveGenerator | Candidates → checklist | Counterfactual reasoning |
| Corpus | InductiveGenerator | Observations → criteria | Inductive reasoning (bottom-up) |
| Corpus | DeductiveGenerator | Dimensions → criteria | Deductive reasoning (top-down) |
| Corpus | InteractiveGenerator | Eval sessions → criteria | Protocol analysis |
Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.
Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
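As a rough illustration of how a template with {input} and {target} placeholders gets filled, here is a minimal sketch using Python's built-in str.format. The template text and the render_prompt helper are hypothetical and are not part of autochecklist; the library's own templates and substitution logic may differ.

```python
# Hypothetical template text; the built-in .md templates may look different.
TEMPLATE = """\
You are writing an evaluation checklist.

Task given to the model:
{input}

Response to evaluate:
{target}

List yes/no questions that a good response must satisfy.
"""


def render_prompt(template: str, **fields: str) -> str:
    """Fill {name} placeholders in the template with the supplied values."""
    return template.format(**fields)


prompt = render_prompt(
    TEMPLATE,
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
print(prompt)
```

With str.format, any literal braces in a template would need to be doubled ({{ and }}); whether that applies to the library's templates depends on how it performs substitution.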
The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, OpenRubrics, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.
A single configurable ChecklistScorer class supports all scoring modes:
| Config | Description |
|---|---|
| mode="batch" | All items in one LLM call (efficient) |
| mode="batch", capture_reasoning=True | Batch with per-item explanations |
| mode="item" | One item per call |
| mode="item", capture_reasoning=True | One item per call with reasoning |
| mode="item", primary_metric="weighted" | Item weights (0-100) for importance |
| mode="item", use_logprobs=True | Logprob confidence calibration |
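To make the metrics behind these modes concrete, the sketch below shows one plausible way to aggregate per-item verdicts into an unweighted pass rate, a weight-normalized score, and a yes-probability derived from token log-probabilities. The ItemResult type and all three functions are assumptions made for illustration; they are not ChecklistScorer internals, and the library's exact formulas may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Hypothetical per-item verdict; not the library's result type."""
    question: str
    passed: bool
    weight: float = 1.0  # importance on a 0-100 scale when weights are used


def pass_rate(items):
    """Unweighted fraction of checklist items that passed."""
    if not items:
        return 0.0
    return sum(i.passed for i in items) / len(items)


def weighted_score(items):
    """Sum of weights of passed items divided by the sum of all weights."""
    total = sum(i.weight for i in items)
    if total == 0:
        return 0.0
    return sum(i.weight for i in items if i.passed) / total


def yes_probability(logprob_yes, logprob_no):
    """Normalize the log-probabilities of 'yes' and 'no' into P(yes)."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)


items = [
    ItemResult("Has three lines?", True, weight=80),
    ItemResult("Mentions autumn?", True, weight=60),
    ItemResult("Follows 5-7-5 syllables?", False, weight=100),
]
print(f"pass rate: {pass_rate(items):.0%}")       # 2 of 3 items -> 67%
print(f"weighted:  {weighted_score(items):.0%}")  # 140 of 240 -> 58%
print(f"P(yes):    {yes_probability(-0.1, -2.4):.2f}")
```

The weighted score drops below the plain pass rate here because the single failed item carries the largest weight.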
Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:
- Deduplicator — merges semantically similar items via embeddings
- Tagger — filters by applicability and specificity
- UnitTester — validates that items are enforceable
- Selector — picks a diverse subset via beam search
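The embedding-based deduplication idea can be sketched in a few lines. This is a simplified stand-in, not the Deduplicator's implementation: it uses hand-written toy vectors instead of real embeddings, an assumed similarity threshold of 0.9, and it drops near-duplicates greedily rather than merging them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(items, embeddings, threshold=0.9):
    """Keep an item only if it is below the threshold against every kept item."""
    kept = []  # indices of items retained so far
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return [items[k] for k in kept]


items = [
    "Does the response have three lines?",
    "Is the response exactly three lines long?",
    "Does the response mention autumn?",
]
# Toy vectors standing in for real sentence embeddings (assumption).
embeddings = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.02],
    [0.10, 0.90, 0.20],
]
unique_items = deduplicate(items, embeddings)
print(unique_items)
```

The first two questions have nearly parallel vectors, so only the first is kept alongside the autumn question. The greedy pass is order-dependent, which is one reason a real refiner might prefer clustering and merging.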
from autochecklist import pipeline
pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently down...")
print(f"Pass rate: {result.pass_rate:.0%}")
See the Quick Start guide for custom prompts, batch evaluation, and more.
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
--generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
See the CLI guide for all commands.
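The eval_data.jsonl file passed to --data can be prepared with a short script. The sketch below assumes one JSON object per line with keys named after the terminology above (input, target, and an optional reference); those key names are an assumption here, so check the CLI guide for the schema the command actually expects. The write_jsonl and read_jsonl helpers are illustrative and not part of the library.

```python
import json

# Assumed record shape: keys mirror the input/target/reference terminology.
records = [
    {
        "input": "Write a haiku about autumn.",
        "target": "Leaves fall gently down...",
    },
    {
        "input": "Summarize the water cycle in one sentence.",
        "target": "Water evaporates, forms clouds, and falls as rain.",
        "reference": "Water evaporates, condenses into clouds, and returns as precipitation.",
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


write_jsonl("eval_data.jsonl", records)
print(len(read_jsonl("eval_data.jsonl")), "records written")
```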
A web interface for demonstrating autochecklist methods. See ui/README.md for details.
autochecklist ui # or: cd ui && ./launch_ui.sh
autochecklist ui --dev # development mode (hot-reload)
The ui subcommand is only available from a source checkout.
Warning
Integration tests make real LLM API calls that incur costs. The default fast tests (-m 'not integration and not vllm_offline and not openai_api') are free and use no external services. Only run integration tests if you have the required API keys set and are okay with the associated costs.
# Core library fast tests (recommended default — no API calls)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m 'not integration and not vllm_offline and not openai_api'
# Core integration tests (real API calls — requires OPENROUTER_API_KEY)
uv run pytest -v -rs tests --ignore=ui/backend/tests -m integration
# Backend API tests (no API calls needed)
cd ui/backend
uv run pytest -v -rs tests
See tests/README.md and ui/backend/tests/README.md for details.
Preview the API docs locally with MkDocs:
uv run mkdocs serve # http://localhost:8000
uv run mkdocs serve -a localhost:7772 # custom port
uv run mkdocs serve -a 0.0.0.0:7772 # expose on network (e.g., http://your-server:7772)
API reference is auto-generated from docstrings via mkdocs-autoapi. New modules are discovered automatically; no manual page creation is needed.
TBA
Apache-2.0 (see LICENSE)


