(paper coming soon)
End‑to‑end tools to:
- Generate underspecified math problems from GSM8K.
- Benchmark models on an oracle‑clarification task.
- Python 3.10+
- Install package (editable):
pip install -e .
- Set the OpenRouter API key (or copy .env.example to .env):
export OPENROUTER_API_KEY=your_key_here
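Before launching a long run, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not part of the package: `load_openrouter_config` is a hypothetical helper name, and the base URL is the default stated in this README.

```python
import os

# Default base URL as stated in this README.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def load_openrouter_config(env=None):
    """Return (api_key, base_url), or raise if the key is missing.

    Hypothetical helper for illustration; the package may read its
    configuration differently.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it or add it to .env"
        )
    return api_key, DEFAULT_BASE_URL
```

Calling `load_openrouter_config()` with no arguments reads from the current process environment, so it fails fast if the `export` step was skipped.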
Command:
python -m gsm_u.generate \
--input openai/gsm8k \
--data-split train \
--output output/dataset/underspecified.jsonl \
--n 2000 \
--underspec-model google/gemini-2.5-flash \
--verifier-model-a openai/gpt-5-mini \
--verifier-model-b google/gemini-2.5-flash \
--clarifier-model google/gemini-2.5-flash
Notes:
- --input accepts either a local JSONL path or a Hugging Face dataset id (default: openai/gsm8k).
- When using a dataset id, --data-split chooses train or test.
- Output files:
  - Main items: output/dataset/underspecified.jsonl
  - Raw agent outputs: output/dataset/underspecified.jsonl.debug.jsonl
Expected input fields when providing a local JSONL: id, question, answer.
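A local input file can be checked for these fields before generation starts. This is a rough sketch under the assumption that each line holds one JSON object; `check_input_jsonl` is a hypothetical helper, not a function shipped with the package.

```python
import json

# Field names taken from the README's description of local JSONL input.
REQUIRED_FIELDS = ("id", "question", "answer")


def check_input_jsonl(lines):
    """Return a list of (line_number, problem) for malformed records.

    `lines` is any iterable of strings, such as an open file handle.
    Blank lines are skipped.
    """
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "record is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            problems.append((n, "missing fields: " + ", ".join(missing)))
    return problems
```

For a file on disk, pass the open handle directly, e.g. `with open(path, encoding="utf-8") as f: print(check_input_jsonl(f))`. An empty list means every record has all three fields.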
Prepare a dataset of underspecified items (e.g., from the generation step or dataset/gsm-u.jsonl). Then:
python -m gsm_u.benchmark \
--model openai/gpt-4o-mini \
--dataset-path dataset/gsm-u.jsonl \
--max-turns 10 \
--out-dir results
Outputs (under results/run_<timestamp>_<model>/):
- results.jsonl: per‑item logs and outcomes
- metrics.json: aggregate metrics (success rate, coverage@1, precision@1, etc.)
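To inspect a finished run, the two output files can be loaded with the standard library. The sketch below assumes only the directory layout described above (a `run_*` directory containing `metrics.json` and `results.jsonl`); the helper names are hypothetical, and the keys inside each file are not assumed.

```python
import json
from pathlib import Path


def latest_run_dir(out_dir):
    """Return the most recently modified run_* directory, or None."""
    runs = [p for p in Path(out_dir).glob("run_*") if p.is_dir()]
    if not runs:
        return None
    # Modification time is used instead of the directory name because
    # the exact timestamp format is not documented here.
    return max(runs, key=lambda p: p.stat().st_mtime)


def load_run(run_dir):
    """Load metrics.json as a dict and results.jsonl as a list of records."""
    run_dir = Path(run_dir)
    metrics = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
    with (run_dir / "results.jsonl").open(encoding="utf-8") as f:
        results = [json.loads(line) for line in f if line.strip()]
    return metrics, results
```

For example, `metrics, results = load_run(latest_run_dir("results"))` followed by `print(metrics)` shows the aggregate numbers for the newest run.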
- Models: any OpenRouter model id is accepted, e.g. openai/gpt-4o-mini, google/gemini-2.5-flash, deepseek/deepseek-r1.
- Environment: OPENROUTER_API_KEY must be set; default base URL is https://openrouter.ai/api/v1.
- Generation produces items with:
  - underspecified_question (str)
  - missing_informations (list[str], length 1–2)
  - number_of_information_reduced (int, equals list length)
- Numeric guard ensures no new numbers are introduced in the underspecified question.
- Writes are incremental; you can stop/resume safely. Check the .debug.jsonl file to inspect raw model outputs.
- Items marked solvable by any verifier are skipped by design.
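The item constraints and the numeric guard described in these notes can be illustrated in a few lines. This is a sketch of the idea only, not the package's implementation: `check_item`, `extract_numbers`, and `new_numbers` are hypothetical names, and the guard here only looks at digit-based numbers (it ignores number words such as "two").

```python
import re

# Matches integers and decimals such as "12" or "3.50".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values written with digits in `text`."""
    # Remove thousands separators so "1,000" is read as 1000.
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return {float(m) for m in NUMBER_RE.findall(text)}


def new_numbers(original, underspecified):
    """Numbers in the underspecified question that the original lacks."""
    return sorted(extract_numbers(underspecified) - extract_numbers(original))


def check_item(item):
    """Return a list of violations of the item constraints listed above."""
    errors = []
    missing = item.get("missing_informations")
    count = item.get("number_of_information_reduced")
    if not isinstance(item.get("underspecified_question"), str):
        errors.append("underspecified_question must be a string")
    if not (isinstance(missing, list) and all(isinstance(m, str) for m in missing)):
        errors.append("missing_informations must be a list of strings")
    elif not 1 <= len(missing) <= 2:
        errors.append("missing_informations must have 1 or 2 entries")
    elif count != len(missing):
        errors.append(
            "number_of_information_reduced must equal len(missing_informations)"
        )
    return errors
```

An item passes the guard when `new_numbers(original_question, item["underspecified_question"])` returns an empty list, and it satisfies the schema when `check_item(item)` returns an empty list.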
See LICENSE for details.