GitHub - husayni/gsm-u: Novel benchmark for underspecified queries

GSM‑U: Generation and Benchmarking

(paper coming soon)

End‑to‑end tools to:

- Generate underspecified math problems from GSM8K.
- Benchmark models on an oracle‑clarification task.

Quick Start

Requires Python 3.10+.
Install the package in editable mode:

pip install -e .

Set OpenRouter API key (or copy .env.example to .env):

export OPENROUTER_API_KEY=your_key_here

1) Generate the GSM‑U Dataset

Command:

python -m gsm_u.generate \
  --input openai/gsm8k \
  --data-split train \
  --output output/dataset/underspecified.jsonl \
  --n 2000 \
  --underspec-model google/gemini-2.5-flash \
  --verifier-model-a openai/gpt-5-mini \
  --verifier-model-b google/gemini-2.5-flash \
  --clarifier-model google/gemini-2.5-flash

Notes:

--input accepts either a local JSONL path or a Hugging Face dataset id (default: openai/gsm8k).
When using a dataset id, --data-split chooses train or test.
Output files:
- Main items: output/dataset/underspecified.jsonl
- Raw agent outputs: output/dataset/underspecified.jsonl.debug.jsonl

Expected input fields when providing a local JSONL: id, question, answer.
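As an illustration of the local-input format, the following sketch writes a JSONL file whose records carry the three fields listed above. The helper name `write_local_input` is hypothetical and not part of `gsm_u`; the value types and the answer format the tool expects are assumptions, since only the field names are documented here.

```python
import json
from pathlib import Path

# Field names taken from the README; everything else here is illustrative.
REQUIRED_FIELDS = ("id", "question", "answer")


def write_local_input(records, path):
    """Write records as JSONL after checking each one has the required fields.

    Hypothetical helper, not part of gsm_u.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            missing = [k for k in REQUIRED_FIELDS if k not in rec]
            if missing:
                raise ValueError(
                    f"record {rec.get('id')!r} is missing fields: {missing}"
                )
            # One JSON object per line is what makes the file JSONL.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return path
```

A file produced this way, for example `write_local_input(records, "my_input.jsonl")`, could then be passed to `--input` in place of the Hugging Face dataset id.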

2) Run the Oracle‑Clarification Benchmark

Prepare a dataset of underspecified items (e.g., from the generation step or dataset/gsm-u.jsonl). Then:

python -m gsm_u.benchmark \
  --model openai/gpt-4o-mini \
  --dataset-path dataset/gsm-u.jsonl \
  --max-turns 10 \
  --out-dir results

Outputs (under results/run_<timestamp>_<model>/):

- results.jsonl: per‑item logs and outcomes
- metrics.json: aggregate metrics (success rate, coverage@1, precision@1, etc.)
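To inspect a finished run, something like the sketch below can locate the newest run directory and load both files. The helper names are hypothetical, picking the newest run by modification time is an assumption, and the key names inside `metrics.json` and `results.jsonl` are not documented here, so the code treats them as opaque dictionaries.

```python
import json
from pathlib import Path


def latest_run_dir(out_dir="results"):
    """Return the most recently modified run_* directory, or None if there is none."""
    runs = [p for p in Path(out_dir).glob("run_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def load_run(run_dir):
    """Load metrics.json as a dict and results.jsonl as a list of dicts."""
    run_dir = Path(run_dir)
    metrics = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
    with (run_dir / "results.jsonl").open(encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        results = [json.loads(line) for line in f if line.strip()]
    return metrics, results
```

Calling `load_run(latest_run_dir("results"))` returns the aggregate metrics and the per-item records, which can then be printed or filtered as needed.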

Configuration

Models: any OpenRouter model id is accepted, e.g. openai/gpt-4o-mini, google/gemini-2.5-flash, deepseek/deepseek-r1.
Environment: OPENROUTER_API_KEY must be set; default base URL is https://openrouter.ai/api/v1.
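A quick pre-flight check along these lines can catch a missing key before a long run starts. This is a minimal sketch: `openrouter_settings` is a hypothetical helper, not a `gsm_u` function, and how the package itself reads the key or overrides the base URL is not shown in this README. The bearer-token header is the usual way to authenticate against OpenRouter's API.

```python
import os

# Default base URL as stated in the README.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_settings(env=None):
    """Return the base URL and auth header, or raise if the key is missing.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {
        "base_url": DEFAULT_BASE_URL,
        "headers": {"Authorization": f"Bearer {key}"},
    }
```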

Data Schema and Guards

Generation produces items with:
- underspecified_question (str)
- missing_informations (list[str], length 1–2)
- number_of_information_reduced (int, equals list length)
A numeric guard ensures that the underspecified question introduces no numbers that are absent from the original question.
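The schema and the guard can be pictured with the sketch below. It is one plausible reading, not the package's actual implementation: the function names are hypothetical, and the regex-based notion of "a number" (digits with optional thousands separators and a decimal part) is an assumption.

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. 1,000.5).
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def numbers_in(text):
    """Return the set of numeric tokens in text, with thousands separators removed."""
    return {m.replace(",", "") for m in NUMBER_RE.findall(text)}


def passes_numeric_guard(original, underspecified):
    """True if every number in the underspecified text also appears in the original."""
    return numbers_in(underspecified) <= numbers_in(original)


def validate_item(item):
    """Return a list of schema problems; an empty list means the item looks valid."""
    problems = []
    question = item.get("underspecified_question")
    if not isinstance(question, str) or not question.strip():
        problems.append("underspecified_question must be a non-empty string")
    missing = item.get("missing_informations")
    missing_ok = (
        isinstance(missing, list)
        and 1 <= len(missing) <= 2
        and all(isinstance(m, str) for m in missing)
    )
    if not missing_ok:
        problems.append("missing_informations must be a list of 1-2 strings")
    count = item.get("number_of_information_reduced")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(count, int) or isinstance(count, bool):
        problems.append("number_of_information_reduced must be an int")
    elif isinstance(missing, list) and count != len(missing):
        problems.append("number_of_information_reduced must equal the list length")
    return problems
```

Comparing sets of numeric tokens is deliberately strict: a rewritten question may drop numbers, but any number that was not in the original fails the check.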

Tips

Writes are incremental; you can stop/resume safely. Check the .debug.jsonl file to inspect raw model outputs.
Items marked solvable by any verifier are skipped by design.
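For readers curious how incremental writes make stop/resume safe, here is a minimal sketch of the general pattern: append one line per finished item, and on restart skip the ids already present in the output file. The function names are hypothetical, and both the presence of an `id` field in output items and the way `gsm_u` actually resumes are assumptions.

```python
import json
from pathlib import Path


def done_ids(output_path):
    """Collect ids already written, tolerating a truncated final line."""
    path = Path(output_path)
    if not path.exists():
        return set()
    ids = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                ids.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A run killed mid-write can leave a partial line; ignore it.
                continue
    return ids


def append_item(output_path, item):
    """Append one item as a JSON line and flush so it survives an interruption."""
    with Path(output_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
        f.flush()
```

On restart, a loop would call `done_ids` once and skip any input record whose id is already in the returned set.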

License

See LICENSE for details.

