husayni/gsm-u

GSM‑U: Generation and Benchmarking

(paper coming soon)

End‑to‑end tools to:

  1. Generate underspecified math problems from GSM8K.
  2. Benchmark models on an oracle‑clarification task.

Quick Start

  • Python 3.10+
  • Install package (editable):
pip install -e .
  • Set your OpenRouter API key (or copy .env.example to .env and fill it in):
export OPENROUTER_API_KEY=your_key_here

1) Generate the GSM‑U Dataset

Command:

python -m gsm_u.generate \
  --input openai/gsm8k \
  --data-split train \
  --output output/dataset/underspecified.jsonl \
  --n 2000 \
  --underspec-model google/gemini-2.5-flash \
  --verifier-model-a openai/gpt-5-mini \
  --verifier-model-b google/gemini-2.5-flash \
  --clarifier-model google/gemini-2.5-flash

Notes:

  • --input accepts either a local JSONL path or a Hugging Face dataset id (default: openai/gsm8k).
  • When using a dataset id, --data-split chooses train or test.
  • Output files:
    • Main items: output/dataset/underspecified.jsonl
    • Raw agent outputs: output/dataset/underspecified.jsonl.debug.jsonl

Expected input fields when providing a local JSONL: id, question, answer.
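Before starting a long generation run on a local file, it can help to confirm that every row carries the three expected fields. The following is a minimal sketch, not part of the gsm_u package; the helper name find_invalid_rows is hypothetical and only the field names come from this README.

```python
import json

# Field names taken from the README; everything else here is illustrative.
REQUIRED_FIELDS = ("id", "question", "answer")


def find_invalid_rows(lines):
    """Return (line_number, missing_fields) for each row lacking a required field."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            problems.append((line_number, missing))
    return problems


# Two sample rows; the second one has no "answer" field.
sample = [
    '{"id": "1", "question": "Q?", "answer": "#### 4"}',
    '{"id": "2", "question": "Q2?"}',
]
print(find_invalid_rows(sample))  # → [(2, ['answer'])]
```

To check a real file, pass an open file handle instead of the sample list, since iterating over a file yields its lines.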

2) Run the Oracle‑Clarification Benchmark

Prepare a dataset of underspecified items (e.g., from the generation step or dataset/gsm-u.jsonl). Then:

python -m gsm_u.benchmark \
  --model openai/gpt-4o-mini \
  --dataset-path dataset/gsm-u.jsonl \
  --max-turns 10 \
  --out-dir results

Outputs (under results/run_<timestamp>_<model>/):

  • results.jsonl: per‑item logs and outcomes
  • metrics.json: aggregate metrics (success rate, coverage@1, precision@1, etc.)

Configuration

  • Models: any OpenRouter model id is accepted, e.g. openai/gpt-4o-mini, google/gemini-2.5-flash, deepseek/deepseek-r1.
  • Environment: OPENROUTER_API_KEY must be set; default base URL is https://openrouter.ai/api/v1.
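A missing API key is easier to diagnose before any model call is made. The sketch below shows one way to fail fast. It assumes nothing about how gsm_u reads its settings internally; load_openrouter_settings is a hypothetical helper, and only the variable name and default base URL are taken from this README.

```python
import os

# Default base URL as stated in the README.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def load_openrouter_settings(env=None):
    """Return the API key and base URL, raising early if the key is absent.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {"api_key": api_key, "base_url": DEFAULT_BASE_URL}


settings = load_openrouter_settings({"OPENROUTER_API_KEY": "your_key_here"})
print(settings["base_url"])  # → https://openrouter.ai/api/v1
```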

Data Schema and Guards

  • Generation produces items with:
    • underspecified_question (str)
    • missing_informations (list[str], length 1–2)
    • number_of_information_reduced (int, equals list length)
  • A numeric guard ensures that the underspecified question introduces no new numbers.
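The schema constraints and the numeric guard can be illustrated with a short sketch. This is an assumption about how such checks might look, not the package's actual implementation: validate_item, extract_numbers, and introduces_new_numbers are hypothetical names, and the guard here compares numbers as strings, so "5" and "5.0" count as different.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")
# Matches a comma used as a thousands separator, e.g. the one in "1,000".
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}\b)")


def extract_numbers(text):
    """Return the set of numeric literals in text, ignoring thousands separators."""
    return set(NUMBER_PATTERN.findall(THOUSANDS_COMMA.sub("", text)))


def introduces_new_numbers(original, underspecified):
    """True if the underspecified question contains a number absent from the original."""
    return bool(extract_numbers(underspecified) - extract_numbers(original))


def validate_item(item):
    """Return a list of schema violations for one generated item (empty if valid)."""
    errors = []
    if not isinstance(item.get("underspecified_question"), str):
        errors.append("underspecified_question must be a str")
    missing = item.get("missing_informations")
    if not (isinstance(missing, list) and all(isinstance(m, str) for m in missing)):
        errors.append("missing_informations must be a list[str]")
    elif not 1 <= len(missing) <= 2:
        errors.append("missing_informations must have length 1-2")
    elif item.get("number_of_information_reduced") != len(missing):
        errors.append("number_of_information_reduced must equal the list length")
    return errors


item = {
    "underspecified_question": "Tom has some apples and buys 3 more. How many now?",
    "missing_informations": ["Tom starts with 5 apples"],
    "number_of_information_reduced": 1,
}
print(validate_item(item))  # → []
print(introduces_new_numbers("Tom has 5 apples and buys 3 more.",
                             item["underspecified_question"]))  # → False
```

An underspecified question should only remove information, so any number that appears in it but not in the source question signals that the rewriting model changed the problem rather than withholding part of it.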

Tips

  • Writes are incremental; you can stop/resume safely. Check the .debug.jsonl file to inspect raw model outputs.
  • Items marked solvable by any verifier are skipped by design.
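The stop/resume behavior described above can be pictured as an append-only JSONL file plus a set of already-processed ids. The sketch below is one way such a loop could work, under the assumption that each output row carries the source id; the function names are hypothetical and the package's real logic may differ.

```python
import json
import os
import tempfile


def load_done_ids(path):
    """Collect the ids already written to an output JSONL file."""
    if not os.path.exists(path):
        return set()
    done = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                done.add(json.loads(line)["id"])
    return done


def run_resumable(source_items, path, process):
    """Process items not yet in `path`, appending each result as soon as it is ready."""
    done = load_done_ids(path)
    for item in source_items:
        if item["id"] in done:
            continue  # written by an earlier run
        result = process(item)
        if result is None:
            continue  # e.g. a verifier judged the item still solvable
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(result) + "\n")


# Demonstration with a stand-in for the model pipeline.
calls = []


def fake_process(item):
    calls.append(item["id"])
    return {"id": item["id"], "underspecified_question": "..."}


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "underspecified.jsonl")
    items = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    run_resumable(items[:2], out_path, fake_process)  # first run stops early
    run_resumable(items, out_path, fake_process)      # second run resumes
    done_after = load_done_ids(out_path)

print(calls)  # → ['a', 'b', 'c']
```

Because each result is appended immediately, an interrupted run loses at most the item in flight. In this sketch skipped items leave no trace in the output, so they would be reprocessed on resume.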

License

See LICENSE for details.

About

Novel benchmark for underspecified queries
