LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator#89
LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator#89benglewis wants to merge 23 commits into
Conversation
|
Your team has hit this month's review limit. Upgrade your subscription to unlock unlimited reviews and supercharge your code review workflow. |
1 similar comment
|
Your team has hit this month's review limit. Upgrade your subscription to unlock unlimited reviews and supercharge your code review workflow. |
…bbq-support # Conflicts: # llm_behavior_eval/evaluate.py
…bbq-support # Conflicts: # tests/test_evaluate_cli.py
…instead of `-multi-choice` and missing docs in README.md
Spec Reviewer Report 📪 ✅All 3 Identified Requirements Met for Ticket:
3 met requirements
Used resources: |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4810e6d2eb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
…bbq-support # Conflicts: # .gitignore
|
|
||
|
|
||
| def test_generate_and_grade_shared_flow(tmp_path) -> None: | ||
| evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path) |
There was a problem hiding this comment.
❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)
Finding type: Log Error
There was a problem hiding this comment.
Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator.grade to _grade_impl, satisfying BaseEvaluator's requirement and allowing _DummyMultipleChoiceEvaluator instantiation without abstract method errors.
|
|
||
|
|
||
| def test_evaluate_runs_generate_free_grade_in_order(tmp_path) -> None: | ||
| evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path) |
There was a problem hiding this comment.
❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)
Finding type: Log Error
There was a problem hiding this comment.
Commit f202eb6 addressed this comment by providing a concrete _grade_impl implementation in MultipleChoiceEvaluator, fulfilling the abstract BaseEvaluator._grade_impl contract so _DummyMultipleChoiceEvaluator can be instantiated without having to override it.
| elif "cbbq" in dataset_id: | ||
| from .cbbq_evaluator import CbbqEvaluator | ||
|
|
||
| return CbbqEvaluator(eval_config, dataset_config) |
There was a problem hiding this comment.
❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)
Finding type: Log Error
There was a problem hiding this comment.
Commit 1a499e8 addressed this comment by introducing a concrete grade() implementation in BaseEvaluator that delegates to a new abstract _grade_impl(), and updating each evaluator (including FreeTextBiasEvaluator, the parent of CbbqEvaluator) to implement _grade_impl(), allowing CbbqEvaluator to satisfy the abstract requirements.
| dataset_type=DatasetType.BIAS, | ||
| preprocess_config=PreprocessConfig(), | ||
| ) | ||
| evaluator = CbbqEvaluator(eval_config, dataset_config) |
There was a problem hiding this comment.
❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)
Finding type: Log Error
There was a problem hiding this comment.
Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator’s grade method to implement _grade_impl, fulfilling BaseEvaluator’s abstract requirement so CbbqEvaluator can be instantiated without Pyright errors.
| def _normalize_label(raw_value: object) -> int | None: | ||
| """Normalize parsed CBBQ label values to ``{0,1,2}``. | ||
|
|
||
| Args: | ||
| raw_value: Raw value from ``responses.json``. |
There was a problem hiding this comment.
_normalize_label reimplements the same {0,1,2} parsing rules already defined in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_label. If that original helper ever tightens or loosens what counts as valid, we now need to keep two copies in sync which will diverge and cause this converter to drop/keep different rows than the dataset loader; can we reuse the existing helper (or extract a shared utility) instead of duplicating the logic?
Finding type: Code Dedup and Conventions
| def _normalize_polarity(raw_value: object) -> str | None: | ||
| """Normalize raw polarity values to CBBQ tokens. | ||
|
|
||
| Args: | ||
| raw_value: Raw ``question_polarity`` value from ``responses.json``. |
There was a problem hiding this comment.
_normalize_polarity duplicates the normalization rules that live in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_polarity; any change (e.g., allowing new string aliases) would have to be repeated here, risking metric mismatch between the converter and the dataset loader. Can we factor this into a shared helper or import the existing one instead of duplicating it?
Finding type: Code Dedup and Conventions
| for response_row in responses: | ||
| predicted_label = _normalize_label(response_row.get("predicted_label")) | ||
| if predicted_label is None: | ||
| continue | ||
|
|
There was a problem hiding this comment.
_convert_bias_outputs now continues when _normalize_label rejects predicted_label, so outputs.csv omits those samples even though examples (derived from responses/num_samples) covers every request; downstream CBBQ eval expects one output row per sample and now gets fewer rows than metrics.csv claims, breaking alignment—can we keep a row (e.g. blank generated) instead of skipping it?
Finding type: Logical Bugs
Prompt for AI Agents:
In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 223 to 227, the
_convert_bias_outputs function currently does `if predicted_label is None: continue`,
which drops samples and breaks alignment with metrics.csv. Change this so it does not
continue; instead, append a row for that sample with "generated" set to an empty string
(or another explicit placeholder), "label" set to the normalized gold_label if present
or empty string, and "raw_output" set to the generated_text. Ensure you only use
LABEL_TO_LETTER when predicted_label is not None to avoid KeyError. This preserves one
output row per input sample and keeps outputs.csv aligned with examples/metrics.
| target_dir = output_root / "disambiguous" / run_folder.bias_type | ||
| outputs_df = _convert_disamb_outputs(responses) | ||
| converted_metrics = pd.DataFrame( | ||
| [ | ||
| { | ||
| "bias": run_folder.bias_type, | ||
| "model": model_tag, | ||
| "examples": examples, | ||
| "accuracy": _metric_as_float( | ||
| metrics_row, | ||
| "disambiguated_accuracy", | ||
| 0.0, |
There was a problem hiding this comment.
Disambiguated metrics are emitted with a disamb_bias_score column while the documented/expected schema (README lines 69‑71 and the upstream CBBQ eval_disamb.py) uses disambiguated_bias_score, so downstream consumers looking for the canonical column no longer see the bias score; can we keep the documented column name?
Finding type: Breaking Changes
Prompt for AI Agents:
In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 342 to 360, the
else branch of the _convert_one_run function constructs converted_metrics using the key
'disamb_bias_score', which breaks the expected schema. Change that dictionary key to
'disambiguated_bias_score' so the CSV column matches the documented/upstream schema
(optionally also emit an alias column for backward compatibility). Ensure the value
still comes from _metric_as_float(metrics_row, 'disambiguated_bias_score', 0.0).
| model_results_dir: Annotated[ | ||
| Path, | ||
| typer.Option( | ||
| "--model-results-dir", | ||
| help=( | ||
| "Path to one model results directory (contains cbbq-* run folders)." | ||
| ), | ||
| ), | ||
| ], | ||
| output_root: Annotated[ | ||
| Path, | ||
| typer.Option( |
There was a problem hiding this comment.
--output-root now defaults to Path("converted_cbbq") and --model-tag falls back to the model folder name, yet their Typer help strings omit those defaults so users must read code to know the behavior; can we mention the defaults (and platform/location when relevant) in the help text exactly where the options are defined?
Finding type: Document parameters clearly
Prompt for AI Agents:
In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 414 to 439, the
Typer option help strings for output_root and model_tag omit their defaults. Update the
Typer.Option help for --output-root to mention that it defaults to
Path("converted_cbbq") (i.e., a folder named converted_cbbq created in the current
working directory) and update the --model-tag help to state it defaults to the model
results folder name when not provided. Keep the rest of the option definitions unchanged
and ensure the messages remain concise and user-facing.
…for the correct answer and increase `max_answer_tokens` to match original CBBQ repository
…tion This is probably wrong, but necessary to match broken upstream at this stage
| CBBQ_MAX_ANSWER_TOKENS = 128 | ||
| _CBBQ_ANSWER_RE = re.compile(r"(?<![A-Za-z0-9])([ABC])(?![A-Za-z0-9])", re.IGNORECASE) | ||
|
|
||
|
|
||
| class CbbqSampleMetadata(TypedDict): | ||
| """Typed per-sample metadata extracted from CBBQ batches.""" | ||
|
|
||
| gold_label: int | ||
| question_polarity: int | ||
|
|
||
|
|
||
| def extract_cbbq_prediction(generated_text: str) -> int | None: |
There was a problem hiding this comment.
New helper extract_cbbq_prediction is documented with a single summary line but no Args/Returns sections, so consumers can't see the expected input type or the None/label output without reading the implementation; can we add the required docstring sections (and mention generated_text: str and the int | None return) per the documentation rule?
Finding type: Document parameters clearly | Severity: 🟢 Low
Prompt for AI Agents:
In llm_behavior_eval/evaluation_utils/cbbq_evaluator.py around lines 18 to 41, the
helper function extract_cbbq_prediction currently has only a one-line summary. Update
its docstring to include explicit Args and Returns sections: document generated_text:
str as the input parameter, and document the return as int | None (explain that it
returns a label id 0/1/2 when a valid A/B/C choice is found and returns None when no
valid match is found). Keep the existing short summary and add a one-line description of
the regex-based extraction behavior.
User description
Motivation
CustomDatasetfocused on free-text datasets by moving CBBQ logic to a separate loader.cbbq:shortcuts for easy evaluation runs.Description
llm_behavior_eval/evaluation_utils/cbbq_dataset.pywhich implementscbbq_preprocess_function, strict normalization helpers (_normalize_cbbq_label,_normalize_cbbq_polarity),validate_cbbq_columns, and aCbbqDatasetclass to load/preprocess CBBQ datasets.llm_behavior_eval/evaluation_utils/cbbq_evaluator.pyasCbbqEvaluatorwith ambiguous/disambiguated counting classes and metric exports (metrics.csv,responses.json) and MLflow logging hooks.base_evaluator.prepare_dataloadernow picksCbbqDatasetwhen the slug containscbbq,evaluate.pymapping supportscbbq:bias:<type>/cbbq:unbias:<type>, andevaluate_factory.create_evaluatorroutescbbqids toCbbqEvaluator.CBBQ_BIAS_TYPESinevaluation_utils/enums.pyto the full taxonomy and removed CBBQ-specific logic fromcustom_dataset.py; also addressed a lint issue by removing an unused import.Testing
ruff format .which reformatted one file and completed successfully.ruff check .and all lint checks passed.pytestin this environment but collection failed due to missing optional dependencies (pydantic_settings,datasets,torch) so the full test suite could not be executed here; the new unit tests added aretests/test_cbbq_metrics.pyand the CLI mapping coverage intests/test_evaluate_cli.py.basedpyrightbut the command is not available in this environment.Codex Task
Note
Medium Risk
Introduces a new evaluation mode and refactors CLI orchestration to reuse evaluators differently, which can affect dataset routing, output artifacts, and grading behavior across multi-dataset runs.
Overview
Adds first-class support for CBBQ (Chinese contextual bias benchmark) as a multiple-choice evaluation flow, including a new
CbbqDatasetpreprocessor and a genericMultipleChoiceEvaluatorbase that grades without a judge model and writes CBBQ-specificmetrics.csv/responses.jsonplus per-model summary aggregates.Integrates CBBQ into the CLI by adding
cbbq:behavior presets (short and explicit forms), tagging datasets with a newDatasetConfig.answer_format, and updating the run loop to manage separate evaluator instances per “evaluator family” so mixed CBBQ + non-CBBQ runs don’t reuse incompatible evaluators.Adds a
upload_cbbq_to_hub.pyutility to validate/normalize raw CBBQ CSVs and push them to HuggingFace dataset repos, expands enums/docs/tests to cover CBBQ types/metrics, and tightens dev workflow config (direnv + pre-commit paths/stage tweaks).Written by Cursor Bugbot for commit 907328e. This will update automatically on new commits. Configure here.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Add a dedicated CBBQ evaluation stack by introducing the multiple-choice loader/evaluator pair, deterministic grading flow, and enriched
DatasetConfigmetadata so bias/unbias runs emit the expected metrics, summaries, and MLflow entries. Document and wire the new presets, enums, CLI routing, and conversion/upload tooling so users can invokecbbq:behaviors, push normalized datasets to Hugging Face, and keep the tightened dev workflow aligned.Modified files (9)
Latest Contributors(2)
Modified files (1)
Latest Contributors(2)
evaluate.pyabout thecbbq:shortcuts, normalizing dataset types via the expanded enums, routing toCbbqEvaluatorthroughEvaluateFactory, and validating the behavior through README updates and CLI tests so users can request CBBQ bias/unbias flows that resolve to the correct multiple-choice answer format.Modified files (5)
Latest Contributors(2)
CbbqDataset,CbbqEvaluator, and the sharedMultipleChoiceEvaluator, updatingBaseEvaluator/CustomDatasetto pick the right loader, trackingDatasetConfig.answer_format, and covering the flow through dataset/evaluator/unit tests so both splits emit the canonical metrics and summaries.Modified files (10)
Latest Contributors(2)