LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator #89

Open
benglewis wants to merge 23 commits into main from codex/2026-01-12/linear-mention-llm-57-add-cbbq-support

Conversation

Contributor

@benglewis benglewis commented Jan 12, 2026

User description

Motivation

  • Provide first-class preprocessing for CBBQ multiple-choice inputs with stricter typing and validation so downstream evaluators can compute CBBQ-specific metrics.
  • Keep CustomDataset focused on free-text datasets by moving CBBQ logic to a separate loader.
  • Route CBBQ dataset ids to a dedicated evaluator that emits the CBBQ metrics and artifacts.
  • Ensure the CLI/presets support cbbq: shortcuts for easy evaluation runs.

Description

  • Added llm_behavior_eval/evaluation_utils/cbbq_dataset.py which implements cbbq_preprocess_function, strict normalization helpers (_normalize_cbbq_label, _normalize_cbbq_polarity), validate_cbbq_columns, and a CbbqDataset class to load/preprocess CBBQ datasets.
  • Implemented llm_behavior_eval/evaluation_utils/cbbq_evaluator.py as CbbqEvaluator with ambiguous/disambiguated counting classes and metric exports (metrics.csv, responses.json) and MLflow logging hooks.
  • Updated evaluator selection and dataset loading: base_evaluator.prepare_dataloader now picks CbbqDataset when the slug contains cbbq, evaluate.py mapping supports cbbq:bias:<type> / cbbq:unbias:<type>, and evaluate_factory.create_evaluator routes cbbq ids to CbbqEvaluator.
  • Corrected CBBQ_BIAS_TYPES in evaluation_utils/enums.py to the full taxonomy and removed CBBQ-specific logic from custom_dataset.py; also addressed a lint issue by removing an unused import.

Testing

  • Ran ruff format . which reformatted one file and completed successfully.
  • Ran ruff check . and all lint checks passed.
  • Ran pytest in this environment, but collection failed because optional dependencies (pydantic_settings, datasets, torch) are missing, so the full test suite could not be executed here. The new unit tests are in tests/test_cbbq_metrics.py, and the CLI mapping coverage is in tests/test_evaluate_cli.py.
  • Attempted basedpyright but the command is not available in this environment.

Codex Task


Note

Medium Risk
Introduces a new evaluation mode and refactors CLI orchestration to reuse evaluators differently, which can affect dataset routing, output artifacts, and grading behavior across multi-dataset runs.

Overview
Adds first-class support for CBBQ (Chinese contextual bias benchmark) as a multiple-choice evaluation flow, including a new CbbqDataset preprocessor and a generic MultipleChoiceEvaluator base that grades without a judge model and writes CBBQ-specific metrics.csv/responses.json plus per-model summary aggregates.

Integrates CBBQ into the CLI by adding cbbq: behavior presets (short and explicit forms), tagging datasets with a new DatasetConfig.answer_format, and updating the run loop to manage separate evaluator instances per “evaluator family” so mixed CBBQ + non-CBBQ runs don’t reuse incompatible evaluators.

Adds an upload_cbbq_to_hub.py utility to validate/normalize raw CBBQ CSVs and push them to Hugging Face dataset repos, expands enums/docs/tests to cover CBBQ types/metrics, and tightens dev workflow config (direnv + pre-commit paths/stage tweaks).

Written by Cursor Bugbot for commit 907328e. This will update automatically on new commits. Configure here.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Add a dedicated CBBQ evaluation stack by introducing the multiple-choice loader/evaluator pair, deterministic grading flow, and enriched DatasetConfig metadata so bias/unbias runs emit the expected metrics, summaries, and MLflow entries. Document and wire the new presets, enums, CLI routing, and conversion/upload tooling so users can invoke cbbq: behaviors, push normalized datasets to Hugging Face, and keep the tightened dev workflow aligned.

Topic: Tooling & dataset publishing
Details: Harden tooling, dataset publishing, and developer workflows by refreshing the environment guidance, dependency locking, and git/direnv hooks, adding the CBBQ conversion/upload scripts (with their tests) plus helpful import safety checks, so reproducing/pushing CBBQ data and running checks works reliably.
Modified files (9)
  • .envrc
  • .gitignore
  • .pre-commit-config.yaml
  • AGENTS.md
  • dataset_processing_scripts/convert_cbbq_eval_output.py
  • dataset_processing_scripts/upload_cbbq_to_hub.py
  • pyproject.toml
  • tests/test_upload_cbbq_to_hub.py
  • tests/test_util_functions.py
Latest Contributors (2)
User | Commit | Date
orr@hirundo.io | LLM-75-Fix-Result-Dire... | March 01, 2026
blewis@hirundo.io | LLM-38-Create-Developm... | November 18, 2025

Topic: Other
Details: Other files
Modified files (1)
  • uv.lock
Latest Contributors (2)
User | Commit | Date
orr@hirundo.io | LLM-74-Upgrade-vllm-Re... | February 23, 2026
github-actions[bot] | Bump-version-to-0.1.6b... | February 19, 2026

Topic: CLI & docs
Details: Integrate CLI routing/presets by teaching evaluate.py about the cbbq: shortcuts, normalizing dataset types via the expanded enums, routing to CbbqEvaluator through EvaluateFactory, and validating the behavior through README updates and CLI tests so users can request CBBQ bias/unbias flows that resolve to the correct multiple-choice answer format.
Modified files (5)
  • README.md
  • llm_behavior_eval/evaluate.py
  • llm_behavior_eval/evaluation_utils/enums.py
  • llm_behavior_eval/evaluation_utils/evaluate_factory.py
  • tests/test_evaluate_cli.py
Latest Contributors (2)
User | Commit | Date
orr@hirundo.io | LLM-75-Fix-Result-Dire... | March 01, 2026
mishana4life@gmail.com | LLM-64-Change-results-... | February 16, 2026

Topic: CBBQ evaluation core
Details: Build the new CBBQ evaluation core by adding CbbqDataset, CbbqEvaluator, and the shared MultipleChoiceEvaluator, updating BaseEvaluator/CustomDataset to pick the right loader, tracking DatasetConfig.answer_format, and covering the flow through dataset/evaluator/unit tests so both splits emit the canonical metrics and summaries.
Modified files (10)
  • llm_behavior_eval/evaluation_utils/base_evaluator.py
  • llm_behavior_eval/evaluation_utils/cbbq_dataset.py
  • llm_behavior_eval/evaluation_utils/cbbq_evaluator.py
  • llm_behavior_eval/evaluation_utils/custom_dataset.py
  • llm_behavior_eval/evaluation_utils/dataset_config.py
  • llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py
  • tests/test_cbbq_dataset.py
  • tests/test_cbbq_metrics.py
  • tests/test_dataset_config.py
  • tests/test_multiple_choice_evaluator.py
Latest Contributors (2)
User | Commit | Date
orr@hirundo.io | LLM-76-Add-option-to-r... | March 02, 2026
blewis@hirundo.io | LLM-68-Remove-empty-co... | February 19, 2026

@benglewis benglewis changed the title from "Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator" to "LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator" Jan 12, 2026
@benglewis benglewis self-assigned this Jan 12, 2026
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
…bbq-support

# Conflicts:
#	llm_behavior_eval/evaluate.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/enums.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py
Contributor

baz-reviewer Bot commented Feb 25, 2026

Spec Reviewer Report    📪 ✅

Checkout in Baz

All 3 Identified Requirements Met for Ticket:

Add CBBQ support


3 met requirements
1. Requirement: Evaluator supports CBBQ behavior presets, dataset type inference for unbias splits, and CLI help text updates
   Explanation: The CLI now exposes both short and explicit CBBQ presets, the evaluator maps those presets to `hirundo-io/cbbq-…-multi-choice` datasets, infers UNBIAS splits via `_infer_dataset_type`, routes such datasets through the CBBQ-specific loader/evaluator, and the help/README describe the new flags.
   Evidence:
     • llm_behavior_eval/evaluate.py:78-164 – `_behavior_presets` now recognizes CBBQ short/explicit presets and returns `hirundo-io/cbbq---multi-choice` IDs.
     • llm_behavior_eval/evaluate.py:188-205 & 584-594 – `_infer_dataset_type` switches to `DatasetType.UNBIAS` for `-unbias-` datasets and dataset creation marks `/cbbq-` runs with the MULTIPLE_CHOICE format before passing them to `EvaluateFactory`.
     • llm_behavior_eval/evaluation_utils/evaluate_factory.py:25-37 & llm_behavior_eval/evaluation_utils/base_evaluator.py:170-214 – CBBQ dataset IDs instantiate `CbbqEvaluator` and load `CbbqDataset`, ensuring the new format flows through the dedicated loader and metrics logic.
     • README.md:45-63 – README documents the new short/explicit CBBQ CLI presets so users know how to request bias/unbias splits.
     • tests/test_evaluate_cli.py:389-452 – CLI tests assert that CBBQ behaviors produce the expected file paths, dataset types, and multiple-choice answer formats.

2. Requirement: Route CBBQ datasets through the bias evaluator and add summary labeling
   Explanation: CBBQ dataset selectors now build DatasetConfig entries that drive EvaluateFactory to instantiate CbbqEvaluator and the base evaluator loads CbbqDataset, while summary helpers emit “CBBQ: …” labels and write CBBQ-specific CSV summaries.
   Evidence:
     • llm_behavior_eval/evaluate.py:575-607 – DatasetConfig marks /cbbq- files as MULTIPLE_CHOICE and EvaluateFactory branches to CbbqEvaluator
     • llm_behavior_eval/evaluation_utils/base_evaluator.py:170-214 – CBBQ files instantiate CbbqDataset so bias evaluator preprocessing/loader paths are executed
     • llm_behavior_eval/evaluation_utils/base_evaluator.py:394-444 – infer_bias_label_from_slug now emits “CBBQ: ” for summary_brief rows
     • llm_behavior_eval/evaluation_utils/cbbq_evaluator.py:368-415 – finalize_artifacts writes summary_cbbq.csv and aggregated cbbq_overall_summary.csv with dataset_type/dimension labels

3. Requirement: CBBQ bias types and behavior presets defined and tested
   Explanation: CBBQ bias-type enums plus CLI preset handling and answer-format wiring are implemented and exercised by CLI tests, fulfilling the requirement.
   Evidence:
     • llm_behavior_eval/evaluation_utils/enums.py:25-67 defines CBBQ_BIAS_TYPES, CBBQ_BASIC_TYPES, and CBBQ_ALL_TYPES for the new benchmark
     • llm_behavior_eval/evaluate.py:78-185 expands CLI preset parsing to support CBBQ short/explicit forms and maps them to multi-choice datasets
     • tests/test_evaluate_cli.py:389-498 asserts behavior presets route to the correct CBBQ dataset slugs and set AnswerFormat.MULTIPLE_CHOICE
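
The slug-based inference cited in the evidence above can be illustrated with a small sketch. It is not the code in evaluate.py: only the `-unbias-` and `/cbbq-` substrings, `DatasetType.UNBIAS`, and `AnswerFormat.MULTIPLE_CHOICE` come from the report. The enum values, the `FREE_TEXT` member, the BIAS default, the un-prefixed function names, and the dataset ids in the usage note are assumptions made for illustration.

```python
from __future__ import annotations

from enum import Enum


class DatasetType(Enum):
    BIAS = "bias"      # enum values are assumed
    UNBIAS = "unbias"


class AnswerFormat(Enum):
    FREE_TEXT = "free_text"  # assumed name for the non-CBBQ default
    MULTIPLE_CHOICE = "multiple_choice"


def infer_dataset_type(dataset_id: str) -> DatasetType:
    """Treat ids containing '-unbias-' as UNBIAS; default to BIAS (assumed default)."""
    return DatasetType.UNBIAS if "-unbias-" in dataset_id else DatasetType.BIAS


def infer_answer_format(dataset_id: str) -> AnswerFormat:
    """Mark ids containing '/cbbq-' as multiple-choice."""
    return AnswerFormat.MULTIPLE_CHOICE if "/cbbq-" in dataset_id else AnswerFormat.FREE_TEXT
```

With a hypothetical id such as `example-org/cbbq-unbias-demo-multi-choice`, both helpers fire: the dataset type resolves to UNBIAS and the answer format to MULTIPLE_CHOICE.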

Note: Some optional integrations are missing, so it might not be possible to check some of the requirements.
For best results, make sure the following are integrated: Figma



Used resources:
Hash: 9cddd15 | Ticket: link

To rerun the Spec Reviewer, comment "baz rerun spec review".

Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py
Comment thread dataset_processing_scripts/upload_cbbq_to_hub.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py Outdated
@benglewis benglewis marked this pull request as ready for review February 25, 2026 19:33
Comment thread dataset_processing_scripts/upload_cbbq_to_hub.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4810e6d2eb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluate.py
Comment thread llm_behavior_eval/evaluate.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py
Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py


def test_generate_and_grade_shared_flow(tmp_path) -> None:
    evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path)
Contributor

❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

Finding type: Log Error

Contributor

Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator.grade to _grade_impl, satisfying BaseEvaluator's requirement and allowing _DummyMultipleChoiceEvaluator instantiation without abstract method errors.



def test_evaluate_runs_generate_free_grade_in_order(tmp_path) -> None:
    evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path)
Contributor

❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

Finding type: Log Error

Contributor

Commit f202eb6 addressed this comment by providing a concrete _grade_impl implementation in MultipleChoiceEvaluator, fulfilling the abstract BaseEvaluator._grade_impl contract so _DummyMultipleChoiceEvaluator can be instantiated without having to override it.

elif "cbbq" in dataset_id:
    from .cbbq_evaluator import CbbqEvaluator

    return CbbqEvaluator(eval_config, dataset_config)
Contributor

❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

Finding type: Log Error

Contributor

Commit 1a499e8 addressed this comment by introducing a concrete grade() implementation in BaseEvaluator that delegates to a new abstract _grade_impl(), and updating each evaluator (including FreeTextBiasEvaluator, the parent of CbbqEvaluator) to implement _grade_impl(), allowing CbbqEvaluator to satisfy the abstract requirements.
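
The delegation described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: only the names `BaseEvaluator`, `grade`, `_grade_impl`, `FreeTextBiasEvaluator`, and `CbbqEvaluator` come from the thread, while the signatures, the placeholder grading logic, and the return type are invented for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Base class whose concrete grade() wraps an abstract hook."""

    def grade(self, generations: list[str]) -> list[int | None]:
        # Concrete entry point: shared steps would live here before delegating.
        return self._grade_impl(generations)

    @abstractmethod
    def _grade_impl(self, generations: list[str]) -> list[int | None]:
        """Subclasses supply the actual grading logic."""


class FreeTextBiasEvaluator(BaseEvaluator):
    def _grade_impl(self, generations: list[str]) -> list[int | None]:
        # Placeholder grading (assumed): one ungraded entry per generation.
        return [None for _ in generations]


class CbbqEvaluator(FreeTextBiasEvaluator):
    """Inherits a concrete _grade_impl, so it can be instantiated."""
```

Because the abstract requirement is met by the parent class, `CbbqEvaluator()` constructs without error, while `BaseEvaluator()` still raises `TypeError`; that runtime behavior corresponds to the condition Pyright's reportAbstractUsage check flags statically.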

    dataset_type=DatasetType.BIAS,
    preprocess_config=PreprocessConfig(),
)
evaluator = CbbqEvaluator(eval_config, dataset_config)
Contributor

❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

Finding type: Log Error

Contributor

Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator’s grade method to implement _grade_impl, fulfilling BaseEvaluator’s abstract requirement so CbbqEvaluator can be instantiated without Pyright errors.

Comment on lines +47 to +51
def _normalize_label(raw_value: object) -> int | None:
    """Normalize parsed CBBQ label values to ``{0,1,2}``.

    Args:
        raw_value: Raw value from ``responses.json``.
Contributor

_normalize_label reimplements the same {0,1,2} parsing rules already defined in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_label. If that original helper ever tightens or loosens what counts as valid, we now need to keep two copies in sync which will diverge and cause this converter to drop/keep different rows than the dataset loader; can we reuse the existing helper (or extract a shared utility) instead of duplicating the logic?

Finding type: Code Dedup and Conventions
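
One way to act on this finding is a single shared helper that both the dataset loader and the converter import. The sketch below is hypothetical: the public name `normalize_cbbq_label` and the acceptance rules (plain ints and the strings "0", "1", "2" are valid; bools and everything else are rejected) are assumptions, since the thread does not show the body of `_normalize_cbbq_label`.

```python
from __future__ import annotations

VALID_CBBQ_LABELS = frozenset({0, 1, 2})


def normalize_cbbq_label(raw_value: object) -> int | None:
    """Return raw_value as a label in {0, 1, 2}, or None if it is not one.

    Assumed rules (not taken from the PR): ints and the digit strings
    "0"/"1"/"2" are accepted; bools and all other values are rejected.
    """
    if isinstance(raw_value, bool):
        # bool is a subclass of int, so reject it explicitly.
        return None
    if isinstance(raw_value, int):
        candidate = raw_value
    elif isinstance(raw_value, str) and raw_value.strip() in {"0", "1", "2"}:
        candidate = int(raw_value.strip())
    else:
        return None
    return candidate if candidate in VALID_CBBQ_LABELS else None
```

With one definition, a later change to what counts as a valid label applies to the loader and the converter at the same time.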

Comment on lines +70 to +74
def _normalize_polarity(raw_value: object) -> str | None:
    """Normalize raw polarity values to CBBQ tokens.

    Args:
        raw_value: Raw ``question_polarity`` value from ``responses.json``.
Contributor

_normalize_polarity duplicates the normalization rules that live in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_polarity; any change (e.g., allowing new string aliases) would have to be repeated here, risking metric mismatch between the converter and the dataset loader. Can we factor this into a shared helper or import the existing one instead of duplicating it?

Finding type: Code Dedup and Conventions

Comment on lines +223 to +227
for response_row in responses:
    predicted_label = _normalize_label(response_row.get("predicted_label"))
    if predicted_label is None:
        continue

Contributor

_convert_bias_outputs now continues when _normalize_label rejects predicted_label, so outputs.csv omits those samples even though examples (derived from responses/num_samples) covers every request; downstream CBBQ eval expects one output row per sample and now gets fewer rows than metrics.csv claims, breaking alignment—can we keep a row (e.g. blank generated) instead of skipping it?

Finding type: Logical Bugs


Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 223 to 227, the
_convert_bias_outputs function currently does `if predicted_label is None: continue`,
which drops samples and breaks alignment with metrics.csv. Change this so it does not
continue; instead, append a row for that sample with "generated" set to an empty string
(or another explicit placeholder), "label" set to the normalized gold_label if present
or empty string, and "raw_output" set to the generated_text. Ensure you only use
LABEL_TO_LETTER when predicted_label is not None to avoid KeyError. This preserves one
output row per input sample and keeps outputs.csv aligned with examples/metrics.
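
The row-preserving behavior requested above might look like the sketch below. It is an illustration under stated assumptions rather than the converter's actual code: the `LABEL_TO_LETTER` contents, the stand-in `_as_label` helper, and the un-prefixed function name are hypothetical. The field names `predicted_label`, `gold_label`, `generated_text`, `generated`, `label`, and `raw_output` come from the comment.

```python
from __future__ import annotations

LABEL_TO_LETTER = {0: "A", 1: "B", 2: "C"}  # assumed mapping


def _as_label(raw_value: object) -> int | None:
    """Stand-in for _normalize_label: accept only the ints 0, 1, 2."""
    if isinstance(raw_value, int) and not isinstance(raw_value, bool):
        return raw_value if raw_value in LABEL_TO_LETTER else None
    return None


def convert_bias_outputs(responses: list[dict[str, object]]) -> list[dict[str, str]]:
    """Emit exactly one output row per response, even for unparseable predictions."""
    rows: list[dict[str, str]] = []
    for response_row in responses:
        predicted_label = _as_label(response_row.get("predicted_label"))
        gold_label = _as_label(response_row.get("gold_label"))
        rows.append(
            {
                # Look up the letter only when the prediction parsed, avoiding KeyError.
                "generated": "" if predicted_label is None else LABEL_TO_LETTER[predicted_label],
                "label": "" if gold_label is None else str(gold_label),
                "raw_output": str(response_row.get("generated_text", "")),
            }
        )
    return rows
```

The number of returned rows always equals `len(responses)`, which keeps outputs.csv aligned with the example count reported in metrics.csv.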

Comment on lines +342 to +353
target_dir = output_root / "disambiguous" / run_folder.bias_type
outputs_df = _convert_disamb_outputs(responses)
converted_metrics = pd.DataFrame(
    [
        {
            "bias": run_folder.bias_type,
            "model": model_tag,
            "examples": examples,
            "accuracy": _metric_as_float(
                metrics_row,
                "disambiguated_accuracy",
                0.0,
Contributor

Disambiguated metrics are emitted with a disamb_bias_score column while the documented/expected schema (README lines 69‑71 and the upstream CBBQ eval_disamb.py) uses disambiguated_bias_score, so downstream consumers looking for the canonical column no longer see the bias score; can we keep the documented column name?

Finding type: Breaking Changes


Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 342 to 360, the
else branch of the _convert_one_run function constructs converted_metrics using the key
'disamb_bias_score', which breaks the expected schema. Change that dictionary key to
'disambiguated_bias_score' so the CSV column matches the documented/upstream schema
(optionally also emit an alias column for backward compatibility). Ensure the value
still comes from _metric_as_float(metrics_row, 'disambiguated_bias_score', 0.0).
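
A minimal sketch of the rename with an optional backward-compatible alias follows. The `metric_as_float` stand-in, the `disamb_metric_columns` helper, and the `keep_alias` flag are hypothetical; only the two column names and the 0.0 default come from the comment.

```python
from __future__ import annotations


def metric_as_float(metrics_row: dict[str, object], key: str, default: float) -> float:
    """Stand-in for _metric_as_float: read a metric, falling back to default."""
    try:
        return float(metrics_row.get(key, default))  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default


def disamb_metric_columns(
    metrics_row: dict[str, object], *, keep_alias: bool = False
) -> dict[str, float]:
    """Build the bias-score columns using the documented column name."""
    score = metric_as_float(metrics_row, "disambiguated_bias_score", 0.0)
    columns = {"disambiguated_bias_score": score}
    if keep_alias:
        # Optional alias for consumers that already read the short name.
        columns["disamb_bias_score"] = score
    return columns
```

Emitting the alias is a judgment call: it avoids breaking existing readers of the short name, at the cost of a redundant column.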

Comment on lines +414 to +425
model_results_dir: Annotated[
    Path,
    typer.Option(
        "--model-results-dir",
        help=(
            "Path to one model results directory (contains cbbq-* run folders)."
        ),
    ),
],
output_root: Annotated[
    Path,
    typer.Option(
Contributor

--output-root now defaults to Path("converted_cbbq") and --model-tag falls back to the model folder name, yet their Typer help strings omit those defaults so users must read code to know the behavior; can we mention the defaults (and platform/location when relevant) in the help text exactly where the options are defined?

Finding type: Document parameters clearly


Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 414 to 439, the
Typer option help strings for output_root and model_tag omit their defaults. Update the
Typer.Option help for --output-root to mention that it defaults to
Path("converted_cbbq") (i.e., a folder named converted_cbbq created in the current
working directory) and update the --model-tag help to state it defaults to the model
results folder name when not provided. Keep the rest of the option definitions unchanged
and ensure the messages remain concise and user-facing.

Comment on lines +18 to +29
CBBQ_MAX_ANSWER_TOKENS = 128
_CBBQ_ANSWER_RE = re.compile(r"(?<![A-Za-z0-9])([ABC])(?![A-Za-z0-9])", re.IGNORECASE)


class CbbqSampleMetadata(TypedDict):
    """Typed per-sample metadata extracted from CBBQ batches."""

    gold_label: int
    question_polarity: int


def extract_cbbq_prediction(generated_text: str) -> int | None:
Contributor

New helper extract_cbbq_prediction is documented with a single summary line but no Args/Returns sections, so consumers can't see the expected input type or the None/label output without reading the implementation; can we add the required docstring sections (and mention generated_text: str and the int | None return) per the documentation rule?

Finding type: Document parameters clearly | Severity: 🟢 Low


Prompt for AI Agents:

In llm_behavior_eval/evaluation_utils/cbbq_evaluator.py around lines 18 to 41, the
helper function extract_cbbq_prediction currently has only a one-line summary. Update
its docstring to include explicit Args and Returns sections: document generated_text:
str as the input parameter, and document the return as int | None (explain that it
returns a label id 0/1/2 when a valid A/B/C choice is found and returns None when no
valid match is found). Keep the existing short summary and add a one-line description of
the regex-based extraction behavior.
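
A docstring in the requested shape could look like the sketch below. The regex is copied from the diff hunk above; the function body, the choice of the first match, and the `_LETTER_TO_LABEL` mapping are assumptions, since the thread does not show the implementation.

```python
from __future__ import annotations

import re

_CBBQ_ANSWER_RE = re.compile(r"(?<![A-Za-z0-9])([ABC])(?![A-Za-z0-9])", re.IGNORECASE)
_LETTER_TO_LABEL = {"A": 0, "B": 1, "C": 2}  # assumed letter-to-id mapping


def extract_cbbq_prediction(generated_text: str) -> int | None:
    """Extract the predicted CBBQ option from model output.

    Uses a regex to find the first standalone ``A``, ``B``, or ``C``
    (case-insensitive, not adjacent to other ASCII letters or digits).

    Args:
        generated_text: Raw text generated by the model.

    Returns:
        The label id (0, 1, or 2) for A, B, or C, or ``None`` when no
        valid choice is found.
    """
    match = _CBBQ_ANSWER_RE.search(generated_text)
    if match is None:
        return None
    return _LETTER_TO_LABEL[match.group(1).upper()]
```

Because the lookarounds only exclude ASCII letters and digits, a choice letter embedded in Chinese text still matches, while letters inside an English word such as "Cabbage" do not.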


2 participants