LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator by benglewis · Pull Request #89 · Hirundo-io/llm-behavior-eval

benglewis · 2026-01-12T19:53:14Z

User description

Motivation

Provide first-class preprocessing for CBBQ multiple-choice inputs with stricter typing and validation so downstream evaluators can compute CBBQ-specific metrics.
Keep CustomDataset focused on free-text datasets by moving CBBQ logic to a separate loader.
Route CBBQ dataset ids to a dedicated evaluator that emits the CBBQ metrics and artifacts.
Ensure the CLI/presets support cbbq: shortcuts for easy evaluation runs.

Description

Added llm_behavior_eval/evaluation_utils/cbbq_dataset.py which implements cbbq_preprocess_function, strict normalization helpers (_normalize_cbbq_label, _normalize_cbbq_polarity), validate_cbbq_columns, and a CbbqDataset class to load/preprocess CBBQ datasets.
Implemented llm_behavior_eval/evaluation_utils/cbbq_evaluator.py as CbbqEvaluator with ambiguous/disambiguated counting classes and metric exports (metrics.csv, responses.json) and MLflow logging hooks.
Updated evaluator selection and dataset loading: base_evaluator.prepare_dataloader now picks CbbqDataset when the slug contains cbbq, evaluate.py mapping supports cbbq:bias:<type> / cbbq:unbias:<type>, and evaluate_factory.create_evaluator routes cbbq ids to CbbqEvaluator.
Corrected CBBQ_BIAS_TYPES in evaluation_utils/enums.py to the full taxonomy and removed CBBQ-specific logic from custom_dataset.py; also addressed a lint issue by removing an unused import.

Testing

Ran ruff format . which reformatted one file and completed successfully.
Ran ruff check . and all lint checks passed.
Ran pytest in this environment but collection failed due to missing optional dependencies (pydantic_settings, datasets, torch) so the full test suite could not be executed here; the new unit tests added are tests/test_cbbq_metrics.py and the CLI mapping coverage in tests/test_evaluate_cli.py.
Attempted basedpyright but the command is not available in this environment.

Note

Medium Risk
Introduces a new evaluation mode and refactors CLI orchestration to reuse evaluators differently, which can affect dataset routing, output artifacts, and grading behavior across multi-dataset runs.

Overview
Adds first-class support for CBBQ (Chinese contextual bias benchmark) as a multiple-choice evaluation flow, including a new CbbqDataset preprocessor and a generic MultipleChoiceEvaluator base that grades without a judge model and writes CBBQ-specific metrics.csv/responses.json plus per-model summary aggregates.

Integrates CBBQ into the CLI by adding cbbq: behavior presets (short and explicit forms), tagging datasets with a new DatasetConfig.answer_format, and updating the run loop to manage separate evaluator instances per “evaluator family” so mixed CBBQ + non-CBBQ runs don’t reuse incompatible evaluators.

Adds an upload_cbbq_to_hub.py utility to validate/normalize raw CBBQ CSVs and push them to HuggingFace dataset repos, expands enums/docs/tests to cover CBBQ types/metrics, and tightens dev workflow config (direnv + pre-commit paths/stage tweaks).

^{Written by Cursor Bugbot for commit 907328e.}

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Add a dedicated CBBQ evaluation stack by introducing the multiple-choice loader/evaluator pair, deterministic grading flow, and enriched DatasetConfig metadata so bias/unbias runs emit the expected metrics, summaries, and MLflow entries. Document and wire the new presets, enums, CLI routing, and conversion/upload tooling so users can invoke cbbq: behaviors, push normalized datasets to Hugging Face, and keep the tightened dev workflow aligned.

Topic Details

Tooling & dataset publishing

Harden tooling, dataset publishing, and developer workflows by refreshing the environment guidance, dependency locking, and git/direnv hooks, adding the CBBQ conversion/upload scripts (with their tests) plus helpful import safety checks, so reproducing/pushing CBBQ data and running checks works reliably.

Modified files (9)

.envrc
.gitignore
.pre-commit-config.yaml
AGENTS.md
dataset_processing_scripts/convert_cbbq_eval_output.py
dataset_processing_scripts/upload_cbbq_to_hub.py
pyproject.toml
tests/test_upload_cbbq_to_hub.py
tests/test_util_functions.py

Latest Contributors(2)

User	Commit	Date
orr@hirundo.io	LLM-75-Fix-Result-Dire...	March 01, 2026
blewis@hirundo.io	LLM-38-Create-Developm...	November 18, 2025

Other

Other files

Modified files (1)

uv.lock

Latest Contributors(2)

User	Commit	Date
orr@hirundo.io	LLM-74-Upgrade-vllm-Re...	February 23, 2026
github-actions[bot]	Bump-version-to-0.1.6b...	February 19, 2026

CLI & docs

Integrate CLI routing/presets by teaching evaluate.py about the cbbq: shortcuts, normalizing dataset types via the expanded enums, routing to CbbqEvaluator through EvaluateFactory, and validating the behavior through README updates and CLI tests so users can request CBBQ bias/unbias flows that resolve to the correct multiple-choice answer format.

Modified files (5)

README.md
llm_behavior_eval/evaluate.py
llm_behavior_eval/evaluation_utils/enums.py
llm_behavior_eval/evaluation_utils/evaluate_factory.py
tests/test_evaluate_cli.py

Latest Contributors(2)

User	Commit	Date
orr@hirundo.io	LLM-75-Fix-Result-Dire...	March 01, 2026
mishana4life@gmail.com	LLM-64-Change-results-...	February 16, 2026

CBBQ evaluation core

Build the new CBBQ evaluation core by adding CbbqDataset, CbbqEvaluator, and the shared MultipleChoiceEvaluator, updating BaseEvaluator/CustomDataset to pick the right loader, tracking DatasetConfig.answer_format, and covering the flow through dataset/evaluator/unit tests so both splits emit the canonical metrics and summaries.

Modified files (10)

llm_behavior_eval/evaluation_utils/base_evaluator.py
llm_behavior_eval/evaluation_utils/cbbq_dataset.py
llm_behavior_eval/evaluation_utils/cbbq_evaluator.py
llm_behavior_eval/evaluation_utils/custom_dataset.py
llm_behavior_eval/evaluation_utils/dataset_config.py
llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py
tests/test_cbbq_dataset.py
tests/test_cbbq_metrics.py
tests/test_dataset_config.py
tests/test_multiple_choice_evaluator.py

Latest Contributors(2)

User	Commit	Date
orr@hirundo.io	LLM-76-Add-option-to-r...	March 02, 2026
blewis@hirundo.io	LLM-68-Remove-empty-co...	February 19, 2026

baz-reviewer · 2026-02-25T17:40:06Z

Spec Reviewer Report 📪 ✅

All 3 Identified Requirements Met for Ticket:

Add CBBQ support

3 met requirements

Requirement 1: Evaluator supports CBBQ behavior presets, dataset type inference for unbias splits, and CLI help text updates
Explanation: The CLI now exposes both short and explicit CBBQ presets, the evaluator maps those presets to `hirundo-io/cbbq-…-multi-choice` datasets, infers UNBIAS splits via `_infer_dataset_type`, routes such datasets through the CBBQ-specific loader/evaluator, and the help/README describe the new flags.
Evidence:
- llm_behavior_eval/evaluate.py:78-164 – `_behavior_presets` now recognizes CBBQ short/explicit presets and returns `hirundo-io/cbbq---multi-choice` IDs.
- llm_behavior_eval/evaluate.py:188-205 & 584-594 – `_infer_dataset_type` switches to `DatasetType.UNBIAS` for `-unbias-` datasets and dataset creation marks `/cbbq-` runs with the MULTIPLE_CHOICE format before passing them to `EvaluateFactory`.
- llm_behavior_eval/evaluation_utils/evaluate_factory.py:25-37 & llm_behavior_eval/evaluation_utils/base_evaluator.py:170-214 – CBBQ dataset IDs instantiate `CbbqEvaluator` and load `CbbqDataset`, ensuring the new format flows through the dedicated loader and metrics logic.
- README.md:45-63 – README documents the new short/explicit CBBQ CLI presets so users know how to request bias/unbias splits.
- tests/test_evaluate_cli.py:389-452 – CLI tests assert that CBBQ behaviors produce the expected file paths, dataset types, and multiple-choice answer formats.

Requirement 2: Route CBBQ datasets through the bias evaluator and add summary labeling
Explanation: CBBQ dataset selectors now build DatasetConfig entries that drive EvaluateFactory to instantiate CbbqEvaluator and the base evaluator loads CbbqDataset, while summary helpers emit “CBBQ: …” labels and write CBBQ-specific CSV summaries.
Evidence:
- llm_behavior_eval/evaluate.py:575-607 – DatasetConfig marks /cbbq- files as MULTIPLE_CHOICE and EvaluateFactory branches to CbbqEvaluator
- llm_behavior_eval/evaluation_utils/base_evaluator.py:170-214 – CBBQ files instantiate CbbqDataset so bias evaluator preprocessing/loader paths are executed
- llm_behavior_eval/evaluation_utils/base_evaluator.py:394-444 – infer_bias_label_from_slug now emits “CBBQ: ” for summary_brief rows
- llm_behavior_eval/evaluation_utils/cbbq_evaluator.py:368-415 – finalize_artifacts writes summary_cbbq.csv and aggregated cbbq_overall_summary.csv with dataset_type/dimension labels

Requirement 3: CBBQ bias types and behavior presets defined and tested
Explanation: CBBQ bias-type enums plus CLI preset handling and answer-format wiring are implemented and exercised by CLI tests, fulfilling the requirement.
Evidence:
- llm_behavior_eval/evaluation_utils/enums.py:25-67 defines CBBQ_BIAS_TYPES, CBBQ_BASIC_TYPES, and CBBQ_ALL_TYPES for the new benchmark
- llm_behavior_eval/evaluate.py:78-185 expands CLI preset parsing to support CBBQ short/explicit forms and maps them to multi-choice datasets
- tests/test_evaluate_cli.py:389-498 asserts behavior presets route to the correct CBBQ dataset slugs and set AnswerFormat.MULTIPLE_CHOICE

Used resources:
Hash: 9cddd15 | Ticket: link

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4810e6d2eb

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

baz-reviewer · 2026-03-04T18:25:13Z

+
+
+def test_generate_and_grade_shared_flow(tmp_path) -> None:
+    evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path)


❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

_{Finding type: Log Error}

Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator.grade to _grade_impl, satisfying BaseEvaluator's requirement and allowing _DummyMultipleChoiceEvaluator instantiation without abstract method errors.

baz-reviewer · 2026-03-04T18:25:13Z

+
+
+def test_evaluate_runs_generate_free_grade_in_order(tmp_path) -> None:
+    evaluator = _DummyMultipleChoiceEvaluator(output_dir=tmp_path)


❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "_DummyMultipleChoiceEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

_{Finding type: Log Error}

Commit f202eb6 addressed this comment by providing a concrete _grade_impl implementation in MultipleChoiceEvaluator, fulfilling the abstract BaseEvaluator._grade_impl contract so _DummyMultipleChoiceEvaluator can be instantiated without having to override it.

baz-reviewer · 2026-03-04T18:25:13Z

+        elif "cbbq" in dataset_id:
+            from .cbbq_evaluator import CbbqEvaluator
+
+            return CbbqEvaluator(eval_config, dataset_config)


❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

_{Finding type: Log Error}

Commit 1a499e8 addressed this comment by introducing a concrete grade() implementation in BaseEvaluator that delegates to a new abstract _grade_impl(), and updating each evaluator (including FreeTextBiasEvaluator, the parent of CbbqEvaluator) to implement _grade_impl(), allowing CbbqEvaluator to satisfy the abstract requirements.

baz-reviewer · 2026-03-04T18:25:13Z

+        dataset_type=DatasetType.BIAS,
+        preprocess_config=PreprocessConfig(),
+    )
+    evaluator = CbbqEvaluator(eval_config, dataset_config)


❌ Failed check: pyright / pyright
I’ve attached the relevant part of the log for your convenience:
Cannot instantiate abstract class "CbbqEvaluator". "BaseEvaluator._grade_impl" is not implemented (reportAbstractUsage)

_{Finding type: Log Error}

Commit f202eb6 addressed this comment by renaming MultipleChoiceEvaluator’s grade method to implement _grade_impl, fulfilling BaseEvaluator’s abstract requirement so CbbqEvaluator can be instantiated without Pyright errors.
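
The pyright failures above share one cause: BaseEvaluator exposes a concrete grade() that delegates to an abstract _grade_impl(), so a subclass that only defines grade() remains abstract. A minimal sketch of that pattern follows. The method signatures, the exact-match grading rule, and the BrokenEvaluator class are illustrative assumptions, not the repository's actual code.

```python
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Simplified stand-in for the real base class (signatures are assumed)."""

    def grade(self, predictions: list[str], gold: list[str]) -> float:
        # Concrete entry point: shared validation lives here, then delegate.
        if len(predictions) != len(gold):
            raise ValueError("predictions and gold must have the same length")
        return self._grade_impl(predictions, gold)

    @abstractmethod
    def _grade_impl(self, predictions: list[str], gold: list[str]) -> float:
        """Subclasses must provide the actual grading logic."""


class MultipleChoiceEvaluator(BaseEvaluator):
    def _grade_impl(self, predictions: list[str], gold: list[str]) -> float:
        # Hypothetical rule: exact-match accuracy over choice letters.
        if not gold:
            return 0.0
        correct = sum(p == g for p, g in zip(predictions, gold))
        return correct / len(gold)


class CbbqEvaluator(MultipleChoiceEvaluator):
    """Inherits the concrete _grade_impl, so it can be instantiated."""


class BrokenEvaluator(BaseEvaluator):
    # Overrides only grade(); _grade_impl stays abstract, so instantiation
    # fails at runtime with TypeError and pyright flags reportAbstractUsage.
    def grade(self, predictions: list[str], gold: list[str]) -> float:
        return 0.0
```

Under these assumptions, CbbqEvaluator() can be constructed because the abstract method is satisfied one level up, while BrokenEvaluator() cannot.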

baz-reviewer · 2026-03-04T18:31:48Z

+def _normalize_label(raw_value: object) -> int | None:
+    """Normalize parsed CBBQ label values to ``{0,1,2}``.
+
+    Args:
+        raw_value: Raw value from ``responses.json``.


_normalize_label reimplements the same {0,1,2} parsing rules already defined in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_label. If that original helper ever tightens or loosens what counts as valid, the two copies have to be kept in sync by hand; once they diverge, this converter will drop or keep different rows than the dataset loader. Can we reuse the existing helper (or extract a shared utility) instead of duplicating the logic?

_{Finding type: Code Dedup and Conventions}

baz-reviewer · 2026-03-04T18:31:48Z

+def _normalize_polarity(raw_value: object) -> str | None:
+    """Normalize raw polarity values to CBBQ tokens.
+
+    Args:
+        raw_value: Raw ``question_polarity`` value from ``responses.json``.


_normalize_polarity duplicates the normalization rules that live in llm_behavior_eval/evaluation_utils/cbbq_dataset.py::_normalize_cbbq_polarity; any change (e.g., allowing new string aliases) would have to be repeated here, risking metric mismatch between the converter and the dataset loader. Can we factor this into a shared helper or import the existing one instead of duplicating it?

_{Finding type: Code Dedup and Conventions}

baz-reviewer · 2026-03-04T18:31:48Z

+    for response_row in responses:
+        predicted_label = _normalize_label(response_row.get("predicted_label"))
+        if predicted_label is None:
+            continue
+


_convert_bias_outputs now skips a sample (via continue) whenever _normalize_label rejects predicted_label, so outputs.csv omits those samples even though examples (derived from responses/num_samples) counts every request. The downstream CBBQ eval expects one output row per sample and now gets fewer rows than metrics.csv claims, which breaks alignment. Can we keep a row (e.g. with a blank generated value) instead of skipping it?

_{Finding type: Logical Bugs}

Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 223 to 227, the _convert_bias_outputs function currently does `if predicted_label is None: continue`, which drops samples and breaks alignment with metrics.csv. Change this so it does not continue; instead, append a row for that sample with "generated" set to an empty string (or another explicit placeholder), "label" set to the normalized gold_label if present or empty string, and "raw_output" set to the generated_text. Ensure you only use LABEL_TO_LETTER when predicted_label is not None to avoid KeyError. This preserves one output row per input sample and keeps outputs.csv aligned with examples/metrics.

baz-reviewer · 2026-03-04T18:31:48Z

+        target_dir = output_root / "disambiguous" / run_folder.bias_type
+        outputs_df = _convert_disamb_outputs(responses)
+        converted_metrics = pd.DataFrame(
+            [
+                {
+                    "bias": run_folder.bias_type,
+                    "model": model_tag,
+                    "examples": examples,
+                    "accuracy": _metric_as_float(
+                        metrics_row,
+                        "disambiguated_accuracy",
+                        0.0,


Disambiguated metrics are emitted with a disamb_bias_score column while the documented/expected schema (README lines 69‑71 and the upstream CBBQ eval_disamb.py) uses disambiguated_bias_score, so downstream consumers looking for the canonical column no longer see the bias score; can we keep the documented column name?

_{Finding type: Breaking Changes}

Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 342 to 360, the else branch of the _convert_one_run function constructs converted_metrics using the key 'disamb_bias_score', which breaks the expected schema. Change that dictionary key to 'disambiguated_bias_score' so the CSV column matches the documented/upstream schema (optionally also emit an alias column for backward compatibility). Ensure the value still comes from _metric_as_float(metrics_row, 'disambiguated_bias_score', 0.0).
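
A sketch of the column-name fix, assuming the metrics row is a plain mapping: the canonical disambiguated_bias_score key is emitted, and the shorter disamb_bias_score is kept as an optional alias. The body of _metric_as_float and the function name build_disamb_metrics_row are assumptions; only the call shape _metric_as_float(metrics_row, key, default) appears in the diff.

```python
def _metric_as_float(metrics_row: dict[str, object], key: str, default: float) -> float:
    """Assumed behavior: read a metric as float, falling back to `default`."""
    value = metrics_row.get(key)
    try:
        return float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        # Missing key (None) or unparsable text falls back to the default.
        return default


def build_disamb_metrics_row(
    metrics_row: dict[str, object],
    bias_type: str,
    model_tag: str,
    examples: int,
) -> dict[str, object]:
    """Build the converted metrics row for a disambiguated run (hypothetical)."""
    bias_score = _metric_as_float(metrics_row, "disambiguated_bias_score", 0.0)
    return {
        "bias": bias_type,
        "model": model_tag,
        "examples": examples,
        "accuracy": _metric_as_float(metrics_row, "disambiguated_accuracy", 0.0),
        # Canonical column name expected by the documented schema.
        "disambiguated_bias_score": bias_score,
        # Optional alias for consumers that already read the short name.
        "disamb_bias_score": bias_score,
    }
```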

baz-reviewer · 2026-03-04T18:31:48Z

+    model_results_dir: Annotated[
+        Path,
+        typer.Option(
+            "--model-results-dir",
+            help=(
+                "Path to one model results directory (contains cbbq-* run folders)."
+            ),
+        ),
+    ],
+    output_root: Annotated[
+        Path,
+        typer.Option(


--output-root now defaults to Path("converted_cbbq") and --model-tag falls back to the model folder name, yet their Typer help strings omit those defaults so users must read code to know the behavior; can we mention the defaults (and platform/location when relevant) in the help text exactly where the options are defined?

_{Finding type: Document parameters clearly}

Prompt for AI Agents:

In dataset_processing_scripts/convert_cbbq_eval_output.py around lines 414 to 439, the Typer option help strings for output_root and model_tag omit their defaults. Update the Typer.Option help for --output-root to mention that it defaults to Path("converted_cbbq") (i.e., a folder named converted_cbbq created in the current working directory) and update the --model-tag help to state it defaults to the model results folder name when not provided. Keep the rest of the option definitions unchanged and ensure the messages remain concise and user-facing.

baz-reviewer · 2026-03-08T10:43:56Z

+CBBQ_MAX_ANSWER_TOKENS = 128
+_CBBQ_ANSWER_RE = re.compile(r"(?<![A-Za-z0-9])([ABC])(?![A-Za-z0-9])", re.IGNORECASE)
+
+
+class CbbqSampleMetadata(TypedDict):
+    """Typed per-sample metadata extracted from CBBQ batches."""
+
+    gold_label: int
+    question_polarity: int
+
+
+def extract_cbbq_prediction(generated_text: str) -> int | None:


New helper extract_cbbq_prediction is documented with a single summary line but no Args/Returns sections, so consumers can't see the expected input type or the None/label output without reading the implementation; can we add the required docstring sections (and mention generated_text: str and the int | None return) per the documentation rule?

_{Finding type: Document parameters clearly | Severity: 🟢 Low}

Prompt for AI Agents:

In llm_behavior_eval/evaluation_utils/cbbq_evaluator.py around lines 18 to 41, the helper function extract_cbbq_prediction currently has only a one-line summary. Update its docstring to include explicit Args and Returns sections: document generated_text: str as the input parameter, and document the return as int | None (explain that it returns a label id 0/1/2 when a valid A/B/C choice is found and returns None when no valid match is found). Keep the existing short summary and add a one-line description of the regex-based extraction behavior.

Refine CBBQ dataset handling

01471a6

benglewis added the codex label Jan 12, 2026 — with ChatGPT Codex Connector

benglewis changed the title ~~Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator~~ LLM-57: Split CBBQ handling into dedicated dataset loader and add CBBQ evaluator Jan 12, 2026

benglewis self-assigned this Jan 12, 2026

orr-hirundo reviewed Jan 20, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Merge branch 'main' into codex/2026-01-12/linear-mention-llm-57-add-c…

d337cf3

…bbq-support # Conflicts: # llm_behavior_eval/evaluate.py

cursor Bot reviewed Jan 22, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/enums.py

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

benglewis added 3 commits January 22, 2026 10:28

Fix failing PyTest test

1585868

Fix PR comments

e461c08

Merge branch 'main' into codex/2026-01-12/linear-mention-llm-57-add-c…

2350eec

…bbq-support # Conflicts: # tests/test_evaluate_cli.py

cursor Bot reviewed Feb 22, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Cleanup multiple choice specific CBBQ code into `multiple_choice_eval…

cc1437d

…uator.py`

cursor Bot reviewed Feb 22, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py

benglewis added 4 commits February 25, 2026 14:17

Add .envrc to make using this repo nicer

f30e2b4

Update AGENTS.md to make adding new datasets a little easier

e7ac537

Update AGENTS.md to try to produce more relaible results

cc59d57

Add scripts to upload cbbq to HuggingFace Hub and fix -free-text …

9cddd15

…instead of `-multi-choice` and missing docs in README.md

cursor Bot reviewed Feb 25, 2026

Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py

benglewis added 3 commits February 25, 2026 18:41

Fix CBBQ Evaluator

ac367e8

Fix README.md for CBBQ and align behavior more correctly with existin…

07a7dc0

…g implementation

Align CBBQ CSV file naming with other evaluators

4810e6d

cursor Bot reviewed Feb 25, 2026

benglewis marked this pull request as ready for review February 25, 2026 19:33

benglewis requested review from RotemHir, mishana and shmuelyo as code owners February 25, 2026 19:33

greptile-apps Bot reviewed Feb 25, 2026

baz-reviewer Bot reviewed Feb 25, 2026

Comment thread dataset_processing_scripts/upload_cbbq_to_hub.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py

Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py

chatgpt-codex-connector Bot reviewed Feb 25, 2026

Comment thread llm_behavior_eval/evaluation_utils/multiple_choice_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Fix comments & CI

907328e

baz-reviewer Bot reviewed Feb 25, 2026

Comment thread llm_behavior_eval/evaluate.py

Comment thread llm_behavior_eval/evaluate.py

cursor Bot reviewed Feb 25, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py

Comment thread llm_behavior_eval/evaluation_utils/cbbq_evaluator.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py

benglewis added 2 commits March 2, 2026 14:00

Fix CBBQ text validation helper

969cb80

Merge branch 'main' into codex/2026-01-12/linear-mention-llm-57-add-c…

6302a81

…bbq-support # Conflicts: # .gitignore

baz-reviewer Bot reviewed Mar 2, 2026

Comment thread llm_behavior_eval/evaluation_utils/cbbq_dataset.py

baz-reviewer Bot added the baz approved label Mar 2, 2026

baz-reviewer Bot approved these changes Mar 2, 2026

Add script to convert llm-behavior-eval output to match CBBQ output

53fae90

baz-reviewer Bot removed the baz approved label Mar 4, 2026

baz-reviewer Bot reviewed Mar 4, 2026

benglewis added 6 commits March 5, 2026 14:10

Merge branch 'main' into codex/2026-01-12/linear-mention-llm-57-add-c…

1a499e8

…bbq-support

Fix multiple_choice_evaluator after merge from main

f202eb6

Fix results folder gitignore

f244553

Handle vllm not working on macOS (since it needs to be compiled)

f3a22bc

Change cbbq_evaluator.py to use regex search on the generated text …

ec85e95

…for the correct answer and increase `max_answer_tokens` to match original CBBQ repository

At least temporarily, switch to not use chat template for CBBQ evalua…

7be7b15

…tion This is probably wrong, but necessary to match broken upstream at this stage

baz-reviewer Bot reviewed Mar 8, 2026
