Adding plugin evaluation by RotemHir · Pull Request #56 · Hirundo-io/llm-behavior-eval

RotemHir · 2025-11-25T11:44:07Z

User description

Note

Adds Plugin and Gemini inference backends, exposes new CLI/config options for selection/tokenization, and updates the default results directory.

Evaluation Backends:
- Add PluginEvalEngine and GeminiEvalEngine; select via BaseEvaluator (use_plugin/use_gemini), alongside existing Transformers/vLLM.
- Ensure tokenizer setup (left padding) and add eval_collator helper.
CLI (llm_behavior_eval/evaluate.py):
- New options: --use-plugin, --use-gemini, --gemini-model-name, --tokenizer-model.
- Update results path to results_400samples.
Config (EvaluationConfig):
- Add use_plugin, use_gemini, gemini_model_name, tokenizer_path_or_repo_id fields; propagate through evaluator creation.
Free‑text Bias Evaluator:
- Minor prompt text normalization and integration with new engines for answer generation.

^{Written by Cursor Bugbot for commit 4fcb29a. Configure here.}

Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
main_("main"):::modified
BaseEvaluator_init_("BaseEvaluator.__init__"):::modified
PluginEvalEngine_init_("PluginEvalEngine.__init__"):::added
GeminiEvalEngine_init_("GeminiEvalEngine.__init__"):::added
PluginEvalEngine_generate_answers_("PluginEvalEngine.generate_answers"):::added
PluginModelForCausalLM_("PluginModelForCausalLM"):::added
PluginEvalEngine_get_first_non_oom_batch_size_("PluginEvalEngine._get_first_non_oom_batch_size"):::added
GeminiEvalEngine_generate_answers_("GeminiEvalEngine.generate_answers"):::added
GeminiEvalEngine_call_gemini_single_("GeminiEvalEngine._call_gemini_single"):::added
genai_Client_("genai.Client"):::added
GeminiEvalEngine_build_prompts_from_input_ids_("GeminiEvalEngine._build_prompts_from_input_ids"):::added
GeminiEvalEngine_tokenizer_("GeminiEvalEngine.tokenizer"):::added
main_ -- "Passes new flags to select evaluation engine." --> BaseEvaluator_init_
BaseEvaluator_init_ -- "Adds PluginEvalEngine initialization on use_plugin flag." --> PluginEvalEngine_init_
BaseEvaluator_init_ -- "Adds GeminiEvalEngine initialization on use_gemini flag." --> GeminiEvalEngine_init_
PluginEvalEngine_generate_answers_ -- "Uses PluginModelForCausalLM for fused answer generation." --> PluginModelForCausalLM_
PluginEvalEngine_get_first_non_oom_batch_size_ -- "Probes batch size using PluginModelForCausalLM with backend." --> PluginModelForCausalLM_
GeminiEvalEngine_generate_answers_ -- "Sends prompts to Gemini API for answer generation." --> GeminiEvalEngine_call_gemini_single_
GeminiEvalEngine_call_gemini_single_ -- "Uses genai.Client to call Gemini API for content." --> genai_Client_
GeminiEvalEngine_build_prompts_from_input_ids_ -- "Decodes input_ids to text prompts for Gemini API." --> GeminiEvalEngine_tokenizer_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px

Introduce new evaluation engines, PluginEvalEngine and GeminiEvalEngine, to support model inference using a custom plugin model and the Google Gemini API. Update the BaseEvaluator to dynamically select these engines based on new command-line interface options and EvaluationConfig settings, while also adjusting tokenizer padding and data collation for compatibility.

Topic Details

New Eval Backends

Implement new evaluation engines for PluginModelForCausalLM and Google Gemini, integrating them into the BaseEvaluator for model inference. This includes adding new CLI options and configuration fields to select and configure these backends, and ensuring proper tokenizer setup with left padding and a new eval_collator helper.

Modified files (5)

llm_behavior_eval/evaluation_utils/plugin_eval_engine.py
llm_behavior_eval/evaluation_utils/gemini_eval_engine.py
llm_behavior_eval/evaluation_utils/eval_config.py
llm_behavior_eval/evaluation_utils/base_evaluator.py
llm_behavior_eval/evaluate.py

Latest Contributors(2)

User	Commit	Date
shmuelyo	LLM-51-Make-test-model...	December 02, 2025
blewis@hirundo.io	LLM-49-Adopt-.jsonl-fo...	December 02, 2025

Normalize Bias Prompts

Normalize prompt text within the FreeTextBiasEvaluator for consistency and integration with the new evaluation engines.

Modified files (1)

llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py

Latest Contributors(2)

User	Commit	Date
blewis@hirundo.io	LLM-49-Adopt-.jsonl-fo...	December 02, 2025
tomer@hirundo.io	LLM-27-Support-multi-m...	October 20, 2025

This pull request is reviewed by Baz. Review like a pro on (Baz).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: RotemHir <rotem@hirundo.io>

….com/Hirundo-io/llm-behavior-eval into llm-41-add-support-for-plugin-models

benglewis

A few important comments / suggestions

benglewis · 2025-11-26T11:51:22Z

+def eval_collator(features):
+    batch = default_data_collator(features)
+    batch["input_ids"] = batch.pop("test_input_ids")
+    batch["attention_mask"] = batch.pop("test_attention_mask")
+    return batch


Why do you need this? As far as I can see, all datasets are expected to produce test_input_ids and test_attention_mask and the Evaluator classes (e.g. FreeTextBiasEvaluator, FreeTextHaluEvaluator and FreeTextPromptInjectionEvaluator) are all passing the test_input_ids and test_attention_mask arguments to the input_ids and attention_mask arguments for the eval engine which then uses them as normal. I think that YAGNI and it may even break intended behavior

I don't need it and I didn't add this as far as i know

benglewis · 2025-11-26T11:52:35Z

+        if self.eval_config.use_plugin:
+            self.eval_engine = PluginEvalEngine(
+                self.data_collator,
+                self.eval_config,
+            )
+        elif self.use_vllm:


You will see when you pull that the code now has eval_engine which will make this cleaner

so to fix this now? do merge again?

benglewis · 2025-11-26T12:00:35Z

+            outputs = model.generate(
+                input_ids=cast("torch.LongTensor", model_input_ids),
+                attention_mask=cast("torch.LongTensor", model_attention),
+                max_new_tokens=self.eval_config.answer_tokens,
+                do_sample=self.eval_config.sample,
+                # temperature=self.eval_config.temperature, # TODO: add temperature
+                pad_token_id=tokenizer.pad_token_id,
+            )


You'll need to add this before the call:

Suggested change

outputs = model.generate(

input_ids=cast("torch.LongTensor", model_input_ids),

attention_mask=cast("torch.LongTensor", model_attention),

max_new_tokens=self.eval_config.answer_tokens,

do_sample=self.eval_config.sample,

# temperature=self.eval_config.temperature, # TODO: add temperature

pad_token_id=tokenizer.pad_token_id,

)

if sampling_config.do_sample is None:

do_sample = self._get_sample_from_config(self.eval_config, self.is_judge)

else:

do_sample = sampling_config.do_sample

max_new_tokens = self._get_max_new_tokens(self.eval_config, self.is_judge)

temperature = sampling_config.temperature

if temperature is None:

temperature = 1.0 if do_sample else 0.0

top_p = sampling_config.top_p if sampling_config.top_p is not None else 1.0

top_k = sampling_config.top_k if sampling_config.top_k is not None else 0

seed = sampling_config.seed

outputs = cast("GenerationMixin", self.model).generate(

input_ids=cast("torch.LongTensor", model_input_ids),

attention_mask=cast("torch.LongTensor", model_attention),

max_new_tokens=max_new_tokens,

do_sample=do_sample,

pad_token_id=tokenizer.pad_token_id,

eos_token_id=self.tokenizer.eos_token_id,

temperature=temperature,

top_p=top_p,

top_k=top_k,

seed=seed,

)

This is something new?
then should i do a new merge?

benglewis · 2025-11-26T12:02:06Z

+        self.eval_config = eval_config
+        self.device = next(self.model.parameters()).device
+
+    def generate_answers(


This essentially looks identical to the TransformersEngine version, are you sure that you need to override it for some reason?

The generate answers func?
mainly because of this:
model = cast("PluginModelForCausalLM", self.model)

I need to make sure that the generate function being used is the one of the plugin.
there is a possibility that the GenerationMixin will be enough though.... Just to clarify that this is a different generate function and it needs to use the plugin.

cursor

Comment @cursor review or bugbot run to trigger another review on this PR

cursor · 2025-12-04T10:30:57Z

            reasoning=reasoning,
            use_vllm=use_vllm,
+            use_plugin=use_plugin,
+            tokenizer_path_or_repo_id=tokenizer_model,


Bug: Missing use_gemini and gemini_model_name in EvaluationConfig

The CLI accepts --use-gemini and --gemini-model-name arguments, but they are never passed to the EvaluationConfig constructor. This means users cannot actually enable Gemini evaluation via the command line - eval_config.use_gemini will always be False (the default), and the GeminiEvalEngine will never be used.

cursor · 2025-12-04T10:30:57Z

    model_path_or_repo_id = model
    judge_path_or_repo_id = judge_model
-    result_dir = Path(__file__).parent / "results"
+    result_dir = Path(__file__).parent / "results_400samples"


Bug: Hardcoded debug folder name in results directory

The results directory is changed from results to results_400samples. This appears to be debug or experiment-specific code that was accidentally committed. All evaluation results will now be written to a non-standard directory name.

cursor · 2025-12-04T10:30:57Z

     refuses, not answerable.
   ✘ NO otherwise.

+C. *Equal-attribution answers*


Bug: Duplicate line in system prompt template

The line C. *Equal-attribution answers* appears twice consecutively (lines 107 and 108). This is likely a bad merge artifact as noted in the PR discussion, resulting in a malformed prompt that will be sent to the judge model.

cursor · 2025-12-04T10:30:57Z

+
+        for ids_row, mask_row in zip(input_ids, attention_mask, strict=False):
+            valid_len = int(mask_row.sum().item())
+            trimmed_ids = ids_row[:valid_len]


Bug: Left-padding incompatible with Gemini prompt extraction logic

The _build_prompts_from_input_ids method extracts tokens using ids_row[:valid_len], which assumes right-padding (padding at the end). However, BaseEvaluator sets tokenizer.padding_side = "left", meaning padding tokens are at the beginning and valid tokens are at the end. This causes GeminiEvalEngine to decode padding tokens instead of the actual prompt content, resulting in empty or garbage prompts being sent to the Gemini API.

baz-reviewer · 2025-12-29T10:10:16Z

+        # Load tokenizer based on HF model path. This is *not* Gemini's tokenizer,
+        # but your "reference" tokenizer (e.g., plugin base) used for prompts + token counts.
+        self.tokenizer = load_tokenizer_with_transformers(
+            eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id,
+            eval_config.model_token,
+        )
+        if not self.tokenizer.pad_token:


GeminiEvalEngine blindly calls load_tokenizer_with_transformers(eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id, …). When the model under test is not a HuggingFace repo (e.g. a local plugin path or custom artifact) and the optional --tokenizer-model argument is not supplied, the tokenizer load fails with ValueError before any Gemini calls, so --use-gemini cannot be used on non-HF models. We need either to require/validate a compatible HF tokenizer when use_gemini is enabled or provide an explicit error message explaining that --tokenizer-model must be passed for non-HF models.

Finding type: Type Inconsistency

baz-reviewer · 2025-12-29T10:10:16Z

+        plugin_path = eval_config.model_path_or_repo_id
+        cfg = AutoConfig.from_pretrained(plugin_path)
+        tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path)
+        if tokenizer.pad_token is None:
+            tokenizer.pad_token = tokenizer.eos_token


AutoTokenizer.from_pretrained is called on cfg.base_model_name_or_path without passing eval_config.model_token, but the private base model needs the same auth token that is later provided to PluginModelForCausalLM. As soon as the plugin's base model is private, tokenizer loading raises an authentication error even when --model-token was supplied.

Suggested change

plugin_path = eval_config.model_path_or_repo_id

cfg = AutoConfig.from_pretrained(plugin_path)

tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path)

if tokenizer.pad_token is None:

tokenizer.pad_token = tokenizer.eos_token

plugin_path = eval_config.model_path_or_repo_id

cfg = AutoConfig.from_pretrained(plugin_path)

tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path, token=eval_config.model_token)

if tokenizer.pad_token is None:

tokenizer.pad_token = tokenizer.eos_token

Finding type: Logical Bugs

baz-reviewer · 2025-12-29T10:10:16Z

+from transformers import (
+    PreTrainedTokenizer,
+    PreTrainedTokenizerFast,
+)
 from transformers.data.data_collator import default_data_collator
 from transformers.pipelines import pipeline

+from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine
+from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine
 from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (


BaseEvaluator now unconditionally imports GeminiEvalEngine and PluginEvalEngine, creating hard dependencies that crash when those packages are missing—even with --no-use-gemini or --no-use-plugin. Could these imports be conditional?

Suggested change

from transformers import (

PreTrainedTokenizer,

PreTrainedTokenizerFast,

)

from transformers.data.data_collator import default_data_collator

from transformers.pipelines import pipeline

from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine

from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine

from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (

from transformers import (

PreTrainedTokenizer,

PreTrainedTokenizerFast,

)

from transformers.data.data_collator import default_data_collator

from transformers.pipelines import pipeline

from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (

Finding types: Logical Bugs

Adding plugin evaluation

610dd6b

RotemHir requested review from benglewis, mishana and shmuelyo as code owners November 25, 2025 11:44

chatgpt-codex-connector Bot reviewed Nov 25, 2025

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py

RotemHir added 4 commits November 25, 2025 12:27

Fixing a bug

217fa37

Merge branch 'main' into llm-41-add-support-for-plugin-models

faa151f

Signed-off-by: RotemHir <rotem@hirundo.io>

Merge branch 'llm-41-add-support-for-plugin-models' of https://github…

4709bce

….com/Hirundo-io/llm-behavior-eval into llm-41-add-support-for-plugin-models

fixing problem

b4b7b9e

benglewis reviewed Nov 26, 2025

View reviewed changes

RotemHir added 3 commits November 27, 2025 12:05

tokenizing in left side before data preprocessing

642b801

changes from pr

003e40d

Adding option to run Gemini only and fixing bug for prism

4fcb29a

cursor Bot reviewed Dec 4, 2025

View reviewed changes

RotemHir added 2 commits December 4, 2025 10:55

fixing bugs

f0fd2e8

staging different bases as background for plugin

daa3dc9

baz-reviewer Bot reviewed Dec 29, 2025

View reviewed changes

shmuelyo assigned RotemHir Feb 26, 2026

-            outputs = model.generate(
-                input_ids=cast("torch.LongTensor", model_input_ids),
-                attention_mask=cast("torch.LongTensor", model_attention),
-                max_new_tokens=self.eval_config.answer_tokens,
-                do_sample=self.eval_config.sample,
-                # temperature=self.eval_config.temperature, # TODO: add temperature
-                pad_token_id=tokenizer.pad_token_id,
-            )
+            if sampling_config.do_sample is None:
+                do_sample = self._get_sample_from_config(self.eval_config, self.is_judge)
+            else:
+                do_sample = sampling_config.do_sample
+            max_new_tokens = self._get_max_new_tokens(self.eval_config, self.is_judge)
+            temperature = sampling_config.temperature
+            if temperature is None:
+                temperature = 1.0 if do_sample else 0.0
+            top_p = sampling_config.top_p if sampling_config.top_p is not None else 1.0
+            top_k = sampling_config.top_k if sampling_config.top_k is not None else 0
+            seed = sampling_config.seed
+            outputs = cast("GenerationMixin", self.model).generate(
+                input_ids=cast("torch.LongTensor", model_input_ids),
+                attention_mask=cast("torch.LongTensor", model_attention),
+                max_new_tokens=max_new_tokens,
+                do_sample=do_sample,
+                pad_token_id=tokenizer.pad_token_id,
+                eos_token_id=self.tokenizer.eos_token_id,
+                temperature=temperature,
+                top_p=top_p,
+                top_k=top_k,
+                seed=seed,
+            )

Conversation

RotemHir commented Nov 25, 2025 • edited by baz-reviewer Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Generated description

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

benglewis left a comment

Choose a reason for hiding this comment

Uh oh!

benglewis Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

RotemHir Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

benglewis Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

RotemHir Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benglewis Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

RotemHir Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

benglewis Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

RotemHir Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Dec 4, 2025

Choose a reason for hiding this comment

Bug: Missing use_gemini and gemini_model_name in EvaluationConfig

Uh oh!

cursor Bot Dec 4, 2025

Choose a reason for hiding this comment

Bug: Hardcoded debug folder name in results directory

Uh oh!

cursor Bot Dec 4, 2025

Choose a reason for hiding this comment

Bug: Duplicate line in system prompt template

Uh oh!

cursor Bot Dec 4, 2025

Choose a reason for hiding this comment

Bug: Left-padding incompatible with Gemini prompt extraction logic

Uh oh!

baz-reviewer Bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

RotemHir commented Nov 25, 2025 •

edited by baz-reviewer Bot

Loading

RotemHir Nov 27, 2025 •

edited

Loading