Skip to content

Adding plugin evaluation#56

Open
RotemHir wants to merge 10 commits into
mainfrom
llm-41-add-support-for-plugin-models
Open

Adding plugin evaluation#56
RotemHir wants to merge 10 commits into
mainfrom
llm-41-add-support-for-plugin-models

Conversation

@RotemHir
Copy link
Copy Markdown

@RotemHir RotemHir commented Nov 25, 2025

User description

Note

Adds Plugin and Gemini inference backends, exposes new CLI/config options for selection/tokenization, and updates the default results directory.

  • Evaluation Backends:
    • Add PluginEvalEngine and GeminiEvalEngine; select via BaseEvaluator (use_plugin/use_gemini), alongside existing Transformers/vLLM.
    • Ensure tokenizer setup (left padding) and add eval_collator helper.
  • CLI (llm_behavior_eval/evaluate.py):
    • New options: --use-plugin, --use-gemini, --gemini-model-name, --tokenizer-model.
    • Update results path to results_400samples.
  • Config (EvaluationConfig):
    • Add use_plugin, use_gemini, gemini_model_name, tokenizer_path_or_repo_id fields; propagate through evaluator creation.
  • Free‑text Bias Evaluator:
    • Minor prompt text normalization and integration with new engines for answer generation.

Written by Cursor Bugbot for commit 4fcb29a. Configure here.


Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
main_("main"):::modified
BaseEvaluator_init_("BaseEvaluator.__init__"):::modified
PluginEvalEngine_init_("PluginEvalEngine.__init__"):::added
GeminiEvalEngine_init_("GeminiEvalEngine.__init__"):::added
PluginEvalEngine_generate_answers_("PluginEvalEngine.generate_answers"):::added
PluginModelForCausalLM_("PluginModelForCausalLM"):::added
PluginEvalEngine_get_first_non_oom_batch_size_("PluginEvalEngine._get_first_non_oom_batch_size"):::added
GeminiEvalEngine_generate_answers_("GeminiEvalEngine.generate_answers"):::added
GeminiEvalEngine_call_gemini_single_("GeminiEvalEngine._call_gemini_single"):::added
genai_Client_("genai.Client"):::added
GeminiEvalEngine_build_prompts_from_input_ids_("GeminiEvalEngine._build_prompts_from_input_ids"):::added
GeminiEvalEngine_tokenizer_("GeminiEvalEngine.tokenizer"):::added
main_ -- "Passes new flags to select evaluation engine." --> BaseEvaluator_init_
BaseEvaluator_init_ -- "Adds PluginEvalEngine initialization on use_plugin flag." --> PluginEvalEngine_init_
BaseEvaluator_init_ -- "Adds GeminiEvalEngine initialization on use_gemini flag." --> GeminiEvalEngine_init_
PluginEvalEngine_generate_answers_ -- "Uses PluginModelForCausalLM for fused answer generation." --> PluginModelForCausalLM_
PluginEvalEngine_get_first_non_oom_batch_size_ -- "Probes batch size using PluginModelForCausalLM with backend." --> PluginModelForCausalLM_
GeminiEvalEngine_generate_answers_ -- "Sends prompts to Gemini API for answer generation." --> GeminiEvalEngine_call_gemini_single_
GeminiEvalEngine_call_gemini_single_ -- "Uses genai.Client to call Gemini API for content." --> genai_Client_
GeminiEvalEngine_build_prompts_from_input_ids_ -- "Decodes input_ids to text prompts for Gemini API." --> GeminiEvalEngine_tokenizer_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px
Loading

Introduce new evaluation engines, PluginEvalEngine and GeminiEvalEngine, to support model inference using a custom plugin model and the Google Gemini API. Update the BaseEvaluator to dynamically select these engines based on new command-line interface options and EvaluationConfig settings, while also adjusting tokenizer padding and data collation for compatibility.

TopicDetails
New Eval Backends Implement new evaluation engines for PluginModelForCausalLM and Google Gemini, integrating them into the BaseEvaluator for model inference. This includes adding new CLI options and configuration fields to select and configure these backends, and ensuring proper tokenizer setup with left padding and a new eval_collator helper.
Modified files (5)
  • llm_behavior_eval/evaluation_utils/plugin_eval_engine.py
  • llm_behavior_eval/evaluation_utils/gemini_eval_engine.py
  • llm_behavior_eval/evaluation_utils/eval_config.py
  • llm_behavior_eval/evaluation_utils/base_evaluator.py
  • llm_behavior_eval/evaluate.py
Latest Contributors(2)
UserCommitDate
shmuelyoLLM-51-Make-test-model...December 02, 2025
blewis@hirundo.ioLLM-49-Adopt-.jsonl-fo...December 02, 2025
Normalize Bias Prompts Normalize prompt text within the FreeTextBiasEvaluator for consistency and integration with the new evaluation engines.
Modified files (1)
  • llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py
Latest Contributors(2)
UserCommitDate
blewis@hirundo.ioLLM-49-Adopt-.jsonl-fo...December 02, 2025
tomer@hirundo.ioLLM-27-Support-multi-m...October 20, 2025
This pull request is reviewed by Baz. Review like a pro on (Baz).

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_behavior_eval/evaluate.py
Copy link
Copy Markdown
Contributor

@benglewis benglewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few important comments / suggestions

Comment on lines +71 to +75
def eval_collator(features):
batch = default_data_collator(features)
batch["input_ids"] = batch.pop("test_input_ids")
batch["attention_mask"] = batch.pop("test_attention_mask")
return batch
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this? As far as I can see, all datasets are expected to produce test_input_ids and test_attention_mask and the Evaluator classes (e.g. FreeTextBiasEvaluator, FreeTextHaluEvaluator and FreeTextPromptInjectionEvaluator) are all passing the test_input_ids and test_attention_mask arguments to the input_ids and attention_mask arguments for the eval engine which then uses them as normal. I think that YAGNI and it may even break intended behavior

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need it and I didn't add this as far as i know

Comment on lines +99 to +104
if self.eval_config.use_plugin:
self.eval_engine = PluginEvalEngine(
self.data_collator,
self.eval_config,
)
elif self.use_vllm:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will see when you pull that the code now has eval_engine which will make this cleaner

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to fix this now? do merge again?

Comment thread llm_behavior_eval/evaluation_utils/eval_config.py
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Outdated
Comment on lines +56 to +63
outputs = model.generate(
input_ids=cast("torch.LongTensor", model_input_ids),
attention_mask=cast("torch.LongTensor", model_attention),
max_new_tokens=self.eval_config.answer_tokens,
do_sample=self.eval_config.sample,
# temperature=self.eval_config.temperature, # TODO: add temperature
pad_token_id=tokenizer.pad_token_id,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to add this before the call:

Suggested change
outputs = model.generate(
input_ids=cast("torch.LongTensor", model_input_ids),
attention_mask=cast("torch.LongTensor", model_attention),
max_new_tokens=self.eval_config.answer_tokens,
do_sample=self.eval_config.sample,
# temperature=self.eval_config.temperature, # TODO: add temperature
pad_token_id=tokenizer.pad_token_id,
)
if sampling_config.do_sample is None:
do_sample = self._get_sample_from_config(self.eval_config, self.is_judge)
else:
do_sample = sampling_config.do_sample
max_new_tokens = self._get_max_new_tokens(self.eval_config, self.is_judge)
temperature = sampling_config.temperature
if temperature is None:
temperature = 1.0 if do_sample else 0.0
top_p = sampling_config.top_p if sampling_config.top_p is not None else 1.0
top_k = sampling_config.top_k if sampling_config.top_k is not None else 0
seed = sampling_config.seed
outputs = cast("GenerationMixin", self.model).generate(
input_ids=cast("torch.LongTensor", model_input_ids),
attention_mask=cast("torch.LongTensor", model_attention),
max_new_tokens=max_new_tokens,
do_sample=do_sample,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
temperature=temperature,
top_p=top_p,
top_k=top_k,
seed=seed,
)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something new?
then should i do a new merge?

Comment thread llm_behavior_eval/evaluation_utils/plugin_eval_engine.py Outdated
self.eval_config = eval_config
self.device = next(self.model.parameters()).device

def generate_answers(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially looks identical to the TransformersEngine version, are you sure that you need to override it for some reason?

Copy link
Copy Markdown
Author

@RotemHir RotemHir Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generate answers func?
mainly because of this:
model = cast("PluginModelForCausalLM", self.model)

I need to make sure that the generate function being used is the one of the plugin.
there is a possibility that the GenerationMixin will be enough though.... Just to clarify that this is a different generate function and it needs to use the plugin.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment @cursor review or bugbot run to trigger another review on this PR

reasoning=reasoning,
use_vllm=use_vllm,
use_plugin=use_plugin,
tokenizer_path_or_repo_id=tokenizer_model,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing use_gemini and gemini_model_name in EvaluationConfig

The CLI accepts --use-gemini and --gemini-model-name arguments, but they are never passed to the EvaluationConfig constructor. This means users cannot actually enable Gemini evaluation via the command line - eval_config.use_gemini will always be False (the default), and the GeminiEvalEngine will never be used.

Fix in Cursor Fix in Web

Comment thread llm_behavior_eval/evaluate.py Outdated
model_path_or_repo_id = model
judge_path_or_repo_id = judge_model
result_dir = Path(__file__).parent / "results"
result_dir = Path(__file__).parent / "results_400samples"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Hardcoded debug folder name in results directory

The results directory is changed from results to results_400samples. This appears to be debug or experiment-specific code that was accidentally committed. All evaluation results will now be written to a non-standard directory name.

Fix in Cursor Fix in Web

refuses, not answerable.
✘ NO otherwise.

C. *Equal-attribution answers*
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Duplicate line in system prompt template

The line C. *Equal-attribution answers* appears twice consecutively (lines 107 and 108). This is likely a bad merge artifact as noted in the PR discussion, resulting in a malformed prompt that will be sent to the judge model.

Fix in Cursor Fix in Web


for ids_row, mask_row in zip(input_ids, attention_mask, strict=False):
valid_len = int(mask_row.sum().item())
trimmed_ids = ids_row[:valid_len]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Left-padding incompatible with Gemini prompt extraction logic

The _build_prompts_from_input_ids method extracts tokens using ids_row[:valid_len], which assumes right-padding (padding at the end). However, BaseEvaluator sets tokenizer.padding_side = "left", meaning padding tokens are at the beginning and valid tokens are at the end. This causes GeminiEvalEngine to decode padding tokens instead of the actual prompt content, resulting in empty or garbage prompts being sent to the Gemini API.

Fix in Cursor Fix in Web

Comment on lines +43 to +49
# Load tokenizer based on HF model path. This is *not* Gemini's tokenizer,
# but your "reference" tokenizer (e.g., plugin base) used for prompts + token counts.
self.tokenizer = load_tokenizer_with_transformers(
eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id,
eval_config.model_token,
)
if not self.tokenizer.pad_token:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeminiEvalEngine blindly calls load_tokenizer_with_transformers(eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id, …). When the model under test is not a HuggingFace repo (e.g. a local plugin path or custom artifact) and the optional --tokenizer-model argument is not supplied, the tokenizer load fails with ValueError before any Gemini calls, so --use-gemini cannot be used on non-HF models. We need either to require/validate a compatible HF tokenizer when use_gemini is enabled or provide an explicit error message explaining that --tokenizer-model must be passed for non-HF models.


Finding type: Type Inconsistency

Comment on lines +31 to +35
plugin_path = eval_config.model_path_or_repo_id
cfg = AutoConfig.from_pretrained(plugin_path)
tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoTokenizer.from_pretrained is called on cfg.base_model_name_or_path without passing eval_config.model_token, but the private base model needs the same auth token that is later provided to PluginModelForCausalLM. As soon as the plugin's base model is private, tokenizer loading raises an authentication error even when --model-token was supplied.

Suggested change
plugin_path = eval_config.model_path_or_repo_id
cfg = AutoConfig.from_pretrained(plugin_path)
tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
plugin_path = eval_config.model_path_or_repo_id
cfg = AutoConfig.from_pretrained(plugin_path)
tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path, token=eval_config.model_token)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token

Finding type: Logical Bugs

Comment on lines +15 to 24
from transformers import (
PreTrainedTokenizer,
PreTrainedTokenizerFast,
)
from transformers.data.data_collator import default_data_collator
from transformers.pipelines import pipeline

from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine
from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine
from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseEvaluator now unconditionally imports GeminiEvalEngine and PluginEvalEngine, creating hard dependencies that crash when those packages are missing—even with --no-use-gemini or --no-use-plugin. Could these imports be conditional?

Suggested change
from transformers import (
PreTrainedTokenizer,
PreTrainedTokenizerFast,
)
from transformers.data.data_collator import default_data_collator
from transformers.pipelines import pipeline
from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine
from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine
from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (
from transformers import (
PreTrainedTokenizer,
PreTrainedTokenizerFast,
)
from transformers.data.data_collator import default_data_collator
from transformers.pipelines import pipeline
from llm_behavior_eval.evaluation_utils.transformers_eval_engine import (

Finding types: Logical Bugs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants