Adding plugin evaluation#56
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: RotemHir <rotem@hirundo.io>
….com/Hirundo-io/llm-behavior-eval into llm-41-add-support-for-plugin-models
benglewis
left a comment
There was a problem hiding this comment.
A few important comments / suggestions
| def eval_collator(features): | ||
| batch = default_data_collator(features) | ||
| batch["input_ids"] = batch.pop("test_input_ids") | ||
| batch["attention_mask"] = batch.pop("test_attention_mask") | ||
| return batch |
There was a problem hiding this comment.
Why do you need this? As far as I can see, all datasets are expected to produce test_input_ids and test_attention_mask and the Evaluator classes (e.g. FreeTextBiasEvaluator, FreeTextHaluEvaluator and FreeTextPromptInjectionEvaluator) are all passing the test_input_ids and test_attention_mask arguments to the input_ids and attention_mask arguments for the eval engine which then uses them as normal. I think that YAGNI and it may even break intended behavior
There was a problem hiding this comment.
I don't need it and I didn't add this as far as i know
| if self.eval_config.use_plugin: | ||
| self.eval_engine = PluginEvalEngine( | ||
| self.data_collator, | ||
| self.eval_config, | ||
| ) | ||
| elif self.use_vllm: |
There was a problem hiding this comment.
You will see when you pull that the code now has eval_engine which will make this cleaner
There was a problem hiding this comment.
so to fix this now? do merge again?
| outputs = model.generate( | ||
| input_ids=cast("torch.LongTensor", model_input_ids), | ||
| attention_mask=cast("torch.LongTensor", model_attention), | ||
| max_new_tokens=self.eval_config.answer_tokens, | ||
| do_sample=self.eval_config.sample, | ||
| # temperature=self.eval_config.temperature, # TODO: add temperature | ||
| pad_token_id=tokenizer.pad_token_id, | ||
| ) |
There was a problem hiding this comment.
You'll need to add this before the call:
| outputs = model.generate( | |
| input_ids=cast("torch.LongTensor", model_input_ids), | |
| attention_mask=cast("torch.LongTensor", model_attention), | |
| max_new_tokens=self.eval_config.answer_tokens, | |
| do_sample=self.eval_config.sample, | |
| # temperature=self.eval_config.temperature, # TODO: add temperature | |
| pad_token_id=tokenizer.pad_token_id, | |
| ) | |
| if sampling_config.do_sample is None: | |
| do_sample = self._get_sample_from_config(self.eval_config, self.is_judge) | |
| else: | |
| do_sample = sampling_config.do_sample | |
| max_new_tokens = self._get_max_new_tokens(self.eval_config, self.is_judge) | |
| temperature = sampling_config.temperature | |
| if temperature is None: | |
| temperature = 1.0 if do_sample else 0.0 | |
| top_p = sampling_config.top_p if sampling_config.top_p is not None else 1.0 | |
| top_k = sampling_config.top_k if sampling_config.top_k is not None else 0 | |
| seed = sampling_config.seed | |
| outputs = cast("GenerationMixin", self.model).generate( | |
| input_ids=cast("torch.LongTensor", model_input_ids), | |
| attention_mask=cast("torch.LongTensor", model_attention), | |
| max_new_tokens=max_new_tokens, | |
| do_sample=do_sample, | |
| pad_token_id=tokenizer.pad_token_id, | |
| eos_token_id=self.tokenizer.eos_token_id, | |
| temperature=temperature, | |
| top_p=top_p, | |
| top_k=top_k, | |
| seed=seed, | |
| ) |
There was a problem hiding this comment.
This is something new?
then should i do a new merge?
| self.eval_config = eval_config | ||
| self.device = next(self.model.parameters()).device | ||
|
|
||
| def generate_answers( |
There was a problem hiding this comment.
This essentially looks identical to the TransformersEngine version, are you sure that you need to override it for some reason?
There was a problem hiding this comment.
The generate answers func?
mainly because of this:
model = cast("PluginModelForCausalLM", self.model)
I need to make sure that the generate function being used is the one of the plugin.
there is a possibility that the GenerationMixin will be enough though.... Just to clarify that this is a different generate function and it needs to use the plugin.
| reasoning=reasoning, | ||
| use_vllm=use_vllm, | ||
| use_plugin=use_plugin, | ||
| tokenizer_path_or_repo_id=tokenizer_model, |
There was a problem hiding this comment.
Bug: Missing use_gemini and gemini_model_name in EvaluationConfig
The CLI accepts --use-gemini and --gemini-model-name arguments, but they are never passed to the EvaluationConfig constructor. This means users cannot actually enable Gemini evaluation via the command line - eval_config.use_gemini will always be False (the default), and the GeminiEvalEngine will never be used.
| model_path_or_repo_id = model | ||
| judge_path_or_repo_id = judge_model | ||
| result_dir = Path(__file__).parent / "results" | ||
| result_dir = Path(__file__).parent / "results_400samples" |
There was a problem hiding this comment.
| refuses, not answerable. | ||
| ✘ NO otherwise. | ||
|
|
||
| C. *Equal-attribution answers* |
There was a problem hiding this comment.
|
|
||
| for ids_row, mask_row in zip(input_ids, attention_mask, strict=False): | ||
| valid_len = int(mask_row.sum().item()) | ||
| trimmed_ids = ids_row[:valid_len] |
There was a problem hiding this comment.
Bug: Left-padding incompatible with Gemini prompt extraction logic
The _build_prompts_from_input_ids method extracts tokens using ids_row[:valid_len], which assumes right-padding (padding at the end). However, BaseEvaluator sets tokenizer.padding_side = "left", meaning padding tokens are at the beginning and valid tokens are at the end. This causes GeminiEvalEngine to decode padding tokens instead of the actual prompt content, resulting in empty or garbage prompts being sent to the Gemini API.
| # Load tokenizer based on HF model path. This is *not* Gemini's tokenizer, | ||
| # but your "reference" tokenizer (e.g., plugin base) used for prompts + token counts. | ||
| self.tokenizer = load_tokenizer_with_transformers( | ||
| eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id, | ||
| eval_config.model_token, | ||
| ) | ||
| if not self.tokenizer.pad_token: |
There was a problem hiding this comment.
GeminiEvalEngine blindly calls load_tokenizer_with_transformers(eval_config.tokenizer_path_or_repo_id or eval_config.model_path_or_repo_id, …). When the model under test is not a HuggingFace repo (e.g. a local plugin path or custom artifact) and the optional --tokenizer-model argument is not supplied, the tokenizer load fails with ValueError before any Gemini calls, so --use-gemini cannot be used on non-HF models. We need either to require/validate a compatible HF tokenizer when use_gemini is enabled or provide an explicit error message explaining that --tokenizer-model must be passed for non-HF models.
Finding type: Type Inconsistency
| plugin_path = eval_config.model_path_or_repo_id | ||
| cfg = AutoConfig.from_pretrained(plugin_path) | ||
| tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path) | ||
| if tokenizer.pad_token is None: | ||
| tokenizer.pad_token = tokenizer.eos_token |
There was a problem hiding this comment.
AutoTokenizer.from_pretrained is called on cfg.base_model_name_or_path without passing eval_config.model_token, but the private base model needs the same auth token that is later provided to PluginModelForCausalLM. As soon as the plugin's base model is private, tokenizer loading raises an authentication error even when --model-token was supplied.
| plugin_path = eval_config.model_path_or_repo_id | |
| cfg = AutoConfig.from_pretrained(plugin_path) | |
| tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path) | |
| if tokenizer.pad_token is None: | |
| tokenizer.pad_token = tokenizer.eos_token | |
| plugin_path = eval_config.model_path_or_repo_id | |
| cfg = AutoConfig.from_pretrained(plugin_path) | |
| tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name_or_path, token=eval_config.model_token) | |
| if tokenizer.pad_token is None: | |
| tokenizer.pad_token = tokenizer.eos_token |
Finding type: Logical Bugs
| from transformers import ( | ||
| PreTrainedTokenizer, | ||
| PreTrainedTokenizerFast, | ||
| ) | ||
| from transformers.data.data_collator import default_data_collator | ||
| from transformers.pipelines import pipeline | ||
|
|
||
| from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine | ||
| from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine | ||
| from llm_behavior_eval.evaluation_utils.transformers_eval_engine import ( |
There was a problem hiding this comment.
BaseEvaluator now unconditionally imports GeminiEvalEngine and PluginEvalEngine, creating hard dependencies that crash when those packages are missing—even with --no-use-gemini or --no-use-plugin. Could these imports be conditional?
| from transformers import ( | |
| PreTrainedTokenizer, | |
| PreTrainedTokenizerFast, | |
| ) | |
| from transformers.data.data_collator import default_data_collator | |
| from transformers.pipelines import pipeline | |
| from llm_behavior_eval.evaluation_utils.gemini_eval_engine import GeminiEvalEngine | |
| from llm_behavior_eval.evaluation_utils.plugin_eval_engine import PluginEvalEngine | |
| from llm_behavior_eval.evaluation_utils.transformers_eval_engine import ( | |
| from transformers import ( | |
| PreTrainedTokenizer, | |
| PreTrainedTokenizerFast, | |
| ) | |
| from transformers.data.data_collator import default_data_collator | |
| from transformers.pipelines import pipeline | |
| from llm_behavior_eval.evaluation_utils.transformers_eval_engine import ( |
Finding types: Logical Bugs
User description
Note
Adds Plugin and Gemini inference backends, exposes new CLI/config options for selection/tokenization, and updates the default results directory.
PluginEvalEngineandGeminiEvalEngine; select viaBaseEvaluator(use_plugin/use_gemini), alongside existing Transformers/vLLM.eval_collatorhelper.llm_behavior_eval/evaluate.py):--use-plugin,--use-gemini,--gemini-model-name,--tokenizer-model.results_400samples.EvaluationConfig):use_plugin,use_gemini,gemini_model_name,tokenizer_path_or_repo_idfields; propagate through evaluator creation.Written by Cursor Bugbot for commit 4fcb29a. Configure here.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
graph LR main_("main"):::modified BaseEvaluator_init_("BaseEvaluator.__init__"):::modified PluginEvalEngine_init_("PluginEvalEngine.__init__"):::added GeminiEvalEngine_init_("GeminiEvalEngine.__init__"):::added PluginEvalEngine_generate_answers_("PluginEvalEngine.generate_answers"):::added PluginModelForCausalLM_("PluginModelForCausalLM"):::added PluginEvalEngine_get_first_non_oom_batch_size_("PluginEvalEngine._get_first_non_oom_batch_size"):::added GeminiEvalEngine_generate_answers_("GeminiEvalEngine.generate_answers"):::added GeminiEvalEngine_call_gemini_single_("GeminiEvalEngine._call_gemini_single"):::added genai_Client_("genai.Client"):::added GeminiEvalEngine_build_prompts_from_input_ids_("GeminiEvalEngine._build_prompts_from_input_ids"):::added GeminiEvalEngine_tokenizer_("GeminiEvalEngine.tokenizer"):::added main_ -- "Passes new flags to select evaluation engine." --> BaseEvaluator_init_ BaseEvaluator_init_ -- "Adds PluginEvalEngine initialization on use_plugin flag." --> PluginEvalEngine_init_ BaseEvaluator_init_ -- "Adds GeminiEvalEngine initialization on use_gemini flag." --> GeminiEvalEngine_init_ PluginEvalEngine_generate_answers_ -- "Uses PluginModelForCausalLM for fused answer generation." --> PluginModelForCausalLM_ PluginEvalEngine_get_first_non_oom_batch_size_ -- "Probes batch size using PluginModelForCausalLM with backend." --> PluginModelForCausalLM_ GeminiEvalEngine_generate_answers_ -- "Sends prompts to Gemini API for answer generation." --> GeminiEvalEngine_call_gemini_single_ GeminiEvalEngine_call_gemini_single_ -- "Uses genai.Client to call Gemini API for content." --> genai_Client_ GeminiEvalEngine_build_prompts_from_input_ids_ -- "Decodes input_ids to text prompts for Gemini API." --> GeminiEvalEngine_tokenizer_ classDef added stroke:#15AA7A classDef removed stroke:#CD5270 classDef modified stroke:#EDAC4C linkStyle default stroke:#CBD5E1,font-size:13pxIntroduce new evaluation engines,
PluginEvalEngineandGeminiEvalEngine, to support model inference using a custom plugin model and the Google Gemini API. Update theBaseEvaluatorto dynamically select these engines based on new command-line interface options andEvaluationConfigsettings, while also adjusting tokenizer padding and data collation for compatibility.PluginModelForCausalLMand Google Gemini, integrating them into theBaseEvaluatorfor model inference. This includes adding new CLI options and configuration fields to select and configure these backends, and ensuring proper tokenizer setup with left padding and a neweval_collatorhelper.Modified files (5)
Latest Contributors(2)
FreeTextBiasEvaluatorfor consistency and integration with the new evaluation engines.Modified files (1)
Latest Contributors(2)