Skip to content

LLM-63: Add API-backed support for evaluated models and judges ( model_engine='api' and judge_engine='api')#93

Open
mishana wants to merge 68 commits into
mainfrom
codex/add-support-for-api-gated-judge-models
Open

LLM-63: Add API-backed support for evaluated models and judges ( model_engine='api' and judge_engine='api')#93
mishana wants to merge 68 commits into
mainfrom
codex/add-support-for-api-gated-judge-models

Conversation

@mishana
Copy link
Copy Markdown
Contributor

@mishana mishana commented Jan 19, 2026

User description

Motivation

  • Allow the evaluated model (not just the judge) to run via hosted/API providers using the same LiteLLM-backed code path used for API judges.
  • Avoid forcing HF tokenizer/model downloads for API model identifiers while still permitting prompt decoding via a supplied HF tokenizer when needed.

Description

  • Introduced ApiEvalEngine (llm_behavior_eval/evaluation_utils/api_eval_engine.py) that supports both judge (is_judge=True) and evaluated-model (is_judge=False) usage and decodes prompts with a tokenizer when running API-backed evaluated models.
  • Wire model_engine='api' into the evaluator by selecting ApiEvalEngine in BaseEvaluator when model_engine is api and by initializing the evaluated-model tokenizer from model_tokenizer_path_or_repo_id when present.
  • Added CLI/Config support for --model-engine api and --model-tokenizer and extended EvaluationConfig to include model_tokenizer_path_or_repo_id and validation that model_engine='api' (or judge_engine='api') cannot be combined with inference_engine.
  • Adapted judge formatting and processing to support API judges by adding format_judge_messages, skipping judge tokenizer setup for API judges, and updating _process_judge_prompts_batch / run_judge_with_backoff to accept JudgePrompt types and to call generate_answers_from_prompts for API engines.
  • Documentation and packaging: added --model-engine api docs to README.md and an api optional dependency group in pyproject.toml for litellm, openai, anthropic, boto3, and google-cloud-aiplatform.
  • Tests: added tests/test_api_eval_engine.py and extended tests/test_base_evaluator.py with test_get_grading_context_supports_api_judge_engine and test_api_model_engine_uses_api_eval_engine to verify LiteLLM completion kwargs and engine selection.

Testing

  • Added unit tests tests/test_api_eval_engine.py and updates to tests/test_base_evaluator.py that mock litellm and ApiEvalEngine behavior and assert API call construction and evaluator wiring, but no test suite (pytest) was executed in this rollout.
  • Project files were updated and committed successfully; dependency lock refresh attempted previously failed due to network issues, so CI should run the new tests and validate optional dependency resolution.

Codex Task


Note

Medium Risk
Refactors the core evaluation flow (dataset preprocessing, prompt formatting, and engine interfaces) while adding a new remote-API execution path; mistakes could change evaluation behavior or batching/seed determinism across backends.

Overview
Adds an API-backed inference option for both the evaluated model and judge (--model-engine api, --judge-engine api, or --inference-engine api) via a new LiteLLM-based ApiEvalEngine, including configurable concurrency and provider-side message/token trimming.

Refactors evaluation to a prompt-first pipeline: datasets now preprocess into raw test_messages/text fields with a raw-text collator, and TransformersEvalEngine/VllmEvalEngine are updated to generate from prompts (plus new optional --model-tokenizer/--judge-tokenizer overrides) with config validation to prevent tokenizer overrides when using API engines.

Updates packaging/docs/tests: adds API optional dependency groups in pyproject.toml, documents provider setup and usage in README.md, adds extensive unit tests for API and prompt-engine wiring, introduces an opt-in gpu_regression pytest marker, loads .env in the CLI when python-dotenv is available, and ignores top-level results/ in .gitignore.

Written by Cursor Bugbot for commit 6b9fcf7. This will update automatically on new commits. Configure here.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Add API-based evaluated-model and judge support through the new ApiEvalEngine, wiring CLI/config flags, README instructions, optional dependencies, and tests for LiteLLM-backed providers. Refactor the prompt-first evaluator pipeline (dataset preprocessing, BaseEvaluator, and transformers/vLLM adapters) to emit raw messages reused by all engines while maintaining tokenizer overrides where appropriate.

TopicDetails
Prompt pipeline Refactor the prompt-centric preprocessing/evaluator flow so datasets emit raw test_messages/text fields, BaseEvaluator and engines consume prompts, and free-text evaluators/tests reuse the shared prompt pipeline.
Modified files (15)
  • llm_behavior_eval/evaluation_utils/base_evaluator.py
  • llm_behavior_eval/evaluation_utils/custom_dataset.py
  • llm_behavior_eval/evaluation_utils/eval_engine.py
  • llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py
  • llm_behavior_eval/evaluation_utils/free_text_hallu_evaluator.py
  • llm_behavior_eval/evaluation_utils/free_text_injection_evaluator.py
  • llm_behavior_eval/evaluation_utils/transformers_eval_engine.py
  • llm_behavior_eval/evaluation_utils/vllm_eval_engine.py
  • tests/test_base_evaluator.py
  • tests/test_custom_dataset.py
  • tests/test_eval_engine_interface.py
  • tests/test_eval_engines.py
  • tests/test_free_text_bias_evaluator.py
  • tests/test_gpu_regression.py
  • tests/test_util_functions.py
Latest Contributors(2)
UserCommitDate
blewis@hirundo.ioImprove API call robus...May 07, 2026
shmuelyoLLM-85: Add support tr...May 06, 2026
Other Other files
Modified files (8)
  • .gitignore
  • AGENTS.md
  • llm_behavior_eval/evaluation_utils/litellm_utils.py
  • llm_behavior_eval/evaluation_utils/util_functions.py
  • llm_behavior_eval/evaluation_utils/vllm_config.py
  • tests/conftest.py
  • tests/test_evaluate_cli.py
  • uv.lock
Latest Contributors(2)
UserCommitDate
blewis@hirundo.ioImprove API call robus...May 07, 2026
mishana4life@gmail.comMerge main into codex/...March 02, 2026
API eval engine Implement API-backed model/judge execution via ApiEvalEngine, CLI/config validation, packaging/docs, and tests to exercise LiteLLM provider wiring and error handling.
Modified files (8)
  • README.md
  • llm_behavior_eval/evaluate.py
  • llm_behavior_eval/evaluation_utils/api_eval_engine.py
  • llm_behavior_eval/evaluation_utils/eval_config.py
  • pyproject.toml
  • tests/test_api_eval_engine.py
  • tests/test_base_evaluator.py
  • tests/test_eval_config.py
Latest Contributors(2)
UserCommitDate
blewis@hirundo.ioImprove API call robus...May 07, 2026
shmuelyoLLM-85: Add support tr...May 06, 2026
This pull request is reviewed by Baz. Review like a pro on (Baz).

@mishana
Copy link
Copy Markdown
Contributor Author

mishana commented Jan 19, 2026

@BugBot run

Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated
- Added dotenv support for environment variable loading.
- Updated API evaluation engine to handle batch processing with concurrency.
- Introduced raw text collator for API models without tokenization.
- Enhanced dataset preprocessing for API models, allowing for raw input handling.
- Improved judge tokenizer preparation and dataset rebuilding for API evaluations.
- Refactored evaluator methods to streamline prompt generation and processing.
- Set default values for config and load formats in vLLM configuration.
- Introduced batch completion support in the FakeLiteLLM for improved testing.
- Added tests to validate API model and judge engine configurations against incompatible inference engines.
- Enhanced test coverage for API evaluation engine's handling of default batch sizes and missing tokenizer scenarios.
- Refactored evaluator tests to utilize a stubbed API evaluation engine for better isolation and reliability.
@mishana mishana marked this pull request as ready for review January 27, 2026 18:54
@mishana
Copy link
Copy Markdown
Contributor Author

mishana commented Jan 27, 2026

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 986cceb18c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated
@benglewis benglewis changed the title Add API-backed support for evaluated models (model_engine='api') LLM-63: Add API-backed support for evaluated models (model_engine='api') Jan 27, 2026
- Added detailed instructions for evaluating models via remote API endpoints, including support for Azure OpenAI, vLLM, OpenAI, Google Vertex AI, and Anthropic.
- Included environment variable setup for Azure and other API services.
- Updated `uv.lock` to reflect new package dependencies and version adjustments, including `boto3`, `google-api-core`, and `litellm`.
- Adjusted Python version requirements for several packages to ensure compatibility.
- Introduced `EvalDataset` type to standardize dataset handling in evaluation engines.
- Updated dataset type annotations in `ApiEvalEngine`, `TransformersEvalEngine`, and `VllmEvalEngine` to use `EvalDataset`.
- Replaced `Dataset` with `HFDataset` in `BaseEvaluator` for improved clarity.
- Enhanced type safety by refining type hints in various methods and functions.
- Added dotenv support for environment variable loading in `evaluate.py` to streamline configuration.
- Reformatted conditional statements and logging calls for better clarity in `base_evaluator.py` and `custom_dataset.py`.
- Enhanced readability of batch decoding in `free_text_hallu_evaluator.py` and `free_text_injection_evaluator.py`.
- Simplified return statement in `test_api_eval_engine.py` for cleaner code.
- Improved constructor formatting in `test_base_evaluator.py` for consistency.
@mishana
Copy link
Copy Markdown
Contributor Author

mishana commented Jan 27, 2026

@BugBot run

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…oken usage

- Refactored the max_tokens assignment in `ApiEvalEngine` to utilize a dedicated method for improved clarity and flexibility.
- Introduced a new test in `test_api_eval_engine.py` to verify that the model correctly uses the configured max_answer_tokens during answer generation.
Copy link
Copy Markdown
Contributor

@benglewis benglewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few important comments

Comment thread llm_behavior_eval/evaluate.py Outdated
Comment thread llm_behavior_eval/evaluate.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py
Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/eval_config.py
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluate.py Outdated
Comment thread llm_behavior_eval/evaluate.py
shmuelyo
shmuelyo previously approved these changes Jan 28, 2026
Copy link
Copy Markdown
Contributor

@shmuelyo shmuelyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor comments beyond @benglewis's

Copy link
Copy Markdown
Contributor

@orr-hirundo orr-hirundo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of changes in this PR, I think it's worth running similar evals with the same judge model via API and locally, just to make sure we're not breaking anything - specifically with the parts around rebuilding the judge dataset, and the code inside base_evaluator requiring the engine to be api_engine.

Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 29, 2026

Greptile Overview

Confidence Score: 4/5

  • This PR is safe to merge with careful attention to vLLM config compatibility and integration testing
  • Score reflects well-structured implementation with strong unit test coverage and comprehensive documentation, but deducted 1 point due to (1) vLLM config defaults change that needs validation with existing setups, and (2) lack of integration tests with real API providers
  • Pay close attention to llm_behavior_eval/evaluation_utils/vllm_config.py for the config defaults change

Important Files Changed

Filename Overview
llm_behavior_eval/evaluation_utils/api_eval_engine.py New API evaluation engine using LiteLLM for both model and judge inference with batch processing and environment-based concurrency
llm_behavior_eval/evaluation_utils/eval_config.py Added model_tokenizer_path_or_repo_id, model_engine, judge_engine fields with validation preventing API engines from combining with inference_engine
llm_behavior_eval/evaluation_utils/base_evaluator.py Extended evaluator to support API engines with raw-text collator, format_judge_messages helper, and judge dataset rebuilding for API model + tokenizer-based judge scenarios
tests/test_api_eval_engine.py Comprehensive unit tests for ApiEvalEngine covering LiteLLM kwargs, tokenizer handling, batch sizing, and environment variables
llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py Refactored generate() and grade() to handle both tokenized and raw message batches, combined API judge prompts for better throughput

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as evaluate.py
    participant Factory as EvaluateFactory
    participant Evaluator as BaseEvaluator
    participant ApiEngine as ApiEvalEngine
    participant LiteLLM
    participant APIProvider as API Provider<br/>(OpenAI/Anthropic/etc)

    User->>CLI: llm-behavior-eval model behavior<br/>--model-engine api
    CLI->>Factory: create_evaluator(eval_config, dataset_config)
    Factory->>Evaluator: __init__(eval_config, dataset_config)
    
    alt model_engine == "api"
        Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=False)
        ApiEngine->>ApiEngine: _load_litellm()
        ApiEngine->>ApiEngine: _load_model_tokenizer()<br/>(if model_tokenizer_path_or_repo_id set)
    end
    
    Evaluator->>Evaluator: prepare_dataloader()
    
    alt api_raw_mode (no tokenizer)
        Evaluator->>Evaluator: use raw_text_collator
        Note over Evaluator: Dataset contains message lists,<br/>not token IDs
    else with tokenizer
        Evaluator->>Evaluator: tokenize prompts
        Note over Evaluator: Dataset contains input_ids
    end
    
    User->>Evaluator: evaluate()
    Evaluator->>Evaluator: generate()
    
    loop For each batch
        alt api_raw_mode
            Evaluator->>ApiEngine: generate_answers_from_prompts(messages)
        else tokenized
            Evaluator->>ApiEngine: generate_answers(input_ids, attention_mask)
            ApiEngine->>ApiEngine: decode tokens to prompts
        end
        
        ApiEngine->>ApiEngine: _normalize_messages(prompt)
        ApiEngine->>LiteLLM: batch_completion(model, messages, **kwargs)
        
        loop Concurrent requests (LLM_EVAL_API_CONCURRENCY)
            LiteLLM->>APIProvider: HTTP request
            APIProvider-->>LiteLLM: response
        end
        
        LiteLLM-->>ApiEngine: responses
        ApiEngine->>ApiEngine: _extract_content(response)
        ApiEngine-->>Evaluator: answers
    end
    
    Evaluator->>Evaluator: free_test_model()
    Evaluator->>Evaluator: get_grading_context()
    
    alt judge_engine == "api"
        Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=True)
    else judge_engine != "api" && api_raw_mode
        Evaluator->>Evaluator: _build_tokenized_dataset_for_judge()
        Note over Evaluator: Rebuild dataset with judge tokenizer<br/>for tokenizer-based judge
    end
    
    Evaluator->>Evaluator: grade(generations, judge_engine)
    Evaluator->>Evaluator: format_judge_messages(messages)
    
    alt judge_engine == "api"
        Note over Evaluator: Returns message list for API
        Evaluator->>ApiEngine: generate_answers_from_prompts(judge_prompts)
        ApiEngine->>LiteLLM: batch_completion(...)
        LiteLLM->>APIProvider: concurrent judge requests
        APIProvider-->>LiteLLM: judge responses
        LiteLLM-->>ApiEngine: results
    else judge_engine != "api"
        Note over Evaluator: Returns formatted string for tokenizer
        Evaluator->>Evaluator: tokenize judge prompts
        Evaluator->>Evaluator: run_judge_with_backoff()
    end
    
    Evaluator-->>User: evaluation results
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 29, 2026

Additional Comments (1)

llm_behavior_eval/evaluation_utils/vllm_config.py
check that existing vLLM configurations still work after changing defaults from None to "auto"

Prompt To Fix With AI
This is a comment left during a code review.
Path: llm_behavior_eval/evaluation_utils/vllm_config.py
Line: 11:12

Comment:
check that existing vLLM configurations still work after changing defaults from `None` to `"auto"`

How can I resolve this? If you propose a fix, please make it concise.

@benglewis benglewis self-assigned this Jan 29, 2026
@benglewis benglewis changed the title LLM-63: Add API-backed support for evaluated models (model_engine='api') LLM-63: Add API-backed support for evaluated models (model_engine='api') Feb 1, 2026
Copy link
Copy Markdown
Contributor

@orr-hirundo orr-hirundo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, some comments

Preprocess a batch for API models without a tokenizer.

Builds message lists and keeps ground-truth fields as strings.
Since no tokenizer is available in this path, truncation is approximated by
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, if we are overestimating the number of tokens - will the API model fail or just truncate even shorter automatically?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate? Do you mean "overestimating" as in our count > actual count, and then we truncate too much? If so, I don't see why the API model should fail. @orr-hirundo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following a clarification:

Thanks, that makes sense. If we under-truncate, we first attempt message-level trimming via LiteLLM; if the request is still over the provider’s context limit, most providers will return a context-length error (some may auto-truncate, provider-dependent). We currently don’t rely on provider auto-truncation or implement an extra retry-with-smaller-prompt loop in this path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit f147ab9 addressed this comment by providing an explicit clarification in the thread: the system does not rely on provider auto-truncation. If the prompt is still over the context limit after LiteLLM trimming, most providers return a context-length error (with auto-truncation being provider-dependent), and there is no retry-with-smaller-prompt loop implemented.

Comment thread llm_behavior_eval/evaluation_utils/custom_dataset.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/eval_engine.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread llm_behavior_eval/evaluation_utils/custom_dataset.py
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py
Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py Outdated
Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py
Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py
mishana and others added 6 commits March 3, 2026 21:39
# Conflicts:
#	llm_behavior_eval/evaluate.py
#	llm_behavior_eval/evaluation_utils/base_evaluator.py
#	llm_behavior_eval/evaluation_utils/custom_dataset.py
#	llm_behavior_eval/evaluation_utils/eval_engine.py
#	llm_behavior_eval/evaluation_utils/transformers_eval_engine.py
#	llm_behavior_eval/evaluation_utils/util_functions.py
#	llm_behavior_eval/evaluation_utils/vllm_eval_engine.py
#	pyproject.toml
#	uv.lock
Fix various API bugs
Improve error handling by adding retries of up-to 3 times of transient
errors
Raise on non-transient errors rather than hiding them silently
Make all of this configurable
Make AGENTS.md confuse agents less
Comment on lines +455 to +458
custom_dataset = CustomDataset(
self.dataset_config.file_path, self.dataset_config.dataset_type
)
preprocess_config = self.dataset_config.preprocess_config
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepare_dataset_for_grading duplicates the dataset-prep flow in prepare_dataloader, so should we extract the shared creation/selection into a helper used by both to keep shuffle, truncation, and selected indices in sync?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low

Comment on lines 650 to 654
"accuracy": accuracy,
"error": 1 - accuracy,
"empty_responses": float(empty_responses),
"api_error_responses": float(api_error_responses),
"num_samples": float(self.num_samples),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

api_error_responses is added to mlflow_metrics unconditionally, should we gate it behind api_error_responses > 0 (or only log for API-backed evaluators) so the emitted metric schema doesn’t change for non-API evaluators?

Finding type: Breaking Changes | Severity: 🟠 Medium


Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/base_evaluator.py around lines 649-654, in the
`save_results(...)` MLflow logging block, the code currently adds
`"api_error_responses": float(api_error_responses)` unconditionally (even when the
default is 0). Change this to only log the `api_error_responses` metric when
`api_error_responses > 0` (or alternatively only when running an API-backed evaluator),
so non-API evaluators keep the previous MLflow metric schema. Then update/verify any
tests that assert exact MLflow metric keys (e.g., tests/test_mlflow_integration.py) to
confirm the metric dict remains unchanged for default/non-API runs.

Comment on lines 706 to +708
evaluator.update_dataset_config(dataset_config)
if judge is not None:
judge.set_dataset(evaluator.eval_dataset)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In multi-dataset runs, judge.set_dataset(...) is only followed by dataset config updates, so judge preprocess limits can stay stuck on the context’s initial dataset while later datasets may need different trimming — should we call judge.set_preprocess_limits(...) after each dataset switch?

Finding type: Logical Bugs | Severity: 🔴 High


Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluate.py around lines 706-708 inside the “Grading loop” (the
`with evaluator.get_grading_context() as judge:` block), update the judge’s
preprocess/token limits when switching datasets. Currently you call
`evaluator.update_dataset_config(dataset_config)` and then
`judge.set_dataset(evaluator.eval_dataset)`, but you never refresh preprocess limits, so
multi-dataset runs reuse limits from the dataset bound when the grading context opened.
Refactor by adding a call to the judge-side method that updates preprocess limits (e.g.,
`judge.set_preprocess_limits(...)`) immediately after
`judge.set_dataset(evaluator.eval_dataset)` for each dataset, using the values derived
from the just-updated `evaluator.eval_dataset` / its preprocess config, before calling
`evaluator.grade(...)`. Ensure the call happens every iteration of the zip loop so API
judges trim prompts consistently per dataset.

Comment on lines +207 to +211
def _sleep_before_retry(self, retry_count: int) -> None:
backoff = max(0.0, self.eval_config.api_retry_backoff_seconds)
if backoff <= 0:
return
time.sleep(backoff * (2 ** max(0, retry_count - 1)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_sleep_before_retry() uses an uncapped exponential delay without jitter, so should we switch to bounded exponential backoff with jitter (min/max delay) before each retry?

Finding type: validate endpoints and retries | Severity: 🟢 Low


Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 207-211, the
`_sleep_before_retry(retry_count)` method currently computes an unbounded exponential
backoff (`backoff * 2**(...)`) and never caps or adds jitter. Refactor it to use a
bounded exponential backoff with jitter: introduce/consume configuration for minimum
delay, maximum delay, and jitter amount (or reasonable constants if config doesn’t
exist), compute `sleep_time` as a capped base delay plus randomized jitter, and then
call `time.sleep(sleep_time)` before retrying. Ensure you import `random` if needed and
keep the retry_count-to-exponent mapping consistent with “retry_count=1 is the first
retry” (adjust the exponent offset if required).

Comment on lines +220 to +223
error_record: ApiGenerationError = {
"api_error_type": type(error).__name__,
"api_error_message": str(error),
"api_error_kind": error_kind,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

api_error_message stores str(error) verbatim in generation_errors, then merges into generations.jsonl/responses.json via resp.update(generation_error, so should we redact or whitelist before persisting (or truncate to a safe summary)?

Finding type: Basic Security Patterns | Severity: 🔴 High


Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 213-237, inside the
`_record_api_skip` method, stop persisting `str(error)` verbatim into the
`api_error_message` field. Refactor by adding a small sanitizer helper (e.g.,
`_safe_error_message(error: Exception)`) that truncates to a short length and redacts
common secrets/patterns (API keys, bearer/ad tokens,
`Authorization`/`api_key`/`token`-like substrings) before storing. Then update
`_record_api_skip` to use the sanitized message so the redacted value is what gets
written to `generations.jsonl` and later merged.

Comment on lines +414 to +418
skipped_indices = {
index
for index, error in enumerate(generation_errors)
if error is not None and bool(error.get("api_error_skipped"))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generation_errors from generations.jsonl is only cast in _collect_generations() but _grade_impl() assumes it’s a dict and calls error.get(...), so can we validate/normalize the loaded generation_errors before the loop?

Finding type: Type Inconsistency | Severity: 🟢 Low


Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py around lines 414-418
inside FreeTextBiasEvaluator._grade_impl, you currently build generation_errors via
`generation_record.generation_errors or [...]` and then call
`error.get("api_error_skipped")` when computing skipped_indices. Refactor by
validating/normalizing generation_errors before the skipped_indices comprehension:
ensure it is a list with length matching len(generation_record.answers), and for each
entry only keep it if it is dict-like (has a .get); otherwise treat it as None. If the
overall type/length is clearly wrong, fail fast with a clear schema error (ValueError)
rather than letting grading crash with AttributeError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants