LLM-63: Add API-backed support for evaluated models and judges ( model_engine='api' and judge_engine='api')#93
Conversation
|
@BugBot run |
- Added dotenv support for environment variable loading. - Updated API evaluation engine to handle batch processing with concurrency. - Introduced raw text collator for API models without tokenization. - Enhanced dataset preprocessing for API models, allowing for raw input handling. - Improved judge tokenizer preparation and dataset rebuilding for API evaluations. - Refactored evaluator methods to streamline prompt generation and processing. - Set default values for config and load formats in vLLM configuration.
- Introduced batch completion support in the FakeLiteLLM for improved testing. - Added tests to validate API model and judge engine configurations against incompatible inference engines. - Enhanced test coverage for API evaluation engine's handling of default batch sizes and missing tokenizer scenarios. - Refactored evaluator tests to utilize a stubbed API evaluation engine for better isolation and reliability.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 986cceb18c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Added detailed instructions for evaluating models via remote API endpoints, including support for Azure OpenAI, vLLM, OpenAI, Google Vertex AI, and Anthropic. - Included environment variable setup for Azure and other API services. - Updated `uv.lock` to reflect new package dependencies and version adjustments, including `boto3`, `google-api-core`, and `litellm`. - Adjusted Python version requirements for several packages to ensure compatibility.
- Introduced `EvalDataset` type to standardize dataset handling in evaluation engines. - Updated dataset type annotations in `ApiEvalEngine`, `TransformersEvalEngine`, and `VllmEvalEngine` to use `EvalDataset`. - Replaced `Dataset` with `HFDataset` in `BaseEvaluator` for improved clarity. - Enhanced type safety by refining type hints in various methods and functions. - Added dotenv support for environment variable loading in `evaluate.py` to streamline configuration.
- Reformatted conditional statements and logging calls for better clarity in `base_evaluator.py` and `custom_dataset.py`. - Enhanced readability of batch decoding in `free_text_hallu_evaluator.py` and `free_text_injection_evaluator.py`. - Simplified return statement in `test_api_eval_engine.py` for cleaner code. - Improved constructor formatting in `test_base_evaluator.py` for consistency.
|
@BugBot run |
…oken usage - Refactored the max_tokens assignment in `ApiEvalEngine` to utilize a dedicated method for improved clarity and flexibility. - Introduced a new test in `test_api_eval_engine.py` to verify that the model correctly uses the configured max_answer_tokens during answer generation.
benglewis
left a comment
There was a problem hiding this comment.
There are a few important comments
shmuelyo
left a comment
There was a problem hiding this comment.
very minor comments beyond @benglewis's
orr-hirundo
left a comment
There was a problem hiding this comment.
Lots of changes in this PR, I think it's worth running similar evals with the same judge model via API and locally, just to make sure we're not breaking anything - specifically with the parts around rebuilding the judge dataset, and the code inside base_evaluator requiring the engine to be api_engine.
Greptile OverviewConfidence Score: 4/5
|
| Filename | Overview |
|---|---|
| llm_behavior_eval/evaluation_utils/api_eval_engine.py | New API evaluation engine using LiteLLM for both model and judge inference with batch processing and environment-based concurrency |
| llm_behavior_eval/evaluation_utils/eval_config.py | Added model_tokenizer_path_or_repo_id, model_engine, judge_engine fields with validation preventing API engines from combining with inference_engine |
| llm_behavior_eval/evaluation_utils/base_evaluator.py | Extended evaluator to support API engines with raw-text collator, format_judge_messages helper, and judge dataset rebuilding for API model + tokenizer-based judge scenarios |
| tests/test_api_eval_engine.py | Comprehensive unit tests for ApiEvalEngine covering LiteLLM kwargs, tokenizer handling, batch sizing, and environment variables |
| llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py | Refactored generate() and grade() to handle both tokenized and raw message batches, combined API judge prompts for better throughput |
Sequence Diagram
sequenceDiagram
participant User
participant CLI as evaluate.py
participant Factory as EvaluateFactory
participant Evaluator as BaseEvaluator
participant ApiEngine as ApiEvalEngine
participant LiteLLM
participant APIProvider as API Provider<br/>(OpenAI/Anthropic/etc)
User->>CLI: llm-behavior-eval model behavior<br/>--model-engine api
CLI->>Factory: create_evaluator(eval_config, dataset_config)
Factory->>Evaluator: __init__(eval_config, dataset_config)
alt model_engine == "api"
Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=False)
ApiEngine->>ApiEngine: _load_litellm()
ApiEngine->>ApiEngine: _load_model_tokenizer()<br/>(if model_tokenizer_path_or_repo_id set)
end
Evaluator->>Evaluator: prepare_dataloader()
alt api_raw_mode (no tokenizer)
Evaluator->>Evaluator: use raw_text_collator
Note over Evaluator: Dataset contains message lists,<br/>not token IDs
else with tokenizer
Evaluator->>Evaluator: tokenize prompts
Note over Evaluator: Dataset contains input_ids
end
User->>Evaluator: evaluate()
Evaluator->>Evaluator: generate()
loop For each batch
alt api_raw_mode
Evaluator->>ApiEngine: generate_answers_from_prompts(messages)
else tokenized
Evaluator->>ApiEngine: generate_answers(input_ids, attention_mask)
ApiEngine->>ApiEngine: decode tokens to prompts
end
ApiEngine->>ApiEngine: _normalize_messages(prompt)
ApiEngine->>LiteLLM: batch_completion(model, messages, **kwargs)
loop Concurrent requests (LLM_EVAL_API_CONCURRENCY)
LiteLLM->>APIProvider: HTTP request
APIProvider-->>LiteLLM: response
end
LiteLLM-->>ApiEngine: responses
ApiEngine->>ApiEngine: _extract_content(response)
ApiEngine-->>Evaluator: answers
end
Evaluator->>Evaluator: free_test_model()
Evaluator->>Evaluator: get_grading_context()
alt judge_engine == "api"
Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=True)
else judge_engine != "api" && api_raw_mode
Evaluator->>Evaluator: _build_tokenized_dataset_for_judge()
Note over Evaluator: Rebuild dataset with judge tokenizer<br/>for tokenizer-based judge
end
Evaluator->>Evaluator: grade(generations, judge_engine)
Evaluator->>Evaluator: format_judge_messages(messages)
alt judge_engine == "api"
Note over Evaluator: Returns message list for API
Evaluator->>ApiEngine: generate_answers_from_prompts(judge_prompts)
ApiEngine->>LiteLLM: batch_completion(...)
LiteLLM->>APIProvider: concurrent judge requests
APIProvider-->>LiteLLM: judge responses
LiteLLM-->>ApiEngine: results
else judge_engine != "api"
Note over Evaluator: Returns formatted string for tokenizer
Evaluator->>Evaluator: tokenize judge prompts
Evaluator->>Evaluator: run_judge_with_backoff()
end
Evaluator-->>User: evaluation results
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: llm_behavior_eval/evaluation_utils/vllm_config.py
Line: 11:12
Comment:
check that existing vLLM configurations still work after changing defaults from `None` to `"auto"`
How can I resolve this? If you propose a fix, please make it concise. |
model_engine='api')
orr-hirundo
left a comment
There was a problem hiding this comment.
Looks good in general, some comments
| Preprocess a batch for API models without a tokenizer. | ||
|
|
||
| Builds message lists and keeps ground-truth fields as strings. | ||
| Since no tokenizer is available in this path, truncation is approximated by |
There was a problem hiding this comment.
Just curious, if we are overestimating the number of tokens - will the API model fail or just truncate even shorter automatically?
There was a problem hiding this comment.
Can you please elaborate? Do you mean "overestimating" as in our count > actual count, and then we truncate too much? If so, I don't see why the API model should fail. @orr-hirundo
There was a problem hiding this comment.
Following a clarification:
Thanks, that makes sense. If we under-truncate, we first attempt message-level trimming via LiteLLM; if the request is still over the provider’s context limit, most providers will return a context-length error (some may auto-truncate, provider-dependent). We currently don’t rely on provider auto-truncation or implement an extra retry-with-smaller-prompt loop in this path.
There was a problem hiding this comment.
Commit f147ab9 addressed this comment by providing an explicit clarification in the thread: the system does not rely on provider auto-truncation. If the prompt is still over the context limit after LiteLLM trimming, most providers return a context-length error (with auto-truncation being provider-dependent), and there is no retry-with-smaller-prompt loop implemented.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
# Conflicts: # llm_behavior_eval/evaluate.py # llm_behavior_eval/evaluation_utils/base_evaluator.py # llm_behavior_eval/evaluation_utils/custom_dataset.py # llm_behavior_eval/evaluation_utils/eval_engine.py # llm_behavior_eval/evaluation_utils/transformers_eval_engine.py # llm_behavior_eval/evaluation_utils/util_functions.py # llm_behavior_eval/evaluation_utils/vllm_eval_engine.py # pyproject.toml # uv.lock
# Conflicts: # uv.lock
Fix various API bugs Improve error handling by adding retries of up-to 3 times of transient errors Raise on non-transient errors rather than hiding them silently Make all of this configurable Make AGENTS.md confuse agents less
| custom_dataset = CustomDataset( | ||
| self.dataset_config.file_path, self.dataset_config.dataset_type | ||
| ) | ||
| preprocess_config = self.dataset_config.preprocess_config |
There was a problem hiding this comment.
prepare_dataset_for_grading duplicates the dataset-prep flow in prepare_dataloader, so should we extract the shared creation/selection into a helper used by both to keep shuffle, truncation, and selected indices in sync?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
| "accuracy": accuracy, | ||
| "error": 1 - accuracy, | ||
| "empty_responses": float(empty_responses), | ||
| "api_error_responses": float(api_error_responses), | ||
| "num_samples": float(self.num_samples), |
There was a problem hiding this comment.
api_error_responses is added to mlflow_metrics unconditionally, should we gate it behind api_error_responses > 0 (or only log for API-backed evaluators) so the emitted metric schema doesn’t change for non-API evaluators?
Finding type: Breaking Changes | Severity: 🟠 Medium
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/base_evaluator.py around lines 649-654, in the
`save_results(...)` MLflow logging block, the code currently adds
`"api_error_responses": float(api_error_responses)` unconditionally (even when the
default is 0). Change this to only log the `api_error_responses` metric when
`api_error_responses > 0` (or alternatively only when running an API-backed evaluator),
so non-API evaluators keep the previous MLflow metric schema. Then update/verify any
tests that assert exact MLflow metric keys (e.g., tests/test_mlflow_integration.py) to
confirm the metric dict remains unchanged for default/non-API runs.
| evaluator.update_dataset_config(dataset_config) | ||
| if judge is not None: | ||
| judge.set_dataset(evaluator.eval_dataset) |
There was a problem hiding this comment.
In multi-dataset runs, judge.set_dataset(...) is only followed by dataset config updates, so judge preprocess limits can stay stuck on the context’s initial dataset while later datasets may need different trimming — should we call judge.set_preprocess_limits(...) after each dataset switch?
Finding type: Logical Bugs | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluate.py around lines 706-708 inside the “Grading loop” (the
`with evaluator.get_grading_context() as judge:` block), update the judge’s
preprocess/token limits when switching datasets. Currently you call
`evaluator.update_dataset_config(dataset_config)` and then
`judge.set_dataset(evaluator.eval_dataset)`, but you never refresh preprocess limits, so
multi-dataset runs reuse limits from the dataset bound when the grading context opened.
Refactor by adding a call to the judge-side method that updates preprocess limits (e.g.,
`judge.set_preprocess_limits(...)`) immediately after
`judge.set_dataset(evaluator.eval_dataset)` for each dataset, using the values derived
from the just-updated `evaluator.eval_dataset` / its preprocess config, before calling
`evaluator.grade(...)`. Ensure the call happens every iteration of the zip loop so API
judges trim prompts consistently per dataset.
| def _sleep_before_retry(self, retry_count: int) -> None: | ||
| backoff = max(0.0, self.eval_config.api_retry_backoff_seconds) | ||
| if backoff <= 0: | ||
| return | ||
| time.sleep(backoff * (2 ** max(0, retry_count - 1))) |
There was a problem hiding this comment.
_sleep_before_retry() uses an uncapped exponential delay without jitter, so should we switch to bounded exponential backoff with jitter (min/max delay) before each retry?
Finding type: validate endpoints and retries | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 207-211, the
`_sleep_before_retry(retry_count)` method currently computes an unbounded exponential
backoff (`backoff * 2**(...)`) and never caps or adds jitter. Refactor it to use a
bounded exponential backoff with jitter: introduce/consume configuration for minimum
delay, maximum delay, and jitter amount (or reasonable constants if config doesn’t
exist), compute `sleep_time` as a capped base delay plus randomized jitter, and then
call `time.sleep(sleep_time)` before retrying. Ensure you import `random` if needed and
keep the retry_count-to-exponent mapping consistent with “retry_count=1 is the first
retry” (adjust the exponent offset if required).
| error_record: ApiGenerationError = { | ||
| "api_error_type": type(error).__name__, | ||
| "api_error_message": str(error), | ||
| "api_error_kind": error_kind, |
There was a problem hiding this comment.
api_error_message stores str(error) verbatim in generation_errors, then merges into generations.jsonl/responses.json via resp.update(generation_error, so should we redact or whitelist before persisting (or truncate to a safe summary)?
Finding type: Basic Security Patterns | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 213-237, inside the
`_record_api_skip` method, stop persisting `str(error)` verbatim into the
`api_error_message` field. Refactor by adding a small sanitizer helper (e.g.,
`_safe_error_message(error: Exception)`) that truncates to a short length and redacts
common secrets/patterns (API keys, bearer/ad tokens,
`Authorization`/`api_key`/`token`-like substrings) before storing. Then update
`_record_api_skip` to use the sanitized message so the redacted value is what gets
written to `generations.jsonl` and later merged.
| skipped_indices = { | ||
| index | ||
| for index, error in enumerate(generation_errors) | ||
| if error is not None and bool(error.get("api_error_skipped")) | ||
| } |
There was a problem hiding this comment.
generation_errors from generations.jsonl is only cast in _collect_generations() but _grade_impl() assumes it’s a dict and calls error.get(...), so can we validate/normalize the loaded generation_errors before the loop?
Finding type: Type Inconsistency | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py around lines 414-418
inside FreeTextBiasEvaluator._grade_impl, you currently build generation_errors via
`generation_record.generation_errors or [...]` and then call
`error.get("api_error_skipped")` when computing skipped_indices. Refactor by
validating/normalizing generation_errors before the skipped_indices comprehension:
ensure it is a list with length matching len(generation_record.answers), and for each
entry only keep it if it is dict-like (has a .get); otherwise treat it as None. If the
overall type/length is clearly wrong, fail fast with a clear schema error (ValueError)
rather than letting grading crash with AttributeError.
User description
Motivation
Description
ApiEvalEngine(llm_behavior_eval/evaluation_utils/api_eval_engine.py) that supports both judge (is_judge=True) and evaluated-model (is_judge=False) usage and decodes prompts with a tokenizer when running API-backed evaluated models.model_engine='api'into the evaluator by selectingApiEvalEngineinBaseEvaluatorwhenmodel_engineisapiand by initializing the evaluated-model tokenizer frommodel_tokenizer_path_or_repo_idwhen present.--model-engine apiand--model-tokenizerand extendedEvaluationConfigto includemodel_tokenizer_path_or_repo_idand validation thatmodel_engine='api'(orjudge_engine='api') cannot be combined withinference_engine.format_judge_messages, skipping judge tokenizer setup for API judges, and updating_process_judge_prompts_batch/run_judge_with_backoffto acceptJudgePrompttypes and to callgenerate_answers_from_promptsfor API engines.--model-engine apidocs toREADME.mdand anapioptional dependency group inpyproject.tomlforlitellm,openai,anthropic,boto3, andgoogle-cloud-aiplatform.tests/test_api_eval_engine.pyand extendedtests/test_base_evaluator.pywithtest_get_grading_context_supports_api_judge_engineandtest_api_model_engine_uses_api_eval_engineto verify LiteLLM completion kwargs and engine selection.Testing
tests/test_api_eval_engine.pyand updates totests/test_base_evaluator.pythat mocklitellmandApiEvalEnginebehavior and assert API call construction and evaluator wiring, but no test suite (pytest) was executed in this rollout.Codex Task
Note
Medium Risk
Refactors the core evaluation flow (dataset preprocessing, prompt formatting, and engine interfaces) while adding a new remote-API execution path; mistakes could change evaluation behavior or batching/seed determinism across backends.
Overview
Adds an API-backed inference option for both the evaluated model and judge (
--model-engine api,--judge-engine api, or--inference-engine api) via a new LiteLLM-basedApiEvalEngine, including configurable concurrency and provider-side message/token trimming.Refactors evaluation to a prompt-first pipeline: datasets now preprocess into raw
test_messages/text fields with a raw-text collator, andTransformersEvalEngine/VllmEvalEngineare updated to generate from prompts (plus new optional--model-tokenizer/--judge-tokenizeroverrides) with config validation to prevent tokenizer overrides when using API engines.Updates packaging/docs/tests: adds API optional dependency groups in
pyproject.toml, documents provider setup and usage inREADME.md, adds extensive unit tests for API and prompt-engine wiring, introduces an opt-ingpu_regressionpytest marker, loads.envin the CLI whenpython-dotenvis available, and ignores top-levelresults/in.gitignore.Written by Cursor Bugbot for commit 6b9fcf7. This will update automatically on new commits. Configure here.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Add API-based evaluated-model and judge support through the new
ApiEvalEngine, wiring CLI/config flags, README instructions, optional dependencies, and tests for LiteLLM-backed providers. Refactor the prompt-first evaluator pipeline (dataset preprocessing, BaseEvaluator, and transformers/vLLM adapters) to emit raw messages reused by all engines while maintaining tokenizer overrides where appropriate.test_messages/text fields, BaseEvaluator and engines consume prompts, and free-text evaluators/tests reuse the shared prompt pipeline.Modified files (15)
Latest Contributors(2)
Modified files (8)
Latest Contributors(2)
ApiEvalEngine, CLI/config validation, packaging/docs, and tests to exercise LiteLLM provider wiring and error handling.Modified files (8)
Latest Contributors(2)