LLM-63: Add API-backed support for evaluated models and judges ( `model_engine='api'` and `judge_engine='api'`) by mishana · Pull Request #93 · Hirundo-io/llm-behavior-eval

mishana · 2026-01-19T14:39:12Z

User description

Motivation

Allow the evaluated model (not just the judge) to run via hosted/API providers using the same LiteLLM-backed code path used for API judges.
Avoid forcing HF tokenizer/model downloads for API model identifiers while still permitting prompt decoding via a supplied HF tokenizer when needed.

Description

Introduced ApiEvalEngine (llm_behavior_eval/evaluation_utils/api_eval_engine.py) that supports both judge (is_judge=True) and evaluated-model (is_judge=False) usage and decodes prompts with a tokenizer when running API-backed evaluated models.
Wire model_engine='api' into the evaluator by selecting ApiEvalEngine in BaseEvaluator when model_engine is api and by initializing the evaluated-model tokenizer from model_tokenizer_path_or_repo_id when present.
Added CLI/Config support for --model-engine api and --model-tokenizer and extended EvaluationConfig to include model_tokenizer_path_or_repo_id and validation that model_engine='api' (or judge_engine='api') cannot be combined with inference_engine.
Adapted judge formatting and processing to support API judges by adding format_judge_messages, skipping judge tokenizer setup for API judges, and updating _process_judge_prompts_batch / run_judge_with_backoff to accept JudgePrompt types and to call generate_answers_from_prompts for API engines.
Documentation and packaging: added --model-engine api docs to README.md and an api optional dependency group in pyproject.toml for litellm, openai, anthropic, boto3, and google-cloud-aiplatform.
Tests: added tests/test_api_eval_engine.py and extended tests/test_base_evaluator.py with test_get_grading_context_supports_api_judge_engine and test_api_model_engine_uses_api_eval_engine to verify LiteLLM completion kwargs and engine selection.

Testing

Added unit tests tests/test_api_eval_engine.py and updates to tests/test_base_evaluator.py that mock litellm and ApiEvalEngine behavior and assert API call construction and evaluator wiring, but no test suite (pytest) was executed in this rollout.
Project files were updated and committed successfully; dependency lock refresh attempted previously failed due to network issues, so CI should run the new tests and validate optional dependency resolution.

Codex Task

Note

Medium Risk
Refactors the core evaluation flow (dataset preprocessing, prompt formatting, and engine interfaces) while adding a new remote-API execution path; mistakes could change evaluation behavior or batching/seed determinism across backends.

Overview
Adds an API-backed inference option for both the evaluated model and judge (--model-engine api, --judge-engine api, or --inference-engine api) via a new LiteLLM-based ApiEvalEngine, including configurable concurrency and provider-side message/token trimming.

Refactors evaluation to a prompt-first pipeline: datasets now preprocess into raw test_messages/text fields with a raw-text collator, and TransformersEvalEngine/VllmEvalEngine are updated to generate from prompts (plus new optional --model-tokenizer/--judge-tokenizer overrides) with config validation to prevent tokenizer overrides when using API engines.

Updates packaging/docs/tests: adds API optional dependency groups in pyproject.toml, documents provider setup and usage in README.md, adds extensive unit tests for API and prompt-engine wiring, introduces an opt-in gpu_regression pytest marker, loads .env in the CLI when python-dotenv is available, and ignores top-level results/ in .gitignore.

^{Written by Cursor Bugbot for commit 6b9fcf7. This will update automatically on new commits. Configure here.}

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Add API-based evaluated-model and judge support through the new ApiEvalEngine, wiring CLI/config flags, README instructions, optional dependencies, and tests for LiteLLM-backed providers. Refactor the prompt-first evaluator pipeline (dataset preprocessing, BaseEvaluator, and transformers/vLLM adapters) to emit raw messages reused by all engines while maintaining tokenizer overrides where appropriate.

Topic Details

Prompt pipeline

Refactor the prompt-centric preprocessing/evaluator flow so datasets emit raw test_messages/text fields, BaseEvaluator and engines consume prompts, and free-text evaluators/tests reuse the shared prompt pipeline.

Modified files (15)

llm_behavior_eval/evaluation_utils/base_evaluator.py
llm_behavior_eval/evaluation_utils/custom_dataset.py
llm_behavior_eval/evaluation_utils/eval_engine.py
llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py
llm_behavior_eval/evaluation_utils/free_text_hallu_evaluator.py
llm_behavior_eval/evaluation_utils/free_text_injection_evaluator.py
llm_behavior_eval/evaluation_utils/transformers_eval_engine.py
llm_behavior_eval/evaluation_utils/vllm_eval_engine.py
tests/test_base_evaluator.py
tests/test_custom_dataset.py
tests/test_eval_engine_interface.py
tests/test_eval_engines.py
tests/test_free_text_bias_evaluator.py
tests/test_gpu_regression.py
tests/test_util_functions.py

Latest Contributors(2)

User	Commit	Date
blewis@hirundo.io	Improve API call robus...	May 07, 2026
shmuelyo	LLM-85: Add support tr...	May 06, 2026

Other

Other files

Modified files (8)

.gitignore
AGENTS.md
llm_behavior_eval/evaluation_utils/litellm_utils.py
llm_behavior_eval/evaluation_utils/util_functions.py
llm_behavior_eval/evaluation_utils/vllm_config.py
tests/conftest.py
tests/test_evaluate_cli.py
uv.lock

Latest Contributors(2)

User	Commit	Date
blewis@hirundo.io	Improve API call robus...	May 07, 2026
mishana4life@gmail.com	Merge main into codex/...	March 02, 2026

API eval engine

Implement API-backed model/judge execution via ApiEvalEngine, CLI/config validation, packaging/docs, and tests to exercise LiteLLM provider wiring and error handling.

Modified files (8)

README.md
llm_behavior_eval/evaluate.py
llm_behavior_eval/evaluation_utils/api_eval_engine.py
llm_behavior_eval/evaluation_utils/eval_config.py
pyproject.toml
tests/test_api_eval_engine.py
tests/test_base_evaluator.py
tests/test_eval_config.py

Latest Contributors(2)

User	Commit	Date
blewis@hirundo.io	Improve API call robus...	May 07, 2026
shmuelyo	LLM-85: Add support tr...	May 06, 2026

This pull request is reviewed by Baz. Review like a pro on (Baz).

mishana · 2026-01-19T14:46:40Z

@BugBot run

- Added dotenv support for environment variable loading. - Updated API evaluation engine to handle batch processing with concurrency. - Introduced raw text collator for API models without tokenization. - Enhanced dataset preprocessing for API models, allowing for raw input handling. - Improved judge tokenizer preparation and dataset rebuilding for API evaluations. - Refactored evaluator methods to streamline prompt generation and processing. - Set default values for config and load formats in vLLM configuration.

- Introduced batch completion support in the FakeLiteLLM for improved testing. - Added tests to validate API model and judge engine configurations against incompatible inference engines. - Enhanced test coverage for API evaluation engine's handling of default batch sizes and missing tokenizer scenarios. - Refactored evaluator tests to utilize a stubbed API evaluation engine for better isolation and reliability.

mishana · 2026-01-27T18:57:21Z

https://linear.app/hirundo/issue/LLM-63/add-api-backed-support-for-evaluated-models-and-judges

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 986cceb18c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Added detailed instructions for evaluating models via remote API endpoints, including support for Azure OpenAI, vLLM, OpenAI, Google Vertex AI, and Anthropic. - Included environment variable setup for Azure and other API services. - Updated `uv.lock` to reflect new package dependencies and version adjustments, including `boto3`, `google-api-core`, and `litellm`. - Adjusted Python version requirements for several packages to ensure compatibility.

- Introduced `EvalDataset` type to standardize dataset handling in evaluation engines. - Updated dataset type annotations in `ApiEvalEngine`, `TransformersEvalEngine`, and `VllmEvalEngine` to use `EvalDataset`. - Replaced `Dataset` with `HFDataset` in `BaseEvaluator` for improved clarity. - Enhanced type safety by refining type hints in various methods and functions. - Added dotenv support for environment variable loading in `evaluate.py` to streamline configuration.

- Reformatted conditional statements and logging calls for better clarity in `base_evaluator.py` and `custom_dataset.py`. - Enhanced readability of batch decoding in `free_text_hallu_evaluator.py` and `free_text_injection_evaluator.py`. - Simplified return statement in `test_api_eval_engine.py` for cleaner code. - Improved constructor formatting in `test_base_evaluator.py` for consistency.

mishana · 2026-01-27T21:32:01Z

@BugBot run

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…oken usage - Refactored the max_tokens assignment in `ApiEvalEngine` to utilize a dedicated method for improved clarity and flexibility. - Introduced a new test in `test_api_eval_engine.py` to verify that the model correctly uses the configured max_answer_tokens during answer generation.

benglewis

There are a few important comments

shmuelyo

very minor comments beyond @benglewis's

orr-hirundo

Lots of changes in this PR, I think it's worth running similar evals with the same judge model via API and locally, just to make sure we're not breaking anything - specifically with the parts around rebuilding the judge dataset, and the code inside base_evaluator requiring the engine to be api_engine.

greptile-apps · 2026-01-29T09:27:37Z

Greptile Overview

Confidence Score: 4/5

This PR is safe to merge with careful attention to vLLM config compatibility and integration testing
Score reflects well-structured implementation with strong unit test coverage and comprehensive documentation, but deducted 1 point due to (1) vLLM config defaults change that needs validation with existing setups, and (2) lack of integration tests with real API providers
Pay close attention to llm_behavior_eval/evaluation_utils/vllm_config.py for the config defaults change

Important Files Changed

Filename	Overview
llm_behavior_eval/evaluation_utils/api_eval_engine.py	New API evaluation engine using LiteLLM for both model and judge inference with batch processing and environment-based concurrency
llm_behavior_eval/evaluation_utils/eval_config.py	Added `model_tokenizer_path_or_repo_id`, `model_engine`, `judge_engine` fields with validation preventing API engines from combining with `inference_engine`
llm_behavior_eval/evaluation_utils/base_evaluator.py	Extended evaluator to support API engines with raw-text collator, format_judge_messages helper, and judge dataset rebuilding for API model + tokenizer-based judge scenarios
tests/test_api_eval_engine.py	Comprehensive unit tests for ApiEvalEngine covering LiteLLM kwargs, tokenizer handling, batch sizing, and environment variables
llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py	Refactored generate() and grade() to handle both tokenized and raw message batches, combined API judge prompts for better throughput

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as evaluate.py
    participant Factory as EvaluateFactory
    participant Evaluator as BaseEvaluator
    participant ApiEngine as ApiEvalEngine
    participant LiteLLM
    participant APIProvider as API Provider<br/>(OpenAI/Anthropic/etc)

    User->>CLI: llm-behavior-eval model behavior<br/>--model-engine api
    CLI->>Factory: create_evaluator(eval_config, dataset_config)
    Factory->>Evaluator: __init__(eval_config, dataset_config)
    
    alt model_engine == "api"
        Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=False)
        ApiEngine->>ApiEngine: _load_litellm()
        ApiEngine->>ApiEngine: _load_model_tokenizer()<br/>(if model_tokenizer_path_or_repo_id set)
    end
    
    Evaluator->>Evaluator: prepare_dataloader()
    
    alt api_raw_mode (no tokenizer)
        Evaluator->>Evaluator: use raw_text_collator
        Note over Evaluator: Dataset contains message lists,<br/>not token IDs
    else with tokenizer
        Evaluator->>Evaluator: tokenize prompts
        Note over Evaluator: Dataset contains input_ids
    end
    
    User->>Evaluator: evaluate()
    Evaluator->>Evaluator: generate()
    
    loop For each batch
        alt api_raw_mode
            Evaluator->>ApiEngine: generate_answers_from_prompts(messages)
        else tokenized
            Evaluator->>ApiEngine: generate_answers(input_ids, attention_mask)
            ApiEngine->>ApiEngine: decode tokens to prompts
        end
        
        ApiEngine->>ApiEngine: _normalize_messages(prompt)
        ApiEngine->>LiteLLM: batch_completion(model, messages, **kwargs)
        
        loop Concurrent requests (LLM_EVAL_API_CONCURRENCY)
            LiteLLM->>APIProvider: HTTP request
            APIProvider-->>LiteLLM: response
        end
        
        LiteLLM-->>ApiEngine: responses
        ApiEngine->>ApiEngine: _extract_content(response)
        ApiEngine-->>Evaluator: answers
    end
    
    Evaluator->>Evaluator: free_test_model()
    Evaluator->>Evaluator: get_grading_context()
    
    alt judge_engine == "api"
        Evaluator->>ApiEngine: ApiEvalEngine(eval_config, is_judge=True)
    else judge_engine != "api" && api_raw_mode
        Evaluator->>Evaluator: _build_tokenized_dataset_for_judge()
        Note over Evaluator: Rebuild dataset with judge tokenizer<br/>for tokenizer-based judge
    end
    
    Evaluator->>Evaluator: grade(generations, judge_engine)
    Evaluator->>Evaluator: format_judge_messages(messages)
    
    alt judge_engine == "api"
        Note over Evaluator: Returns message list for API
        Evaluator->>ApiEngine: generate_answers_from_prompts(judge_prompts)
        ApiEngine->>LiteLLM: batch_completion(...)
        LiteLLM->>APIProvider: concurrent judge requests
        APIProvider-->>LiteLLM: judge responses
        LiteLLM-->>ApiEngine: results
    else judge_engine != "api"
        Note over Evaluator: Returns formatted string for tokenizer
        Evaluator->>Evaluator: tokenize judge prompts
        Evaluator->>Evaluator: run_judge_with_backoff()
    end
    
    Evaluator-->>User: evaluation results

greptile-apps

_{5 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-29T09:27:45Z

Additional Comments (1)

llm_behavior_eval/evaluation_utils/vllm_config.py
check that existing vLLM configurations still work after changing defaults from None to "auto"

Prompt To Fix With AI

This is a comment left during a code review.
Path: llm_behavior_eval/evaluation_utils/vllm_config.py
Line: 11:12

Comment:
check that existing vLLM configurations still work after changing defaults from `None` to `"auto"`

How can I resolve this? If you propose a fix, please make it concise.

orr-hirundo

Looks good in general, some comments

orr-hirundo · 2026-02-25T08:27:40Z

+    Preprocess a batch for API models without a tokenizer.
+
+    Builds message lists and keeps ground-truth fields as strings.
+    Since no tokenizer is available in this path, truncation is approximated by


Just curious, if we are overestimating the number of tokens - will the API model fail or just truncate even shorter automatically?

Can you please elaborate? Do you mean "overestimating" as in our count > actual count, and then we truncate too much? If so, I don't see why the API model should fail. @orr-hirundo

Following a clarification:

Thanks, that makes sense. If we under-truncate, we first attempt message-level trimming via LiteLLM; if the request is still over the provider’s context limit, most providers will return a context-length error (some may auto-truncate, provider-dependent). We currently don’t rely on provider auto-truncation or implement an extra retry-with-smaller-prompt loop in this path.

Commit f147ab9 addressed this comment by providing an explicit clarification in the thread: the system does not rely on provider auto-truncation. If the prompt is still over the context limit after LiteLLM trimming, most providers return a context-length error (with auto-truncation being provider-dependent), and there is no retry-with-smaller-prompt loop implemented.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

# Conflicts: # llm_behavior_eval/evaluate.py # llm_behavior_eval/evaluation_utils/base_evaluator.py # llm_behavior_eval/evaluation_utils/custom_dataset.py # llm_behavior_eval/evaluation_utils/eval_engine.py # llm_behavior_eval/evaluation_utils/transformers_eval_engine.py # llm_behavior_eval/evaluation_utils/util_functions.py # llm_behavior_eval/evaluation_utils/vllm_eval_engine.py # pyproject.toml # uv.lock

# Conflicts: # uv.lock

Fix various API bugs Improve error handling by adding retries of up-to 3 times of transient errors Raise on non-transient errors rather than hiding them silently Make all of this configurable Make AGENTS.md confuse agents less

baz-reviewer · 2026-05-07T22:34:09Z

+        custom_dataset = CustomDataset(
+            self.dataset_config.file_path, self.dataset_config.dataset_type
+        )
+        preprocess_config = self.dataset_config.preprocess_config


prepare_dataset_for_grading duplicates the dataset-prep flow in prepare_dataloader, so should we extract the shared creation/selection into a helper used by both to keep shuffle, truncation, and selected indices in sync?

_{Finding type: Code Dedup and Conventions | Severity: 🟢 Low}

baz-reviewer · 2026-05-07T22:34:09Z

                "accuracy": accuracy,
                "error": 1 - accuracy,
                "empty_responses": float(empty_responses),
+                "api_error_responses": float(api_error_responses),
                "num_samples": float(self.num_samples),


api_error_responses is added to mlflow_metrics unconditionally, should we gate it behind api_error_responses > 0 (or only log for API-backed evaluators) so the emitted metric schema doesn’t change for non-API evaluators?

_{Finding type: Breaking Changes | Severity: 🟠 Medium}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluation_utils/base_evaluator.py around lines 649-654, in the `save_results(...)` MLflow logging block, the code currently adds `"api_error_responses": float(api_error_responses)` unconditionally (even when the default is 0). Change this to only log the `api_error_responses` metric when `api_error_responses > 0` (or alternatively only when running an API-backed evaluator), so non-API evaluators keep the previous MLflow metric schema. Then update/verify any tests that assert exact MLflow metric keys (e.g., tests/test_mlflow_integration.py) to confirm the metric dict remains unchanged for default/non-API runs.

baz-reviewer · 2026-05-07T22:34:09Z

                evaluator.update_dataset_config(dataset_config)
+                if judge is not None:
+                    judge.set_dataset(evaluator.eval_dataset)


In multi-dataset runs, judge.set_dataset(...) is only followed by dataset config updates, so judge preprocess limits can stay stuck on the context’s initial dataset while later datasets may need different trimming — should we call judge.set_preprocess_limits(...) after each dataset switch?

_{Finding type: Logical Bugs | Severity: 🔴 High}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluate.py around lines 706-708 inside the “Grading loop” (the `with evaluator.get_grading_context() as judge:` block), update the judge’s preprocess/token limits when switching datasets. Currently you call `evaluator.update_dataset_config(dataset_config)` and then `judge.set_dataset(evaluator.eval_dataset)`, but you never refresh preprocess limits, so multi-dataset runs reuse limits from the dataset bound when the grading context opened. Refactor by adding a call to the judge-side method that updates preprocess limits (e.g., `judge.set_preprocess_limits(...)`) immediately after `judge.set_dataset(evaluator.eval_dataset)` for each dataset, using the values derived from the just-updated `evaluator.eval_dataset` / its preprocess config, before calling `evaluator.grade(...)`. Ensure the call happens every iteration of the zip loop so API judges trim prompts consistently per dataset.

baz-reviewer · 2026-05-07T22:34:09Z

+    def _sleep_before_retry(self, retry_count: int) -> None:
+        backoff = max(0.0, self.eval_config.api_retry_backoff_seconds)
+        if backoff <= 0:
+            return
+        time.sleep(backoff * (2 ** max(0, retry_count - 1)))


_sleep_before_retry() uses an uncapped exponential delay without jitter, so should we switch to bounded exponential backoff with jitter (min/max delay) before each retry?

_{Finding type: validate endpoints and retries | Severity: 🟢 Low}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 207-211, the `_sleep_before_retry(retry_count)` method currently computes an unbounded exponential backoff (`backoff * 2**(...)`) and never caps or adds jitter. Refactor it to use a bounded exponential backoff with jitter: introduce/consume configuration for minimum delay, maximum delay, and jitter amount (or reasonable constants if config doesn’t exist), compute `sleep_time` as a capped base delay plus randomized jitter, and then call `time.sleep(sleep_time)` before retrying. Ensure you import `random` if needed and keep the retry_count-to-exponent mapping consistent with “retry_count=1 is the first retry” (adjust the exponent offset if required).

baz-reviewer · 2026-05-07T22:34:09Z

+        error_record: ApiGenerationError = {
+            "api_error_type": type(error).__name__,
+            "api_error_message": str(error),
+            "api_error_kind": error_kind,


api_error_message stores str(error) verbatim in generation_errors, then merges into generations.jsonl/responses.json via resp.update(generation_error, so should we redact or whitelist before persisting (or truncate to a safe summary)?

_{Finding type: Basic Security Patterns | Severity: 🔴 High}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluation_utils/api_eval_engine.py around lines 213-237, inside the `_record_api_skip` method, stop persisting `str(error)` verbatim into the `api_error_message` field. Refactor by adding a small sanitizer helper (e.g., `_safe_error_message(error: Exception)`) that truncates to a short length and redacts common secrets/patterns (API keys, bearer/ad tokens, `Authorization`/`api_key`/`token`-like substrings) before storing. Then update `_record_api_skip` to use the sanitized message so the redacted value is what gets written to `generations.jsonl` and later merged.

baz-reviewer · 2026-05-07T22:34:09Z

+            skipped_indices = {
+                index
+                for index, error in enumerate(generation_errors)
+                if error is not None and bool(error.get("api_error_skipped"))
+            }


generation_errors from generations.jsonl is only cast in _collect_generations() but _grade_impl() assumes it’s a dict and calls error.get(...), so can we validate/normalize the loaded generation_errors before the loop?

_{Finding type: Type Inconsistency | Severity: 🟢 Low}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluation_utils/free_text_bias_evaluator.py around lines 414-418 inside FreeTextBiasEvaluator._grade_impl, you currently build generation_errors via `generation_record.generation_errors or [...]` and then call `error.get("api_error_skipped")` when computing skipped_indices. Refactor by validating/normalizing generation_errors before the skipped_indices comprehension: ensure it is a list with length matching len(generation_record.answers), and for each entry only keep it if it is dict-like (has a .get); otherwise treat it as None. If the overall type/length is clearly wrong, fail fast with a clear schema error (ValueError) rather than letting grading crash with AttributeError.

Add API support for evaluated models

b37fa71

mishana added the codex label Jan 19, 2026 — with ChatGPT Codex Connector

cursor Bot reviewed Jan 19, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated

mishana added 2 commits January 27, 2026 18:51

mishana marked this pull request as ready for review January 27, 2026 18:54

mishana requested review from RotemHir, benglewis, orr-hirundo and shmuelyo as code owners January 27, 2026 18:54

chatgpt-codex-connector Bot reviewed Jan 27, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluation_utils/api_eval_engine.py Outdated

benglewis changed the title ~~Add API-backed support for evaluated models (model_engine='api')~~ LLM-63: Add API-backed support for evaluated models (model_engine='api') Jan 27, 2026

mishana added 3 commits January 27, 2026 20:11

cursor Bot reviewed Jan 27, 2026

View reviewed changes

benglewis reviewed Jan 27, 2026

View reviewed changes

shmuelyo reviewed Jan 28, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py Outdated

shmuelyo reviewed Jan 28, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py

shmuelyo previously approved these changes Jan 28, 2026

View reviewed changes

orr-hirundo requested changes Jan 28, 2026

View reviewed changes

greptile-apps Bot reviewed Jan 29, 2026

View reviewed changes

benglewis self-assigned this Jan 29, 2026

benglewis changed the title ~~LLM-63: Add API-backed support for evaluated models (model_engine='api')~~ LLM-63: Add API-backed support for evaluated models (model_engine='api') Feb 1, 2026

baz-reviewer Bot added baz approved and removed baz approved labels Feb 25, 2026

baz-reviewer Bot approved these changes Feb 25, 2026

View reviewed changes

orr-hirundo reviewed Feb 25, 2026

View reviewed changes

Collapse EvalEngine hierarchy into PromptEvalEngine

6b9fcf7

baz-reviewer Bot removed the baz approved label Mar 1, 2026

cursor Bot reviewed Mar 1, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluation_utils/custom_dataset.py

Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py

Comment thread llm_behavior_eval/evaluation_utils/base_evaluator.py Outdated

mishana added 2 commits March 2, 2026 11:45

Merge main into codex/add-support-for-api-gated-judge-models

2339fa1

Clean up dataset preprocess signature and judge multimodal formatting

6f791fd

baz-reviewer Bot added the baz approved label Mar 2, 2026

baz-reviewer Bot approved these changes Mar 2, 2026

View reviewed changes

mishana added 2 commits March 2, 2026 13:32

Restore tokenizer-aware truncation for local eval engines

6afd2d0

Refactor transformers batch-size probe flow

d74d3fe

baz-reviewer Bot removed the baz approved label Mar 2, 2026

baz-reviewer Bot reviewed Mar 2, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py Outdated

Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py

Comment thread llm_behavior_eval/evaluation_utils/transformers_eval_engine.py

Refactor local truncation helper and harden probe safety

1bc9ee7

baz-reviewer Bot added the baz approved label Mar 2, 2026

baz-reviewer Bot approved these changes Mar 2, 2026

View reviewed changes

Format eval engine tests for CI ruff check

adebff9

baz-reviewer Bot added baz approved and removed baz approved labels Mar 2, 2026

baz-reviewer Bot approved these changes Mar 2, 2026

View reviewed changes

mishana and others added 6 commits March 3, 2026 21:39

Merge origin/main into codex/add-support-for-api-gated-judge-models

f147ab9

Merge branch 'main' into codex/add-support-for-api-gated-judge-models

e5cb62f

# Conflicts: # uv.lock

Fix various API usage issues

2f1b486

Fix LiteLLM logs being very noisy

ba0d1e5

Improve API call robustness

f4d35b0

Fix various API bugs Improve error handling by adding retries of up-to 3 times of transient errors Raise on non-transient errors rather than hiding them silently Make all of this configurable Make AGENTS.md confuse agents less

baz-reviewer Bot removed the baz approved label May 7, 2026

baz-reviewer Bot reviewed May 7, 2026

View reviewed changes

Conversation

mishana commented Jan 19, 2026 • edited by baz-reviewer Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Motivation

Description

Testing

Generated description

Uh oh!

mishana commented Jan 19, 2026

Uh oh!

Uh oh!

mishana commented Jan 27, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

mishana commented Jan 27, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

benglewis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shmuelyo left a comment

Choose a reason for hiding this comment

Uh oh!

orr-hirundo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jan 29, 2026

Greptile Overview

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jan 29, 2026

Uh oh!

orr-hirundo left a comment

Choose a reason for hiding this comment

Uh oh!

orr-hirundo Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

mishana Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

mishana Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mishana commented Jan 19, 2026 •

edited by baz-reviewer Bot

Loading