SDK-79: Add LLM behavior eval support #189
…upport-for-llm-behavior-eval
… of `stream_results` and adding `tqdm`, unzipping and loading of CSVs. Essentially matches the Dataset QA behavior.
…oad to make it more debuggable
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be354c44a4
Greptile Overview
| Filename | Overview |
|---|---|
| hirundo/llm_behavior_eval.py | New LLM behavior evaluation client with launch/cancel/rename/archive/restore/get/list operations and both sync/async streaming support |
| `hirundo/__init__.py` | Exports new LLM behavior eval types and refactors error class hierarchy |
| hirundo/_sse_event_data.py | Centralized SSE payload parsing with Pydantic models for type safety |
| hirundo/_run_checking.py | Refactored to support both dict and SseRunEventData payloads, moved RunStatus to separate file |
| hirundo/unzip.py | Added download_and_extract_llm_behavior_eval_zip function for LLM eval results and improved variable naming |
| hirundo/dataset_qa.py | Refactored to use base HirundoError class and renamed to HirundoDatasetQaError |
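As a rough illustration of the `_run_checking.py` row above (accepting both dict and `SseRunEventData` payloads), a normalization step might look like the sketch below. This is not the code from the PR: the table says the real models are Pydantic, while this sketch swaps in a standard-library dataclass so it runs without dependencies. The field names `state` and `result` and the helper `normalize_event` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class SseRunEventData:
    """Stand-in for the typed SSE payload; field names are assumed."""

    state: str
    result: Optional[Any] = None


def normalize_event(payload: Union[dict, SseRunEventData]) -> SseRunEventData:
    """Hypothetical helper: return a typed event for either payload shape."""
    if isinstance(payload, SseRunEventData):
        # Already typed, so pass it through unchanged.
        return payload
    if isinstance(payload, dict):
        if "state" not in payload:
            raise ValueError("SSE payload is missing the 'state' field")
        return SseRunEventData(
            state=str(payload["state"]),
            result=payload.get("result"),
        )
    raise TypeError(f"Unsupported payload type: {type(payload).__name__}")
```

Normalizing at the boundary would let downstream progress code depend on one typed shape instead of branching on the payload type at every use.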
These look good to me :)

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Ben Lewis <hello@blewis.me>
mishana left a comment
LGTM, please see a few comments/questions/suggestions :)
…upport-for-llm-behavior-eval
Thank you Cursor (bugbot)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
mishana left a comment
LGTM, please see one more comment
Codex Task
Note
Medium Risk
Adds a new API surface for launching/checking LLM behavior eval runs and refactors shared run-status/SSE handling used by existing Dataset QA and unlearning flows, so regressions could affect run monitoring and bias-type compatibility.
Overview
Adds first-class LLM behavior evaluation support via `LlmBehaviorEval`, including run lifecycle actions (launch/cancel/rename/archive/restore/list), SSE-based progress tracking, and downloading/parsing eval result zips into `LlmBehaviorEvalResults` (`summary_brief`/`summary_full` DataFrames).

Refactors shared SDK primitives: introduces `HirundoError`, centralizes `RunStatus`, adds typed SSE payload parsing (`SseRunEventData`), and extracts LLM model source types into `_llm_sources`; updates unlearning bias enums to `BBQBiasType` and adjusts existing run-checking/progress utilities to handle typed events.

Updates docs and tooling: moves README quickstarts into Sphinx `docs/*.py` examples (including new eval example) and adds API reference pages for new modules; updates dev setup to use `uv` dependency groups (`--group dev`) and extends cleanup/test coverage to include archiving behavior-eval test runs.

Written by Cursor Bugbot for commit c1eba15.
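
For readers unfamiliar with the result-loading step mentioned in the overview (parsing an eval result zip into `summary_brief`/`summary_full` tables), here is a minimal sketch of the general idea. It is not the SDK's implementation: the PR reportedly loads the CSVs into DataFrames, whereas this sketch uses only the standard library and returns lists of row dicts. The class `EvalResults`, the function `load_eval_results`, and the archive member names `summary_brief.csv` and `summary_full.csv` are all assumptions.

```python
import csv
import io
import zipfile
from dataclasses import dataclass


@dataclass
class EvalResults:
    """Hypothetical container mirroring the two summary tables."""

    summary_brief: list
    summary_full: list


def _read_csv_member(archive: zipfile.ZipFile, name: str) -> list:
    """Read one CSV member of the archive into a list of row dicts."""
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def load_eval_results(
    zip_bytes: bytes,
    brief_name: str = "summary_brief.csv",  # assumed member name
    full_name: str = "summary_full.csv",  # assumed member name
) -> EvalResults:
    """Parse both summary CSVs out of an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return EvalResults(
            summary_brief=_read_csv_member(archive, brief_name),
            summary_full=_read_csv_member(archive, full_name),
        )
```

A missing member raises `KeyError` from `ZipFile.open`, which surfaces a malformed archive early rather than returning an empty table.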