SDK-79: Add LLM behavior eval support by benglewis · Pull Request #189 · Hirundo-io/hirundo-python-sdk

benglewis · 2026-01-13T13:44:52Z

Note

Medium Risk
Adds a new API surface for launching/checking LLM behavior eval runs and refactors shared run-status/SSE handling used by existing Dataset QA and unlearning flows, so regressions could affect run monitoring and bias-type compatibility.

Overview
Adds first-class LLM behavior evaluation support via LlmBehaviorEval, including run lifecycle actions (launch/cancel/rename/archive/restore/list), SSE-based progress tracking, and downloading/parsing eval result zips into LlmBehaviorEvalResults (summary_brief/summary_full DataFrames).
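As a rough illustration of the download-and-parse step described above, the sketch below reads two CSV files out of a zip archive. The member names `summary_brief.csv` and `summary_full.csv`, the `load_eval_results` helper, the `EvalResults` container, and the demo rows are all assumptions made for this example, not the SDK's actual archive layout or API. The SDK exposes DataFrames; this sketch uses only the standard library and returns plain row dictionaries.

```python
import csv
import io
import zipfile
from dataclasses import dataclass


@dataclass
class EvalResults:
    """Hypothetical stand-in for LlmBehaviorEvalResults (rows, not DataFrames)."""

    summary_brief: list[dict[str, str]]
    summary_full: list[dict[str, str]]


def _read_csv(archive: zipfile.ZipFile, member: str) -> list[dict[str, str]]:
    # Decode the member as UTF-8 text and parse it using the header row as keys.
    with archive.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def load_eval_results(zip_bytes: bytes) -> EvalResults:
    # Assumed member names; the real archive layout may differ.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return EvalResults(
            summary_brief=_read_csv(archive, "summary_brief.csv"),
            summary_full=_read_csv(archive, "summary_full.csv"),
        )


def _make_demo_zip() -> bytes:
    # Build a tiny in-memory archive (made-up data) so the sketch runs offline.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("summary_brief.csv", "metric,value\ndemo_metric,0.9\n")
        archive.writestr(
            "summary_full.csv",
            "sample,metric,value\n1,demo_metric,1\n2,demo_metric,0\n",
        )
    return buffer.getvalue()


results = load_eval_results(_make_demo_zip())
print(results.summary_brief)
```

Reading from an in-memory buffer keeps the example independent of any server; a real client would presumably download the archive first and might extract it to disk instead.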

Refactors shared SDK primitives: introduces HirundoError, centralizes RunStatus, adds typed SSE payload parsing (SseRunEventData), and extracts LLM model source types into _llm_sources; updates unlearning bias enums to BBQBiasType and adjusts existing run-checking/progress utilities to handle typed events.
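To make the idea of typed SSE payload parsing behind a shared error base class more concrete, here is a minimal sketch. Everything in it is hypothetical: `SdkError` plays the role that `HirundoError` plays in the SDK, `DemoRunStatus` and its values stand in for `RunStatus`, and the `state` and `progress` fields are assumed rather than taken from `SseRunEventData`. The Greptile summary of this PR says the SDK uses Pydantic models; the sketch uses standard-library dataclasses to stay self-contained.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SdkError(Exception):
    """Hypothetical base error shared by all client errors."""


class InvalidPayloadError(SdkError):
    """Raised when an SSE data payload cannot be interpreted."""


class DemoRunStatus(str, Enum):
    # Illustrative values only; the SDK's real status set is not shown here.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass(frozen=True)
class RunEvent:
    """Typed view of one SSE data payload (assumed fields)."""

    state: DemoRunStatus
    progress: Optional[float] = None


def parse_sse_payload(data: str) -> RunEvent:
    # Convert every parsing failure into one SDK-level error so callers
    # only need to catch the shared base class.
    try:
        raw = json.loads(data)
        state = DemoRunStatus(raw["state"])
        progress = raw.get("progress")
        progress_value = None if progress is None else float(progress)
    except (ValueError, KeyError, TypeError, AttributeError) as error:
        raise InvalidPayloadError(f"Invalid SSE payload: {data!r}") from error
    return RunEvent(state=state, progress=progress_value)


event = parse_sse_payload('{"state": "STARTED", "progress": 42}')
print(event)
```

Funnelling malformed payloads into a single exception type is one way to get the debuggability that the "add error log for invalid SSE payload" commit in this PR is after; whether the SDK raises or only logs in that case is not stated on this page.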

Updates docs and tooling: moves README quickstarts into Sphinx docs/*.py examples (including new eval example) and adds API reference pages for new modules; updates dev setup to use uv dependency groups (--group dev) and extends cleanup/test coverage to include archiving behavior-eval test runs.

Written by Cursor Bugbot for commit c1eba15. This will update automatically on new commits.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be354c44a4

greptile-apps · 2026-01-29T09:22:38Z

Greptile Overview

Important Files Changed

| Filename | Overview |
| --- | --- |
| `hirundo/llm_behavior_eval.py` | New LLM behavior evaluation client with launch/cancel/rename/archive/restore/get/list operations and both sync/async streaming support |
| `hirundo/__init__.py` | Exports new LLM behavior eval types and refactors error class hierarchy |
| `hirundo/_sse_event_data.py` | Centralized SSE payload parsing with Pydantic models for type safety |
| `hirundo/_run_checking.py` | Refactored to support both dict and SseRunEventData payloads, moved RunStatus to separate file |
| `hirundo/unzip.py` | Added download_and_extract_llm_behavior_eval_zip function for LLM eval results and improved variable naming |
| `hirundo/dataset_qa.py` | Refactored to use base HirundoError class and renamed to HirundoDatasetQaError |
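The `_run_checking.py` entry mentions accepting both dict and typed payloads. One way such dual support could look is sketched below. `TypedEvent`, `event_state`, `is_finished`, and the status strings are hypothetical names and values chosen for this example; they are not taken from the SDK.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class TypedEvent:
    """Hypothetical typed payload, standing in for SseRunEventData."""

    state: str
    result: Optional[Any] = None


def event_state(event: Union[dict, TypedEvent]) -> Optional[str]:
    # Older call sites pass plain dicts; newer ones pass the typed object.
    if isinstance(event, TypedEvent):
        return event.state
    return event.get("state")


def is_finished(
    event: Union[dict, TypedEvent],
    terminal_states: frozenset = frozenset({"SUCCESS", "FAILURE"}),
) -> bool:
    # Assumed terminal states; a real client would use its own status enum.
    return event_state(event) in terminal_states


print(is_finished({"state": "STARTED"}), is_finished(TypedEvent(state="SUCCESS")))
```

Normalizing at a single access point like `event_state` lets existing dict-based callers keep working while new code moves to the typed form.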

greptile-apps

4 files reviewed, 4 comments

mishana

LGTM, please see a few comments/questions/suggestions :)

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

mishana

LGTM, please see one more comment

mishana

LAGTANISIMUS

eliran-hirundo

LGTM

Add LLM eval metric models

1c091b4

benglewis added the codex label Jan 13, 2026 — with ChatGPT Codex Connector

benglewis changed the title ~~SDK-79 Add LLM eval metric models and handle list/dict metrics payloads~~ SDK-79: Add LLM behavior eval support Jan 13, 2026

benglewis self-assigned this Jan 13, 2026

Format llm_behavior_eval

1a047bb

benglewis and others added 2 commits January 13, 2026 23:43

Fix optional type hints in llm behavior eval

06d040c

Merge branch 'main' into codex/2026-01-13/linear-mention-sdk-79-add-s…

202a481

…upport-for-llm-behavior-eval

benglewis added 8 commits January 28, 2026 21:49

Basic first implementation of matching naming for check_run instead…

415f7e3

… of `stream_results` and adding `tqdm`, unzipping and loading of CSVs Essentially matching the data QA behavior

Add AGENTS.md and new dependency_groups entry of dev for development

b9027d8

Fix RunStatus circular dependency

d141b9d

Drop unnecessary TypeAdapter and add error log for invalid SSE payl…

c62aad3

…oad to make it more debuggable

Update AGENTS.md to use context7 and not use 1-3 character variable…

d42c945

… names

Add assertion for summary_brief and summary_full to LLM behavior …

808aed7

…eval test

Fix SSE payload parsing

c872f89

Try to fix unzip for LLM behavior eval results

be354c4

benglewis marked this pull request as ready for review January 28, 2026 22:05

benglewis requested review from ddishi, mishana and shmuelyo as code owners January 28, 2026 22:05

chatgpt-codex-connector Bot reviewed Jan 28, 2026

Comment thread hirundo/_run_checking.py

Comment thread hirundo/llm_behavior_eval.py Outdated

greptile-apps Bot reviewed Jan 29, 2026

Comment thread hirundo/_sse_event_data.py Outdated

Comment thread hirundo/unzip.py Outdated

Comment thread hirundo/_run_checking.py Outdated

Comment thread hirundo/_hirundo_error.py

Apply Greptile suggestions from code review

5391f96

These look good to me :) Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Ben Lewis <hello@blewis.me>

benglewis requested review from a team as code owners February 3, 2026 11:57

SDK-79: Apply ruff format to llm behavior eval

4374468

cursor Bot reviewed Feb 3, 2026

Comment thread hirundo/_sse_event_data.py Outdated

mishana previously approved these changes Feb 3, 2026

Comment thread hirundo/llm_behavior_eval.py

Comment thread hirundo/llm_bias_type.py Outdated

Comment thread scripts/cleanup_test_artifacts.py

Comment thread tests/llm-behavior-eval/llm_behavior_eval_test.py

benglewis added 2 commits February 4, 2026 16:19

Merge branch 'main' into codex/2026-01-13/linear-mention-sdk-79-add-s…

ea7f62b

…upport-for-llm-behavior-eval

Update README.md and documentation (docs)

6bc3126

benglewis dismissed mishana’s stale review via 6bc3126 February 4, 2026 14:30

benglewis added 2 commits February 4, 2026 16:33

Drop Python code from README.md

263f19f

Fix circular import

6a96100

Thank you Cursor (bugbot)

cursor Bot reviewed Feb 4, 2026

Comment thread hirundo/llm_behavior_eval.py

benglewis added 4 commits February 4, 2026 16:41

Rename BiasType to BBQBiasType

97b2431

Fix progress_bar not being closed if there is an error with the run

ba0392a
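The progress-bar fix named in the commit above is a common pattern: close the bar in a `finally` block so it is released whether the run succeeds or fails. The sketch below shows that pattern under assumptions. `FakeProgressBar` is a stand-in for a tqdm-style bar (tqdm objects also provide `update()` and `close()`), and `RunFailedError`, `track_run`, and the event fields are hypothetical, not the SDK's code.

```python
class FakeProgressBar:
    """Minimal stand-in for a tqdm-style bar with update() and close()."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0
        self.closed = False

    def update(self, amount: int) -> None:
        self.done += amount

    def close(self) -> None:
        self.closed = True


class RunFailedError(Exception):
    """Hypothetical error raised when an event reports a failed run."""


def track_run(events, progress_bar) -> None:
    try:
        for event in events:
            if event["state"] == "FAILURE":
                raise RunFailedError(event.get("detail", "run failed"))
            progress_bar.update(event.get("step", 0))
    finally:
        # Runs on success and on error, so the bar is never left open.
        progress_bar.close()


bar = FakeProgressBar(total=10)
try:
    track_run([{"state": "STARTED", "step": 5}, {"state": "FAILURE"}], bar)
except RunFailedError:
    pass
print(bar.closed)
```

With a real tqdm bar, a `with` block would give the same guarantee; `try`/`finally` is shown here because it works with any object that exposes `close()`.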

Add deleted_at to EvalRunRecord

beb20a6

Add cleanup for LLM behavior eval runs

56b538f

benglewis requested a review from mishana February 4, 2026 14:55

cursor Bot reviewed Feb 4, 2026

Comment thread hirundo/llm_behavior_eval.py

eliran-hirundo reviewed Feb 4, 2026

Comment thread hirundo/_run_checking.py

Comment thread hirundo/_run_checking.py

mishana previously approved these changes Feb 4, 2026

Comment thread hirundo/llm_bias_type.py

Add UnqoverBiasType as per @mishana 's PR comment

6c641ec

benglewis dismissed mishana’s stale review via 6c641ec February 4, 2026 15:35

Fix Cursor's bugbot's comment

c1eba15

benglewis requested review from eliran-hirundo and mishana February 4, 2026 15:42

mishana approved these changes Feb 4, 2026

benglewis added this pull request to the merge queue Feb 4, 2026

eliran-hirundo reviewed Feb 4, 2026

Merged via the queue into main with commit 18fad73 Feb 4, 2026
31 checks passed

benglewis temporarily deployed to github-pages February 4, 2026 18:05 — with GitHub Actions Inactive

benglewis deleted the codex/2026-01-13/linear-mention-sdk-79-add-support-for-llm-behavior-eval branch February 5, 2026 07:54

3 participants