SDK-79: Add LLM behavior eval support #189

Merged
benglewis merged 26 commits into main from codex/2026-01-13/linear-mention-sdk-79-add-support-for-llm-behavior-eval on Feb 4, 2026

Conversation

@benglewis benglewis commented Jan 13, 2026


Codex Task


Note

Medium Risk
Adds a new API surface for launching/checking LLM behavior eval runs and refactors shared run-status/SSE handling used by existing Dataset QA and unlearning flows, so regressions could affect run monitoring and bias-type compatibility.

Overview
Adds first-class LLM behavior evaluation support via LlmBehaviorEval, including run lifecycle actions (launch/cancel/rename/archive/restore/list), SSE-based progress tracking, and downloading/parsing eval result zips into LlmBehaviorEvalResults (summary_brief/summary_full DataFrames).
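
As a rough illustration of the last step (turning a downloaded results archive into tabular summaries), the sketch below uses only the standard library. The member names `summary_brief.csv` and `summary_full.csv`, the `EvalSummaries` class, and the helper `parse_eval_zip` are assumptions made for this example. The SDK itself exposes the summaries as DataFrames on `LlmBehaviorEvalResults`, and its actual archive layout is not described here.

```python
from __future__ import annotations

import csv
import io
import zipfile
from dataclasses import dataclass


@dataclass
class EvalSummaries:
    """Hypothetical stand-in for LlmBehaviorEvalResults (rows as dicts, not DataFrames)."""

    summary_brief: list[dict[str, str]]
    summary_full: list[dict[str, str]]


def _read_csv_member(archive: zipfile.ZipFile, name: str) -> list[dict[str, str]]:
    # Decode the member as UTF-8 text and parse it using the header row as keys.
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def parse_eval_zip(zip_bytes: bytes) -> EvalSummaries:
    # Member names are assumed for illustration only.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return EvalSummaries(
            summary_brief=_read_csv_member(archive, "summary_brief.csv"),
            summary_full=_read_csv_member(archive, "summary_full.csv"),
        )


def _make_demo_zip() -> bytes:
    # Build a tiny in-memory archive (made-up demo rows) so the sketch runs offline.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("summary_brief.csv", "metric,value\nexample_metric,0.9\n")
        archive.writestr(
            "summary_full.csv",
            "sample,metric,value\n1,example_metric,1\n2,example_metric,0.8\n",
        )
    return buffer.getvalue()


results = parse_eval_zip(_make_demo_zip())
print(results.summary_brief)
```

Parsing from bytes rather than a file path keeps the example independent of where the archive was downloaded to.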

Refactors shared SDK primitives: introduces HirundoError, centralizes RunStatus, adds typed SSE payload parsing (SseRunEventData), and extracts LLM model source types into _llm_sources; updates unlearning bias enums to BBQBiasType and adjusts existing run-checking/progress utilities to handle typed events.
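
To make the typed-event idea concrete, here is a minimal sketch of parsing an SSE payload into a typed object with a centralized status enum. It uses a standard-library dataclass rather than Pydantic so that it runs without dependencies. The field names (`state`, `progress`), the enum members, and the helpers `parse_event` and `is_terminal` are assumptions for illustration; they are not taken from the SDK's `SseRunEventData` or `RunStatus` definitions.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from enum import Enum


class ExampleRunStatus(str, Enum):
    """Hypothetical status values; the SDK's RunStatus may differ."""

    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass(frozen=True)
class ExampleRunEvent:
    """Hypothetical typed view of one SSE payload."""

    state: ExampleRunStatus
    progress: float | None = None


def parse_event(payload: str | dict) -> ExampleRunEvent:
    # Accept either a raw JSON string or an already-decoded dict.
    data = json.loads(payload) if isinstance(payload, str) else payload
    # The enum lookup raises ValueError for unknown states instead of passing them through.
    state = ExampleRunStatus(data["state"])
    progress = data.get("progress")
    return ExampleRunEvent(
        state=state,
        progress=None if progress is None else float(progress),
    )


def is_terminal(event: ExampleRunEvent) -> bool:
    # A run is finished once it has either succeeded or failed.
    return event.state in {ExampleRunStatus.SUCCESS, ExampleRunStatus.FAILURE}


event = parse_event('{"state": "STARTED", "progress": 42}')
print(event.state.value, event.progress, is_terminal(event))
```

Validating the status at the parsing boundary means downstream progress-tracking code can compare enum members instead of raw strings.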

Updates docs and tooling: moves README quickstarts into Sphinx docs/*.py examples (including new eval example) and adds API reference pages for new modules; updates dev setup to use uv dependency groups (--group dev) and extends cleanup/test coverage to include archiving behavior-eval test runs.

Written by Cursor Bugbot for commit c1eba15. This will update automatically on new commits.

@benglewis benglewis changed the title from "SDK-79 Add LLM eval metric models and handle list/dict metrics payloads" to "SDK-79: Add LLM behavior eval support" Jan 13, 2026
@benglewis benglewis self-assigned this Jan 13, 2026
@benglewis benglewis marked this pull request as ready for review January 28, 2026 22:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be354c44a4

Comment thread hirundo/_run_checking.py
Comment thread hirundo/llm_behavior_eval.py Outdated

greptile-apps Bot commented Jan 29, 2026

Greptile Overview

Important Files Changed

| Filename | Overview |
| --- | --- |
| hirundo/llm_behavior_eval.py | New LLM behavior evaluation client with launch/cancel/rename/archive/restore/get/list operations and both sync/async streaming support |
| hirundo/__init__.py | Exports new LLM behavior eval types and refactors error class hierarchy |
| hirundo/_sse_event_data.py | Centralized SSE payload parsing with Pydantic models for type safety |
| hirundo/_run_checking.py | Refactored to support both dict and SseRunEventData payloads, moved RunStatus to separate file |
| hirundo/unzip.py | Added download_and_extract_llm_behavior_eval_zip function for LLM eval results and improved variable naming |
| hirundo/dataset_qa.py | Refactored to use base HirundoError class and renamed to HirundoDatasetQaError |
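
The error-class refactor noted for `hirundo/dataset_qa.py` follows a common pattern: one base exception for the whole SDK, with feature-specific subclasses. A minimal sketch of that pattern follows. Only the names `HirundoError` and `HirundoDatasetQaError` come from this PR; the class bodies and the helpers `start_run` and `safe_start` are hypothetical stand-ins, not the SDK's code.

```python
class HirundoError(Exception):
    """Base class, so callers can catch every SDK error in one place (illustrative body)."""


class HirundoDatasetQaError(HirundoError):
    """Raised for Dataset QA failures (illustrative body)."""


def start_run(valid: bool) -> str:
    # Hypothetical stand-in for an SDK call that may fail.
    if not valid:
        raise HirundoDatasetQaError("dataset QA run could not be started")
    return "run-id"


def safe_start(valid: bool) -> str:
    try:
        return start_run(valid)
    except HirundoError as error:
        # Catching the base class also handles every subclass.
        return f"failed: {error}"


print(safe_start(False))
```

With a shared base class, code that monitors several run types can use a single `except` clause while feature-specific callers can still catch the narrower subclass.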

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, 4 comments

Comment thread hirundo/_sse_event_data.py Outdated
Comment thread hirundo/unzip.py Outdated
Comment thread hirundo/_run_checking.py Outdated
Comment thread hirundo/_hirundo_error.py
These look good to me :)

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Ben Lewis <hello@blewis.me>
@benglewis benglewis requested review from a team as code owners February 3, 2026 11:57
Comment thread hirundo/_sse_event_data.py Outdated
mishana previously approved these changes Feb 3, 2026

@mishana mishana left a comment

LGTM, please see a few comments/questions/suggestions :)

Comment thread hirundo/llm_behavior_eval.py
Comment thread hirundo/llm_bias_type.py Outdated
Comment thread scripts/cleanup_test_artifacts.py
Comment thread tests/llm-behavior-eval/llm_behavior_eval_test.py
Comment thread hirundo/llm_behavior_eval.py
@benglewis benglewis requested a review from mishana February 4, 2026 14:55

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread hirundo/llm_behavior_eval.py
Comment thread hirundo/_run_checking.py
Comment thread hirundo/_run_checking.py
mishana previously approved these changes Feb 4, 2026

@mishana mishana left a comment

LGTM, please see one more comment

Comment thread hirundo/llm_bias_type.py

@mishana mishana left a comment

LAGTANISIMUS

@benglewis benglewis added this pull request to the merge queue Feb 4, 2026

@eliran-hirundo eliran-hirundo left a comment

LGTM

Merged via the queue into main with commit 18fad73 Feb 4, 2026
31 checks passed
@benglewis benglewis deleted the codex/2026-01-13/linear-mention-sdk-79-add-support-for-llm-behavior-eval branch February 5, 2026 07:54