SDK-91: Refactor CLI into modular subcommands for eval, unlearning, and dataset-qa#227
SDK-91: Refactor CLI into modular subcommands for eval, unlearning, and dataset-qa#227mishana wants to merge 13 commits into
Conversation
Implement three new typer sub-apps following the skills.py pattern: - `hirundo eval run/list/check` — launch/list/monitor LLM behavior eval runs via LlmBehaviorEval.launch_eval_run / check_run_by_id / list_runs - `hirundo dataset-qa run/list/check` — launch/list/monitor Dataset QA runs via QADataset.launch_qa_run / check_run_by_id / list_runs - `hirundo unlearning run/list/check` — launch/list/monitor LLM unlearning runs via LlmUnlearningRun.launch / check_run_by_id / list Each `run` command accepts --wait/--no-wait (default: wait) to either stream tqdm progress or immediately return the run ID for scripting. Closes #220 https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
Extract repeated docs/epilog/console/Typer setup from the three CLI sub-apps into hirundo/_cli_common.py: - hirundo_epilog, docs, console — computed/instantiated once - make_app() — factory for sub-app Typer instances - validate_enum() — replaces copy-paste try/except enum coercion Remove verbose module docstrings that duplicated typer --help output. Remove local Console() instantiation from cli.py list_runs(). https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
| def validate_enum(value: str, enum_cls, label: str): | ||
| try: | ||
| return enum_cls(value.upper()) | ||
| except ValueError: | ||
| valid = ", ".join(e.value for e in enum_cls) |
There was a problem hiding this comment.
The shared CLI helper still uses the 1-character variable e and the same short-name refactor shows up in hirundo/cli_eval.py and hirundo/cli_unlearning.py, conflicting with AGENTS.md naming rules—should we rename those loop variables to descriptive identifiers?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
hirundo/_cli_common.py around lines 26-30 within the `validate_enum` function, replace
the 1-character generator variable `e` used in `', '.join(e.value for e in enum_cls)`
with a descriptive name (e.g., `enum_member`/`option`) to comply with `AGENTS.md`’s
`Coding Style & Naming` rule. Then scan `hirundo/cli_eval.py` (`col`, `run`) and
`hirundo/cli_unlearning.py` (`col`, `run`) for the same 1–3 character loop variable
pattern in similar join/comprehension logic and rename those variables to descriptive
identifiers as well, keeping the output formatting and error message text identical.
There was a problem hiding this comment.
Commit 9a4feff addressed this comment by replacing the single-character generator variable in validate_enum with a descriptive identifier, aligning the shared helper with the AGENTS naming guidelines even though the CLI-specific files were untouched.
| run_id = QADataset.launch_qa_run(dataset_id) | ||
| console.print(f"Dataset QA run started. Run ID: [bold]{run_id}[/bold]") | ||
|
|
||
| if wait: | ||
| QADataset.check_run_by_id(run_id) |
There was a problem hiding this comment.
dataset_qa_run duplicates the launch/print/wait branching in eval_run and unlearning_run — should we extract a shared helper, and also capture the return value of QADataset.check_run_by_id(run_id) to surface results.cached_zip_path (matching the check-run output in cli.py:199-200)?
Finding types: Code Dedup and Conventions Breaking Changes | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. 1. Extract the shared
wait/notification branching logic from `dataset_qa_run`, `eval_run`, and
`unlearning_run` into a centralized helper that accepts launch/monitor functions and a
command name. 2. In `dataset_qa_run` (lines 27-28), capture the return value of
`QADataset.check_run_by_id(run_id)` as `results` and print the legacy completion line
including `results.cached_zip_path`, matching the wording/format used by the existing
`check-run` output in cli.py:199-200. 3. Update `dataset_qa_check` (lines 63-72) to
print the same results-location line when streaming completes, so both commands provide
a stable contract for scripts/users.
There was a problem hiding this comment.
Commit 36663cf addressed this comment by introducing the shared wait_or_notify helper so dataset-qa, eval, and unlearning runs reuse the same wait/notify branching. dataset_qa_run now captures the check_run_by_id results and prints the cached_zip_path, and dataset_qa_check also emits the same results-location line for streaming completion, matching the check-run output contract.
| for run in runs: | ||
| table.add_row( | ||
| str(run.name), | ||
| str(run.run_id), | ||
| str(run.status), |
There was a problem hiding this comment.
The table.add_row(...) iteration block in dataset_qa_list is copy/pasted from eval_list and unlearning_list, so should we extract a shared renderer by passing column labels and a row-mapping callable?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
There was a problem hiding this comment.
Commit 36663cf addressed this comment by centralizing the run-table rendering into the new print_runs_table helper in _cli_common and wiring dataset_qa_list (alongside eval and unlearning lists) to call it with the column labels and row data.
| def dataset_qa_check( | ||
| run_id: Annotated[str, typer.Argument(help="The run ID to check.")], | ||
| ): | ||
| """ | ||
| Check the status of a Dataset QA run and stream progress. | ||
| """ | ||
| from hirundo.dataset_qa import QADataset | ||
|
|
||
| QADataset.check_run_by_id(run_id) |
There was a problem hiding this comment.
Since CLI run_id flows into QADataset.check_run_by_id(run_id) and ends up as ~/.hirundo/cache/{run_id}.zip, should we validate/sanitize run_id (e.g., disallow path separators) to prevent writing outside the cache directory?
Finding type: Basic Security Patterns | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
`hirundo/cli_dataset_qa.py` around lines 64-72, in the `dataset_qa_check(run_id:
Annotated[str, ...])` command, validate the CLI-provided `run_id` before calling
`QADataset.check_run_by_id(run_id)`. Reject any values containing path separators (`/`
or `\`), `..`, or any characters outside the expected run-id format (e.g., enforce a
strict regex such as alphanumeric/UUID). If validation fails, print a clear error and
exit without invoking `QADataset` so the downstream cache write cannot be used for path
traversal.
There was a problem hiding this comment.
Commit 23e9cc8 addressed this comment by adding a validate_run_id helper that enforces an alphanumeric/_- regex, reports invalid input, and exits before invoking QADataset. The dataset_qa_check command now uses this helper so the CLI-run ID is sanitized before any cache write occurs.
| def eval_check( | ||
| run_id: Annotated[str, typer.Argument(help="The run ID to check.")], | ||
| ): | ||
| """ | ||
| Check the status of an LLM behavior evaluation run and stream progress. | ||
| """ | ||
| from hirundo.llm_behavior_eval import LlmBehaviorEval | ||
|
|
||
| LlmBehaviorEval.check_run_by_id(run_id) |
There was a problem hiding this comment.
run_id from the CLI is used directly for ~/.hirundo/cache/{run_id}.zip in download_and_extract_llm_behavior_eval_zip(), so should we sanitize/validate it to block path separators before check_run_by_id(run_id)?
Finding type: Basic Security Patterns | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
`hirundo/cli_eval.py` around lines 99-108, in the `eval_check` CLI command, the `run_id`
argument is accepted as an arbitrary string and passed directly to
`LlmBehaviorEval.check_run_by_id(run_id)`, which downstream uses it to form a cache file
path. Add strict validation for `run_id` before calling `check_run_by_id`: reject any
value containing path separators (`/` or `\`), `..`, or other unsafe characters, and
ideally constrain it to the expected run-id format (e.g., UUID/token regex). If
validation fails, print a clear error and exit with code 1 so a crafted `hirundo eval
check` argument cannot traverse outside `~/.hirundo/cache/`.
There was a problem hiding this comment.
Commit 23e9cc8 addressed this comment by adding a run-id validator in _cli_common.py and wiring eval_check through validate_run_id(run_id) before calling check_run_by_id, ensuring only safe characters are accepted and the CLI exits with an error on invalid input.
|
|
||
| runs = QADataset.list_runs() | ||
|
|
||
| console = Console() | ||
| table = Table( | ||
| title="Runs:", | ||
| expand=True, |
There was a problem hiding this comment.
The legacy “list-runs” table is rendering the wrong run identifier — should we switch its Rich Table to use runs = QADataset.list_runs() and review the legacy row/column rendering (per hirundo/cli.py:228-232) to ensure it shows run.run_id or clearly labels the numeric DB id?
Finding type: Logical Bugs | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In `hirundo/cli.py`
around lines 209-214 inside the `list_runs()` function, the Rich `Table` is configured
with a “Run ID” column but the implementation for the row identifier is using the
wrong attribute (currently `run.id`). Refactor the table row construction so the “Run
ID” column uses the correct identifier from `QADataset.list_runs()` (use `run.run_id`
if it exists); if only a numeric DB id is available, explicitly rename the column/header
to match (e.g., “DB ID”) so the output is not misleading. Then verify that the
rendered rows match the identifier shown by the `dataset-qa list` implementation.
There was a problem hiding this comment.
Commit 36663cf addressed this comment by moving the Dataset QA list command into hirundo/cli_dataset_qa.py and printing the table with the run.run_id attribute (via the new print_runs_table helper) instead of run.id, so the "Run ID" column now matches the dataset-qa list output.
There was a problem hiding this comment.
Commit 63c2723 addressed this comment by updating the Legacy Runs table rows to display run.run_id, ensuring the Run ID column now matches the identifier shown by dataset-qa list.
ruff: - Rename make_app `help` param to `help_text` (A002 builtin shadow) - raise typer.Exit(code=1) from None in validate_enum (B904) - Remove unused Optional imports now replaced by X | None (F401) - Fix import ordering (I001, auto-fixed) - Convert Optional[X] to X | None (UP045, auto-fixed) basedpyright: - Use elif hallucination_type is not None + exhaustive else to let pyright narrow the str | None type before passing to validate_enum module-docs: - Add .rst files for the three new public CLI modules https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
| if wait: | ||
| LlmBehaviorEval.check_run_by_id(run_id) | ||
| else: | ||
| console.print( | ||
| "Use [bold]hirundo eval check[/bold] [italic]<run_id>[/italic] to monitor progress." |
There was a problem hiding this comment.
The run subcommands duplicate the same --wait/notification branch across hirundo/cli_eval.py, hirundo/cli_unlearning.py, and hirundo/cli_dataset_qa.py, so should we extract a shared helper to centralize the wait handling and messaging?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
There was a problem hiding this comment.
Commit 36663cf addressed this comment by introducing the shared wait_or_notify helper in hirundo/_cli_common.py and using it in the eval, dataset-qa, and unlearning run commands to centralize the wait/notification logic and messaging.
run_id flows into cache_dir / f"{run_id}.zip" in unzip.py.
A malicious value like ../../evil would escape ~/.hirundo/cache/.
Add validate_run_id() to _cli_common which rejects anything outside
[a-zA-Z0-9_-] (covers UUIDs and integer IDs). Apply it in the check
command of each sub-app before passing run_id to the SDK.
https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
| run_info = EvalRunInfo( | ||
| model_id=model_id, | ||
| source_run_id=source_run_id, | ||
| preset_type=preset_type, | ||
| name=name, |
There was a problem hiding this comment.
--source-run-id flows into EvalRunInfo and LlmBehaviorEval.launch_eval_run() without validate_run_id(), so it accepts values eval check/unlearning check reject — should we validate source_run_id with the shared rule before launching?
Finding type: Logical Bugs | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In hirundo/cli_eval.py
around lines 70-74 where `--source-run-id` is copied into `EvalRunInfo` and passed to
`LlmBehaviorEval.launch_eval_run()`, add validation for `source_run_id` by running it
through the shared `validate_run_id()` (the same one imported from
`hirundo._cli_common`). Refactor so the value is validated before constructing
`EvalRunInfo` (or at least immediately before calling `launch_eval_run`), and ensure any
invalid IDs fail consistently with `eval check <run_id>`/`unlearning check <run_id>`.
After the change, `--source-run-id` should be rejected with the same rules/help behavior
as other run-id parameters.
There was a problem hiding this comment.
Commit cc2e8bf addressed this comment by validating source_run_id via validate_run_id() before constructing EvalRunInfo, ensuring the CLI now rejects invalid source run IDs just like the other run-id parameters.
| def validate_run_id(run_id: str) -> str: | ||
| if not _RUN_ID_RE.match(run_id): | ||
| console.print( | ||
| f"[red]Invalid run ID '{run_id}'. " | ||
| "Run IDs may only contain alphanumeric characters, hyphens, and underscores.[/red]" | ||
| ) | ||
| raise typer.Exit(code=1) from None |
There was a problem hiding this comment.
validate_run_id() uses match() which can accept trailing newlines — should we switch to fullmatch(), and add unit tests in tests/test_cli_common.py covering valid/invalid inputs?
Finding types: Breaking Changes Add tests for behavior | Severity: 🟢 Low
Prompt for AI Agents:
In hirundo/_cli_common.py around lines 29-35, in the validate_run_id(run_id: str)
function: 1. Replace the `_RUN_ID_RE.match` call with `fullmatch` so the entire string
must consist of exactly the allowed characters, preventing trailing newlines or other
whitespace from slipping through. Ensure the error path still prints the same red error
message and that `validate_run_id` returns the original `run_id` only when it is
strictly canonical. 2. Add a focused unit test in `tests/test_cli_common.py` that
verifies: a valid run ID (containing only letters/numbers/hyphens/underscores) returns
the input unchanged, and an invalid run ID (e.g., containing a space, newline, or other
disallowed character) raises `typer.Exit` with code 1 and that the console output
contains `Invalid run ID '<value>'` and the "may only contain…" text. Use
whatever test utilities the project already uses to capture Rich/console output so the
assertion is deterministic.
There was a problem hiding this comment.
Commit 67dc58a addressed this comment by switching validate_run_id to rely on _RUN_ID_RE.fullmatch, ensuring trailing newlines no longer slip through; the requested tests for valid/invalid inputs still need to be added.
…ace cached_zip_path Adds two shared helpers to _cli_common to eliminate copy-paste across the three CLI sub-apps: wait_or_notify centralises the wait/notify branching, and print_runs_table centralises Rich table construction. Also captures and prints results.cached_zip_path for dataset-qa and eval check/run commands, matching the legacy check-run behaviour. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
The "Run ID" column was displaying the numeric DB id (run.id) instead of the string run identifier (run.run_id), inconsistent with the dataset-qa list command. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
Applies the same validate_run_id() check used by eval check/unlearning check so invalid run ID formats are rejected consistently before launching. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
…ines re.match with $ allows a single trailing newline through; fullmatch requires the entire string to consist of only the allowed characters. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
bare `type` is not iterable; `type[Enum]` lets basedpyright know the class supports iteration via EnumMeta. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
| runs = QADataset.list_runs(archived=archived) | ||
| print_runs_table( | ||
| "Dataset QA Runs:", | ||
| ("Dataset Name", "Run ID", "Status", "Created At", "Run Args"), | ||
| [ |
There was a problem hiding this comment.
dataset_qa_list duplicates QADataset.list_runs + table formatting from hirundo/cli.py:list_runs, should we reuse the existing table helper or delegate to list_runs for a single rendering source of truth?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
There was a problem hiding this comment.
Commit 9a4feff addressed this comment by importing dataset_qa_list into hirundo/cli and delegating the top-level list command to dataset_qa_list(archived=False), ensuring the Dataset QA table rendering now comes from a single shared helper.
| def wait_or_notify( | ||
| run_id: str, check_fn: Callable[[str], Any], cmd_name: str, wait: bool | ||
| ) -> Any: | ||
| if wait: | ||
| return check_fn(run_id) | ||
| console.print( | ||
| f"Use [bold]hirundo {cmd_name} check[/bold] [italic]<run_id>[/italic] to monitor progress." | ||
| ) | ||
| return None |
There was a problem hiding this comment.
wait_or_notify changed visible behavior, but the diff lacks focused tests for the wait=True/wait=False branches (including the emitted prompt), so should we add a fast unit test that covers both cases and asserts the return propagation/message?
Finding type: Add tests for behavior | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
hirundo/_cli_common.py around lines 52-60, the `wait_or_notify` helper introduces new
visible behavior for both `wait=True` (returns `check_fn(run_id)`) and `wait=False`
(prints the “Use hirundo … check …” guidance and returns None), but there are no
focused tests covering these branches. Add a fast unit test that calls `wait_or_notify`
directly with a stub `check_fn` and asserts the return value is propagated when
`wait=True`, and when `wait=False` it asserts the exact printed message (or a stable
substring) is emitted and the return is None. Reuse the repo’s existing test/CLI
helper utilities and rich `console` capturing mechanism if available (e.g., a fixture
that captures stdout/console output), and keep the test isolated from network/IO.
There was a problem hiding this comment.
Commit 9a4feff addressed this comment by adding focused tests for wait_or_notify, covering the wait=True return propagation and the wait=False console prompt/hint output.
Replaces inline Rich Table construction with the shared print_runs_table helper, removing the last duplication between list-runs and dataset-qa list. https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
| from hirundo.dataset_qa import QADataset | ||
|
|
||
| runs = QADataset.list_runs() | ||
|
|
||
| console = Console() | ||
| table = Table( | ||
| title="Runs:", | ||
| expand=True, | ||
| ) | ||
| cols = ( | ||
| "Dataset name", | ||
| "Run ID", | ||
| "Status", | ||
| "Created At", | ||
| "Run Args", | ||
| print_runs_table( | ||
| "Runs:", | ||
| ("Dataset name", "Run ID", "Status", "Created At", "Run Args"), | ||
| [ | ||
| ( | ||
| str(run.name), | ||
| str(run.run_id), | ||
| str(run.status), | ||
| run.created_at.isoformat(), |
There was a problem hiding this comment.
list_runs duplicates dataset_qa_list by calling QADataset.list_runs and rendering via print_runs_table — should we extract a shared helper, and add a focused test patching QADataset.list_runs() with deterministic data to assert print_runs_table is called with the expected headers/rows (including the run.run_args is None case)?
Finding types: Code Dedup and Conventions Add tests for behavior | Severity: 🟢 Low
Prompt for AI Agents:
In hirundo/cli.py around lines 206-219, `list_runs` duplicates the listing logic from
`dataset_qa_list`. Consider extracting a shared helper for calling
`QADataset.list_runs()` and rendering via `print_runs_table(...)`. Additionally, add a
focused test that patches `hirundo.cli.QADataset.list_runs()` to return a deterministic
list of run-like objects (with name, status, created_at.isoformat(), and
run_args.model_dump_json() or None), and patches `hirundo.cli.print_runs_table` to
assert it was called once with title "Runs:" and headers ("Dataset name", "Run ID",
"Status", "Created At", "Run Args") and rows matching the expected tuple values. Ensure
the test also covers the case where `run.run_args` is None so the last field is None.
Place the test in the appropriate existing test module for the CLI.
There was a problem hiding this comment.
Commit 9a4feff addressed this comment by delegating the CLI list_runs command to dataset_qa_list, so the shared helper now controls fetching the runs and printing the table, which removes the duplicated logic. The focused test patching QADataset.list_runs/print_runs_table mentioned in the comment still needs to be added separately.
- Rename loop variable `e` -> `member` in validate_enum - Delegate legacy list-runs to dataset_qa_list to eliminate duplicate table rendering - Add validate_run_id + None guard to legacy check-run for consistency - Add tests/test_cli_common.py covering validate_run_id (valid/invalid cases) and wait_or_notify (both wait=True/False branches) https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
User description
Summary
This PR refactors the CLI structure by extracting functionality into separate, modular subcommand modules. The main CLI now delegates to specialized apps for evaluation, unlearning, and dataset QA operations, improving code organization and maintainability.
Key Changes
New
_cli_common.pymodule: Extracted shared CLI utilities including:consoleobject for Rich outputhirundo_epilogconfigurationmake_app()helper for creating consistent Typer appsvalidate_enum()helper for enum validation with user-friendly error messagesNew
cli_eval.pymodule: Extracted evaluation-related commands:eval run: Launch LLM behavior evaluation runs with preset configurationseval list: List evaluation runs with optional filteringeval check: Monitor evaluation run progressNew
cli_unlearning.pymodule: Extracted unlearning-related commands:unlearning run: Launch LLM unlearning runs (bias or hallucination types)unlearning list: List unlearning runs with optional filteringunlearning check: Monitor unlearning run progressNew
cli_dataset_qa.pymodule: Extracted dataset QA commands:dataset-qa run: Launch Dataset QA runsdataset-qa list: List Dataset QA runs with optional filteringdataset-qa check: Monitor Dataset QA run progressUpdated
cli.py:_cli_common.py)Implementation Details
make_app()helpervalidate_enum()for consistent error handling_cli_common.consoleadd_typer()https://claude.ai/code/session_01QGjQJb8G8DM8UaYt66Jewt
Note
Medium Risk
Moderate risk because it restructures the public CLI surface area by introducing new subcommands and centralized argument validation, which can break existing invocation patterns or expose mismatches in run field names.
Overview
Refactors the
hirundoTyper CLI into modular sub-apps and registers new top-level subcommands:hirundo eval,hirundo unlearning, andhirundo dataset-qa.Adds
_cli_common.pyto centralize shared CLI setup (console,hirundo_epilog,make_app) and introducesvalidate_enum()for consistent enum parsing and user-friendly failures.Implements new
run/list/checkcommands for evaluation, unlearning, and dataset QA, including stricter flag validation (mutual exclusivity/required options) and standardized Rich table output.Reviewed by Cursor Bugbot for commit 7b375da. Configure here.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Refactor the CLI structure to route
hirundo.cli_eval,hirundo.cli_unlearning, andhirundo.cli_dataset_qathrough the main Typer app so each flow gets its ownrun/list/checkcommands. Share_cli_commonhelpers for console output, enum/run validation, and table rendering to keep every subcommand self-contained and consistent.hirundo.cli_eval,hirundo.cli_unlearning, andhirundo.cli_dataset_qaTyper apps on the main CLI to routerun/list/checkflows for evaluation, unlearning, and dataset QA.Modified files (7)
Latest Contributors(1)
_cli_commonhelpers (make_app,validate_enum,validate_run_id,wait_or_notify,print_runs_table,console) to centralize epilog handling, run validation, and table output, and cover them with tests.Modified files (2)
Latest Contributors(1)