LLM-72: Add radar chart plot generation support with Plotly by benglewis · Pull Request #105 · Hirundo-io/llm-behavior-eval

benglewis · 2026-02-22T10:38:48Z

User description

Note

Medium Risk
Adds new optional runtime behavior and dependencies (Plotly/MLflow artifact logging) and changes CI install/resolution behavior, which could affect environments and dependency resolution despite limited impact on the core evaluation path.

Overview
Adds optional radar-chart generation to the llm-behavior-eval CLI via --plot (and --save-plot-to-mlflow when --use-mlflow is enabled), generating Plotly HTML artifacts from summary_full.csv after a successful evaluation and tolerating plotting failures without failing the run.

Introduces a new plotly extra and evaluation_utils/plotting.py helper (draw_radar_chart) plus unit tests for plotting and new CLI validation/behavior tests. Updates dependency constraints in pyproject.toml and adjusts GitHub Actions to install the dev dependency group (and vulnerability scanning to sync all groups/extras with lowest-direct resolution).

^{Written by Cursor Bugbot for commit 3618c43. This will update automatically on new commits. Configure here.}

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Enable optional radar chart plotting from evaluate.main, validating Plotly dependencies, summarizing results, and optionally logging artifacts so the evaluation CLI stays resilient to plotting failures. Add reusable plotting helpers, unit tests, and dependency guidance so generated radar charts can be rendered and stored across environments.

Topic Details

Radar plotting flow

Enable optional radar chart generation after evaluation by wiring the CLI flags, dependency checks, summary parsing, and tolerant plot invocation into evaluate.main, while documenting the flow with new regression tests and a dedicated plotting.draw_radar_chart utility that renders Plotly radar visuals and supports MLflow artifact uploads.

Modified files (4)

llm_behavior_eval/evaluate.py
llm_behavior_eval/evaluation_utils/plotting.py
tests/test_evaluate_cli.py
tests/test_plotting.py

Latest Contributors(2)

User	Commit	Date
eliran@hirundo.io	LLM-80: Upload results...	March 24, 2026
orr@hirundo.io	LLM-76: Add option to ...	March 02, 2026

Other

Other files

Modified files (1)

uv.lock

Latest Contributors(2)

User	Commit	Date
eliran@hirundo.io	LLM-80: Upload results...	March 24, 2026
orr@hirundo.io	LLM-74: Upgrade vllm R...	February 23, 2026

Dependency sync

Align dependency metadata, developer guidance, and CI install scripts so the new plotly extra and broader dev toolchain are installed consistently across automation, documentation, and vulnerability scans.

Modified files (6)

.github/workflows/pyright.yaml
.github/workflows/ruff.yaml
.github/workflows/tests.yaml
.github/workflows/vulnerability-scan.yaml
AGENTS.md
pyproject.toml

Latest Contributors(2)

User	Commit	Date
mishana4life@gmail.com	LLM-64: Change results...	February 16, 2026
blewis@hirundo.io	LLM-29: Add optional v...	November 04, 2025

This pull request is reviewed by Baz. Review like a pro on (Baz).

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

…ing.py`

…rability scanning with `--resolution lowest-direct` works nicely

# Conflicts: # AGENTS.md # pyproject.toml # uv.lock

baz-reviewer · 2026-03-10T16:14:10Z

Spec Reviewer Report 📬

Checkout in Baz

1 / 2 requirements met for ticket:

Add charts of metrics using plotly

1 unmet requirement

#	Requirement	Explanation
1	Radar chart utility logs plots to MLflow artifacts - Not Met	The new plotly helper simply writes HTML or image files and its signature/docstring (llm_behavior_eval/evaluation_utils/plotting.py:56-67) does not accept any MLflow-related flag or perform logging, so it cannot log artifacts directly without extra code elsewhere. evidence plotting.py:56-67 signature lacks MLflow logging flag plotting.py:131-136 writes HTML/image only, no MLflow interaction

1 met requirement

#	Requirement	Explanation
1	Plotly radar chart accepts custom categories and metric labels	The new Plotly helper accepts arbitrary category lists/series and an optional metric-name mapping, and evaluate.py now feeds dataset names into the helper and uploads the saved HTML when MLflow logging is requested. evidence llm_behavior_eval/evaluation_utils/plotting.py:56-137 `draw_radar_chart` requires caller-provided categories, supports multiple `value_lists`, and honors the `metric_name_mapping`/title args for custom labels. llm_behavior_eval/evaluate.py:66-127 `_plot_metrics_from_summary` extracts dataset names for categories, calls the Plotly helper for each metric, and conditionally logs the generated HTML to MLflow when `save_plot_to_mlflow` is true.

Note: Some optional integrations are missing, so it might not be possible to check some of the requirements.
For best results, make sure the following are integrated: Figma

Used resources:
Hash: 4226731 | Ticket: link

To rerun the Spec Reviewer, comment "baz rerun spec review".

…harts-of-metrics-using-plotly # Conflicts: # pyproject.toml # uv.lock

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42267314cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

baz-reviewer · 2026-03-31T20:24:27Z

+def _plot_metrics_from_summary(
+    result_dir: Path,
+    model_path_or_repo_id: str,
+    save_plot_to_mlflow: bool,
+) -> None:
+    import pandas as pd
+
+    from llm_behavior_eval.evaluation_utils.plotting import draw_radar_chart
+
+    model_slug = model_path_or_repo_id.split("/")[-1]
+    summary_file_path = result_dir / model_slug / "summary_full.csv"
+    if not summary_file_path.exists():


_plot_metrics_from_summary derives the slug from model_path_or_repo_id and therefore misses summary_full.csv when EvaluationConfig writes to results_dir/model_slug resolved by get_model_slug(); should we derive the slug from the resolved model_output_dir by reusing EvaluationConfig/BaseEvaluator.get_model_slug() or by passing model_output_dir?

_{Finding type: Logical Bugs | Severity: 🔴 High}

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In llm_behavior_eval/evaluate.py around lines 66-85, the _plot_metrics_from_summary function builds the path to summary_full.csv using model_path_or_repo_id.split('/')[-1], which misses cases where EvaluationConfig writes summaries to a different model_slug (for example when --model-output-dir is provided or a LoRA suffix is appended). Change the function to accept or compute the actual model_output_dir/model_slug used by EvaluationConfig: either (a) accept a model_output_dir or model_slug parameter passed from the caller (the resolved output dir), or (b) reuse EvaluationConfig/BaseEvaluator.get_model_slug() to derive the slug from the resolved output directory (including LoRA suffix). Update path resolution so summary_full.csv is read from the exact summary directory, and add a fallback/error log if the file is not found. Ensure callers are updated to pass the resolved model_output_dir if needed.

@codex Please fix this

Summary

Fixed _plot_metrics_from_summary to consume a resolved model_slug directly, so it reads summary_full.csv from the actual model output folder instead of re-deriving from model_path_or_repo_id. llm_behavior_eval/evaluate.pyL66-L76

Updated the plotting call in main() to pass evaluator.get_model_slug(), aligning plot artifact lookup with evaluator output path logic (including custom model_output_dir / LoRA-derived slugs). llm_behavior_eval/evaluate.pyL679-L685

Updated plotting tests to the new function signature and added get_model_slug() to the stub evaluator so the plotting path call is validated correctly in CLI tests. tests/test_evaluate_cli.pyL24-L27 tests/test_evaluate_cli.pyL528-L549 tests/test_evaluate_cli.pyL632-L636 tests/test_evaluate_cli.pyL707-L711

Testing

✅ uv run ruff check llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py

✅ uv run basedpyright llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py

✅ uv run pytest tests/test_evaluate_cli.py -k "main_calls_plotting_when_enabled or main_keeps_success_when_plotting_fails or plot_metrics_from_summary_excludes_nan_values_per_metric or plot_metrics_from_summary_skips_when_dataset_column_missing"

Committed as a3dd10b, and I also created the follow-up PR via make_pr.

View task →

Remove unnecessary docs Plotly dependency for LLM-72

9fadc87

benglewis added the codex label Feb 22, 2026 — with ChatGPT Codex Connector

cursor Bot reviewed Feb 22, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py Outdated

Comment thread tests/test_plotting.py

benglewis self-assigned this Feb 22, 2026

benglewis added 2 commits February 22, 2026 15:12

Fix uv.lock

2ec8929

Fix Cursor's bugbot's PR comments

fad7840

cursor Bot reviewed Feb 22, 2026

View reviewed changes

Comment thread .github/workflows/vulnerability-scan.yaml Outdated

Comment thread llm_behavior_eval/evaluate.py Outdated

Drop locked flag from vulnerability-scan and lazy import pandas

891d25c

cursor Bot reviewed Feb 22, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluation_utils/plotting.py

benglewis added 3 commits February 22, 2026 16:49

Fix Cursor's bugbot's PR comment about empty value_lists and `plott…

447e2b8

…ing.py`

Switch to latest major version for most packages to ensure that vulne…

27f0d3a

…rability scanning with `--resolution lowest-direct` works nicely

Fix uv sync --resolution lowest-direct issues

3618c43

benglewis force-pushed the codex/2026-02-22/linear-mention-llm-72-add-charts-of-metrics-using-plotly branch from df5be71 to 3618c43 Compare February 22, 2026 15:47

benglewis changed the title ~~LLM-72 Remove unnecessary Plotly install from docs dependency group~~ LLM-72: Add plot generation support with Plotly Feb 25, 2026

benglewis changed the title ~~LLM-72: Add plot generation support with Plotly~~ LLM-72: Add radar chart plot generation support with Plotly Feb 25, 2026

benglewis added 2 commits March 10, 2026 17:55

Handle vllm not working on macOS (since it needs to be compiled)

e33dae9

# Conflicts: # AGENTS.md # pyproject.toml # uv.lock

Check for plotly before running llm-behavior-eval before running

4226731

benglewis marked this pull request as ready for review March 11, 2026 00:51

benglewis requested review from RotemHir, mishana, orr-hirundo and shmuelyo as code owners March 11, 2026 00:51

Merge branch 'main' into codex/2026-02-22/linear-mention-llm-72-add-c…

048e1b6

…harts-of-metrics-using-plotly # Conflicts: # pyproject.toml # uv.lock

chatgpt-codex-connector Bot reviewed Mar 11, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py Outdated

Filter NA metrics when generating radar plots

2f15431

baz-reviewer Bot reviewed Mar 16, 2026

View reviewed changes

Comment thread llm_behavior_eval/evaluate.py

benglewis added 2 commits March 17, 2026 13:04

Add test for missing Dataset plotting guard

9e49b35

Harden missing Dataset plotting test

1da0a89

baz-reviewer Bot reviewed Mar 17, 2026

View reviewed changes

Comment thread tests/test_evaluate_cli.py

Generalize metric plotting columns

0d05ffd

baz-reviewer Bot reviewed Mar 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM-72: Add radar chart plot generation support with Plotly#105

LLM-72: Add radar chart plot generation support with Plotly#105
benglewis wants to merge 14 commits into
mainfrom
codex/2026-02-22/linear-mention-llm-72-add-charts-of-metrics-using-plotly

benglewis commented Feb 22, 2026 •

edited by baz-reviewer Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

baz-reviewer Bot commented Mar 10, 2026

Add charts of metrics using plotly

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

baz-reviewer Bot Mar 31, 2026

Uh oh!

benglewis Mar 31, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

benglewis commented Feb 22, 2026 • edited by baz-reviewer Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Generated description

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

baz-reviewer Bot commented Mar 10, 2026

Spec Reviewer Report 📬

1 / 2 requirements met for ticket:

Add charts of metrics using plotly

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

baz-reviewer Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

benglewis Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

benglewis commented Feb 22, 2026 •

edited by baz-reviewer Bot

Loading