Skip to content

LLM-72: Add radar chart plot generation support with Plotly#105

Open
benglewis wants to merge 14 commits into
mainfrom
codex/2026-02-22/linear-mention-llm-72-add-charts-of-metrics-using-plotly
Open

LLM-72: Add radar chart plot generation support with Plotly#105
benglewis wants to merge 14 commits into
mainfrom
codex/2026-02-22/linear-mention-llm-72-add-charts-of-metrics-using-plotly

Conversation

@benglewis
Copy link
Copy Markdown
Contributor

@benglewis benglewis commented Feb 22, 2026

User description


Codex Task


Note

Medium Risk
Adds new optional runtime behavior and dependencies (Plotly/MLflow artifact logging) and changes CI install/resolution behavior, which could affect environments and dependency resolution despite limited impact on the core evaluation path.

Overview
Adds optional radar-chart generation to the llm-behavior-eval CLI via --plot (and --save-plot-to-mlflow when --use-mlflow is enabled), generating Plotly HTML artifacts from summary_full.csv after a successful evaluation and tolerating plotting failures without failing the run.

Introduces a new plotly extra and evaluation_utils/plotting.py helper (draw_radar_chart) plus unit tests for plotting and new CLI validation/behavior tests. Updates dependency constraints in pyproject.toml and adjusts GitHub Actions to install the dev dependency group (and vulnerability scanning to sync all groups/extras with lowest-direct resolution).

Written by Cursor Bugbot for commit 3618c43. This will update automatically on new commits. Configure here.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Enable optional radar chart plotting from evaluate.main, validating Plotly dependencies, summarizing results, and optionally logging artifacts so the evaluation CLI stays resilient to plotting failures. Add reusable plotting helpers, unit tests, and dependency guidance so generated radar charts can be rendered and stored across environments.

TopicDetails
Radar plotting flow Enable optional radar chart generation after evaluation by wiring the CLI flags, dependency checks, summary parsing, and tolerant plot invocation into evaluate.main, while documenting the flow with new regression tests and a dedicated plotting.draw_radar_chart utility that renders Plotly radar visuals and supports MLflow artifact uploads.
Modified files (4)
  • llm_behavior_eval/evaluate.py
  • llm_behavior_eval/evaluation_utils/plotting.py
  • tests/test_evaluate_cli.py
  • tests/test_plotting.py
Latest Contributors(2)
UserCommitDate
eliran@hirundo.ioLLM-80: Upload results...March 24, 2026
orr@hirundo.ioLLM-76: Add option to ...March 02, 2026
Other Other files
Modified files (1)
  • uv.lock
Latest Contributors(2)
UserCommitDate
eliran@hirundo.ioLLM-80: Upload results...March 24, 2026
orr@hirundo.ioLLM-74: Upgrade vllm R...February 23, 2026
Dependency sync Align dependency metadata, developer guidance, and CI install scripts so the new plotly extra and broader dev toolchain are installed consistently across automation, documentation, and vulnerability scans.
Modified files (6)
  • .github/workflows/pyright.yaml
  • .github/workflows/ruff.yaml
  • .github/workflows/tests.yaml
  • .github/workflows/vulnerability-scan.yaml
  • AGENTS.md
  • pyproject.toml
Latest Contributors(2)
UserCommitDate
mishana4life@gmail.comLLM-64: Change results...February 16, 2026
blewis@hirundo.ioLLM-29: Add optional v...November 04, 2025
This pull request is reviewed by Baz. Review like a pro on (Baz).

Comment thread llm_behavior_eval/evaluate.py Outdated
Comment thread tests/test_plotting.py
@benglewis benglewis self-assigned this Feb 22, 2026
Comment thread .github/workflows/vulnerability-scan.yaml Outdated
Comment thread llm_behavior_eval/evaluate.py Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread llm_behavior_eval/evaluation_utils/plotting.py
@benglewis benglewis force-pushed the codex/2026-02-22/linear-mention-llm-72-add-charts-of-metrics-using-plotly branch from df5be71 to 3618c43 Compare February 22, 2026 15:47
@benglewis benglewis changed the title LLM-72 Remove unnecessary Plotly install from docs dependency group LLM-72: Add plot generation support with Plotly Feb 25, 2026
@benglewis benglewis changed the title LLM-72: Add plot generation support with Plotly LLM-72: Add radar chart plot generation support with Plotly Feb 25, 2026
@baz-reviewer
Copy link
Copy Markdown
Contributor

baz-reviewer Bot commented Mar 10, 2026

Spec Reviewer Report    📬

Checkout in Baz

1 / 2 requirements met for ticket:

Add charts of metrics using plotly


1 unmet requirement
# Requirement Explanation
1 Radar chart utility logs plots to MLflow artifacts - Not Met The new plotly helper simply writes HTML or image files and its signature/docstring (llm_behavior_eval/evaluation_utils/plotting.py:56-67) does not accept any MLflow-related flag or perform logging, so it cannot log artifacts directly without extra code elsewhere.
evidenceplotting.py:56-67 signature lacks MLflow logging flag plotting.py:131-136 writes HTML/image only, no MLflow interaction
1 met requirement
# Requirement Explanation
1 Plotly radar chart accepts custom categories and metric labels The new Plotly helper accepts arbitrary category lists/series and an optional metric-name mapping, and evaluate.py now feeds dataset names into the helper and uploads the saved HTML when MLflow logging is requested.
evidence
  • llm_behavior_eval/evaluation_utils/plotting.py:56-137 `draw_radar_chart` requires caller-provided categories, supports multiple `value_lists`, and honors the `metric_name_mapping`/title args for custom labels.
  • llm_behavior_eval/evaluate.py:66-127 `_plot_metrics_from_summary` extracts dataset names for categories, calls the Plotly helper for each metric, and conditionally logs the generated HTML to MLflow when `save_plot_to_mlflow` is true.

Note: Some optional integrations are missing, so it might not be possible to check some of the requirements.
For best results, make sure the following are integrated: Figma



Used resources:
Hash: 4226731 | Ticket: link

To rerun the Spec Reviewer, comment "baz rerun spec review".

@benglewis benglewis marked this pull request as ready for review March 11, 2026 00:51
…harts-of-metrics-using-plotly

# Conflicts:
#	pyproject.toml
#	uv.lock
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42267314cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_behavior_eval/evaluate.py Outdated
Comment thread llm_behavior_eval/evaluate.py
Comment thread tests/test_evaluate_cli.py
Comment on lines +66 to +77
def _plot_metrics_from_summary(
result_dir: Path,
model_path_or_repo_id: str,
save_plot_to_mlflow: bool,
) -> None:
import pandas as pd

from llm_behavior_eval.evaluation_utils.plotting import draw_radar_chart

model_slug = model_path_or_repo_id.split("/")[-1]
summary_file_path = result_dir / model_slug / "summary_full.csv"
if not summary_file_path.exists():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_plot_metrics_from_summary derives the slug from model_path_or_repo_id and therefore misses summary_full.csv when EvaluationConfig writes to results_dir/model_slug resolved by get_model_slug(); should we derive the slug from the resolved model_output_dir by reusing EvaluationConfig/BaseEvaluator.get_model_slug() or by passing model_output_dir?

Finding type: Logical Bugs | Severity: 🔴 High


Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluate.py around lines 66-85, the _plot_metrics_from_summary
function builds the path to summary_full.csv using model_path_or_repo_id.split('/')[-1],
which misses cases where EvaluationConfig writes summaries to a different model_slug
(for example when --model-output-dir is provided or a LoRA suffix is appended). Change
the function to accept or compute the actual model_output_dir/model_slug used by
EvaluationConfig: either (a) accept a model_output_dir or model_slug parameter passed
from the caller (the resolved output dir), or (b) reuse
EvaluationConfig/BaseEvaluator.get_model_slug() to derive the slug from the resolved
output directory (including LoRA suffix). Update path resolution so summary_full.csv is
read from the exact summary directory, and add a fallback/error log if the file is not
found. Ensure callers are updated to pass the resolved model_output_dir if needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex Please fix this

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Testing

  • uv run ruff check llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py
  • uv run basedpyright llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py
  • uv run pytest tests/test_evaluate_cli.py -k "main_calls_plotting_when_enabled or main_keeps_success_when_plotting_fails or plot_metrics_from_summary_excludes_nan_values_per_metric or plot_metrics_from_summary_skips_when_dataset_column_missing"

Committed as a3dd10b, and I also created the follow-up PR via make_pr.

View task →

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants