LLM-72: Add radar chart plot generation support with Plotly#105
LLM-72: Add radar chart plot generation support with Plotly#105benglewis wants to merge 14 commits into
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
…rability scanning with `--resolution lowest-direct` works nicely
df5be71 to
3618c43
Compare
# Conflicts: # AGENTS.md # pyproject.toml # uv.lock
Spec Reviewer Report 📬1 / 2 requirements met for ticket:
1 unmet requirement
1 met requirement
Used resources: |
…harts-of-metrics-using-plotly # Conflicts: # pyproject.toml # uv.lock
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42267314cc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| def _plot_metrics_from_summary( | ||
| result_dir: Path, | ||
| model_path_or_repo_id: str, | ||
| save_plot_to_mlflow: bool, | ||
| ) -> None: | ||
| import pandas as pd | ||
|
|
||
| from llm_behavior_eval.evaluation_utils.plotting import draw_radar_chart | ||
|
|
||
| model_slug = model_path_or_repo_id.split("/")[-1] | ||
| summary_file_path = result_dir / model_slug / "summary_full.csv" | ||
| if not summary_file_path.exists(): |
There was a problem hiding this comment.
_plot_metrics_from_summary derives the slug from model_path_or_repo_id and therefore misses summary_full.csv when EvaluationConfig writes to results_dir/model_slug resolved by get_model_slug(); should we derive the slug from the resolved model_output_dir by reusing EvaluationConfig/BaseEvaluator.get_model_slug() or by passing model_output_dir?
Finding type: Logical Bugs | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
llm_behavior_eval/evaluate.py around lines 66-85, the _plot_metrics_from_summary
function builds the path to summary_full.csv using model_path_or_repo_id.split('/')[-1],
which misses cases where EvaluationConfig writes summaries to a different model_slug
(for example when --model-output-dir is provided or a LoRA suffix is appended). Change
the function to accept or compute the actual model_output_dir/model_slug used by
EvaluationConfig: either (a) accept a model_output_dir or model_slug parameter passed
from the caller (the resolved output dir), or (b) reuse
EvaluationConfig/BaseEvaluator.get_model_slug() to derive the slug from the resolved
output directory (including LoRA suffix). Update path resolution so summary_full.csv is
read from the exact summary directory, and add a fallback/error log if the file is not
found. Ensure callers are updated to pass the resolved model_output_dir if needed.
There was a problem hiding this comment.
Summary
- Fixed
_plot_metrics_from_summaryto consume a resolvedmodel_slugdirectly, so it readssummary_full.csvfrom the actual model output folder instead of re-deriving frommodel_path_or_repo_id. llm_behavior_eval/evaluate.pyL66-L76 - Updated the plotting call in
main()to passevaluator.get_model_slug(), aligning plot artifact lookup with evaluator output path logic (including custommodel_output_dir/ LoRA-derived slugs). llm_behavior_eval/evaluate.pyL679-L685 - Updated plotting tests to the new function signature and added
get_model_slug()to the stub evaluator so the plotting path call is validated correctly in CLI tests. tests/test_evaluate_cli.pyL24-L27 tests/test_evaluate_cli.pyL528-L549 tests/test_evaluate_cli.pyL632-L636 tests/test_evaluate_cli.pyL707-L711
Testing
- ✅
uv run ruff check llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py - ✅
uv run basedpyright llm_behavior_eval/evaluate.py tests/test_evaluate_cli.py - ✅
uv run pytest tests/test_evaluate_cli.py -k "main_calls_plotting_when_enabled or main_keeps_success_when_plotting_fails or plot_metrics_from_summary_excludes_nan_values_per_metric or plot_metrics_from_summary_skips_when_dataset_column_missing"
Committed as a3dd10b, and I also created the follow-up PR via make_pr.
User description
Codex Task
Note
Medium Risk
Adds new optional runtime behavior and dependencies (Plotly/MLflow artifact logging) and changes CI install/resolution behavior, which could affect environments and dependency resolution despite limited impact on the core evaluation path.
Overview
Adds optional radar-chart generation to the
llm-behavior-evalCLI via--plot(and--save-plot-to-mlflowwhen--use-mlflowis enabled), generating Plotly HTML artifacts fromsummary_full.csvafter a successful evaluation and tolerating plotting failures without failing the run.Introduces a new
plotlyextra andevaluation_utils/plotting.pyhelper (draw_radar_chart) plus unit tests for plotting and new CLI validation/behavior tests. Updates dependency constraints inpyproject.tomland adjusts GitHub Actions to install thedevdependency group (and vulnerability scanning to sync all groups/extras with lowest-direct resolution).Written by Cursor Bugbot for commit 3618c43. This will update automatically on new commits. Configure here.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Enable optional radar chart plotting from
evaluate.main, validating Plotly dependencies, summarizing results, and optionally logging artifacts so the evaluation CLI stays resilient to plotting failures. Add reusable plotting helpers, unit tests, and dependency guidance so generated radar charts can be rendered and stored across environments.evaluate.main, while documenting the flow with new regression tests and a dedicatedplotting.draw_radar_chartutility that renders Plotly radar visuals and supports MLflow artifact uploads.Modified files (4)
Latest Contributors(2)
Modified files (1)
Latest Contributors(2)
plotlyextra and broader dev toolchain are installed consistently across automation, documentation, and vulnerability scans.Modified files (6)
Latest Contributors(2)