AORTA-23 : Phase 8 : Added a summary in generated HTML #105
prosenjitdhole wants to merge 5 commits into prosenj_cli_hq_eval_report_phase_7 from …
Conversation
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
```python
# Check latency trend (P99 at peak vs P99 at lowest stream count)
sorted_results = sorted(data.results, key=lambda r: r.stream_count)
if sorted_results:
    p99_at_1 = sorted_results[0].latency.p99
    p99_at_peak = data.results[peak_idx].latency.p99 if peak_idx < len(data.results) else p99_at_1
```
The comment says latency_ratio is "P99 at peak / P99 at 1 stream", but the code uses the lowest available stream count (sorted_results[0]) rather than explicitly using stream_count==1. Either update the wording/variable naming to reflect "lowest stream count" or explicitly select the 1-stream result when present to avoid misleading summaries.
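A minimal sketch of the second option, reusing the `data.results` fields from the diff above (the `baseline` and `p99_baseline` names are illustrative):

```python
# Prefer an explicit 1-stream run as the latency baseline; otherwise fall
# back to the lowest stream count actually present.
sorted_results = sorted(data.results, key=lambda r: r.stream_count)
if sorted_results:
    baseline = next(
        (r for r in sorted_results if r.stream_count == 1),
        sorted_results[0],  # fallback: lowest available stream count
    )
    p99_baseline = baseline.latency.p99
```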
```python
return f"""
<div class="verdict-box verdict-{summary.verdict_class}">
    <h2>📊 Comparison Summary: {baseline_label} → {test_label}</h2>
    <div class="verdict-headline {summary.verdict_class}">{summary.verdict}</div>
    {stats_html}
    {summary_table}
</div>
```
The new verdict HTML interpolates baseline_label, test_label, and workload names (summary.top_improvement[0] / summary.top_regression[0]) directly into HTML. If these values can contain <, &, quotes, etc. (e.g., derived from filenames or user-provided labels), the generated report becomes vulnerable to HTML injection. Escape dynamic text with html.escape(...) (or a centralized escaping helper) before embedding it into the HTML string.
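A minimal sketch of that escaping, built on the standard-library `html.escape` (the `esc` helper name and the sample label are illustrative):

```python
import html

def esc(value: object) -> str:
    # Escape <, >, &, and quotes so labels cannot break out of the markup.
    return html.escape(str(value), quote=True)

baseline_label = "run<script>alert(1)</script>"  # e.g. a hostile filename-derived label
header = f"<h2>📊 Comparison Summary: {esc(baseline_label)}</h2>"
# -> <h2>📊 Comparison Summary: run&lt;script&gt;alert(1)&lt;/script&gt;</h2>
```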
@copilot open a new pull request to apply changes based on this feedback
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
```python
def __post_init__(self) -> None:
    """Normalize status fields based on verdicts if not explicitly set."""
    valid_statuses = {"good", "warning", "poor"}

    if self.scaling_status not in valid_statuses:
```
SweepSummary.__post_init__() is intended to infer scaling_status/latency_status from the verdict text when they aren't explicitly provided, but the defaults are set to "good" (which is a valid status). As a result the inference branches never run and get_overall_status() will always return "good" unless a caller manually overrides the status fields. Consider defaulting these fields to None/"" (or always inferring from verdicts and only overriding when an explicit non-empty status is passed).
@copilot open a new pull request to apply changes based on this feedback
```python
peak_idx = throughputs.index(max(throughputs))
peak_throughput = throughputs[peak_idx]
peak_streams = stream_counts[peak_idx]
peak_efficiency = efficiencies[peak_idx] if peak_idx < len(efficiencies) else 0
```
peak_idx is derived from the throughputs list (which may come from data.analysis), but later logic uses that index against stream_counts and data.results (e.g., to derive peak_streams and p99_at_peak). If the analysis arrays are missing/misaligned or ordered differently than data.results, this can raise IndexError or compute the wrong peak latency. Safer approach: derive peak_stream_count from stream_counts[peak_idx] (after validating lengths) and then fetch the corresponding run via data.get_result_by_stream_count(...) (or compute peak from data.results directly).
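A sketch of the safer lookup, assuming the `get_result_by_stream_count` accessor mentioned above exists on `SweepData` and returns `None` when no run matches:

```python
# Only index across arrays whose lengths have been validated to match.
if throughputs and len(throughputs) == len(stream_counts):
    peak_idx = throughputs.index(max(throughputs))
    peak_stream_count = stream_counts[peak_idx]
    # Resolve the actual run by its stream count instead of reusing peak_idx
    # against data.results, which may be ordered differently.
    peak_run = data.get_result_by_stream_count(peak_stream_count)
    p99_at_peak = peak_run.latency.p99 if peak_run else None
```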
```python
throughputs = [r.throughput for r in data.results]
stream_counts = [r.stream_count for r in data.results]
# Compute efficiency (actual / ideal where ideal = single_stream * n)
base_throughput = throughputs[0] if throughputs else 1
efficiencies = [
    t / (base_throughput * s) if base_throughput > 0 and s > 0 else 0
    for t, s in zip(throughputs, stream_counts)
]
```
When computing efficiencies from data.results, base_throughput = throughputs[0] assumes the results are already ordered by increasing stream count and that the first entry represents the baseline. SweepData.from_dict() does not sort results, so efficiency (and downstream optimal-stream selection) can be incorrect if JSON ordering changes. Consider sorting by stream_count (or explicitly selecting the lowest stream count) before choosing the baseline throughput.
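A sketch of that ordering fix, sorting once before any of the arrays are derived:

```python
# Sort so index 0 is genuinely the lowest-stream-count (baseline) run,
# regardless of the ordering in the source JSON.
ordered = sorted(data.results, key=lambda r: r.stream_count)
throughputs = [r.throughput for r in ordered]
stream_counts = [r.stream_count for r in ordered]
base_throughput = throughputs[0] if throughputs else 1
```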
```python
optimal_idx = 0
for i, (eff, tput, sc) in enumerate(zip(efficiencies, throughputs, stream_counts)):
    if eff >= 0.70 and tput >= 0.80 * peak_throughput:
        optimal_idx = i
optimal_streams = stream_counts[optimal_idx] if stream_counts else peak_streams
```
optimal_idx is updated based on iteration order over zip(efficiencies, throughputs, stream_counts). If those arrays are not sorted by stream count, the chosen optimal_streams may not be the highest stream count meeting the criteria (as the comment states). Consider selecting candidates that meet the thresholds and then choosing max(stream_count) (or explicitly iterating in sorted stream-count order).
Suggested change:
```python
optimal_idx: Optional[int] = None
optimal_streams = peak_streams
for i, (eff, tput, sc) in enumerate(zip(efficiencies, throughputs, stream_counts)):
    if eff >= 0.70 and tput >= 0.80 * peak_throughput:
        # Choose the qualifying configuration with the highest stream count,
        # independent of the iteration order of the input arrays.
        if optimal_idx is None or sc > optimal_streams:
            optimal_idx = i
            optimal_streams = sc
# If no configuration met the thresholds, fall back to peak_streams
if optimal_idx is None:
    optimal_streams = peak_streams
```
```python
num_improved = len(workloads_with_improvement - workloads_with_regression)
num_degraded = len(workloads_with_regression - workloads_with_improvement)
# Workloads with both are counted as mixed - we'll count them as degraded for simplicity
num_mixed = len(workloads_with_regression & workloads_with_improvement)
num_degraded += num_mixed
num_unchanged = total - num_improved - num_degraded
```
Mixed workloads (those appearing in both regressions and improvements) are currently added into num_degraded, which makes the displayed counts (and the overall verdict thresholds) misleading: a workload can have both latency regressions and throughput improvements, but it will be reported as "degraded" only. Consider tracking num_mixed separately (and optionally exposing it in ComparisonSummary/HTML) so the summary matches the underlying data.
Suggested change:
```python
# Disjoint categories:
#   - improved: workloads with improvements but no regressions
#   - degraded: workloads with regressions but no improvements
#   - mixed: workloads that appear in both sets
num_improved = len(workloads_with_improvement - workloads_with_regression)
num_degraded = len(workloads_with_regression - workloads_with_improvement)
num_mixed = len(workloads_with_regression & workloads_with_improvement)
num_unchanged = total - num_improved - num_degraded - num_mixed
```
@prosenjitdhole I've opened a new pull request, #121, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
```python
# Determine scaling verdict
if peak_efficiency >= 0.80:
    scaling_verdict = ("✓ Excellent", "Near-linear scaling")
elif peak_efficiency >= 0.60:
    scaling_verdict = ("✓ Good", "Good scaling with some overhead")
```
The worst scaling verdict text is labeled "⚠ Poor", but the status-mapping logic treats "⚠" as a warning (and "✗" as poor). If you want this case to render as a poor/failed status, the verdict marker should align with the mapping (e.g., use "✗"), or set scaling_status explicitly for this branch.
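One way to keep the marker and the mapping aligned is to set the status explicitly in the same branch that builds the verdict text. A sketch, extending the branches quoted above (the `else` branch's description text is an assumption):

```python
# Pair each verdict with its status so the marker and CSS class cannot drift.
if peak_efficiency >= 0.80:
    scaling_verdict, scaling_status = ("✓ Excellent", "Near-linear scaling"), "good"
elif peak_efficiency >= 0.60:
    scaling_verdict, scaling_status = ("✓ Good", "Good scaling with some overhead"), "good"
else:
    scaling_verdict, scaling_status = ("✗ Poor", "Significant scaling overhead"), "poor"
```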
```python
return f"""
<div class="verdict-box verdict-{status_class}">
    <h2>📊 Analysis Summary</h2>
    <table class="verdict-table">
        <tr>
            <td>Peak Performance</td>
            <td class="good">✓ {summary.peak_throughput:.0f} {summary.throughput_unit} at {summary.peak_streams} streams</td>
```
The Peak Performance row is always rendered with a green "good" class and a "✓" marker. When generate_sweep_summary returns an "Unknown/No data" summary (e.g., empty results), this will still display as a success ("✓ 0 ops/sec at 0 streams"). Consider deriving this row’s marker/class from the summary status (or suppressing the verdict box when peak_streams == 0).
Suggested change:
```python
# Determine marker and CSS class for the Peak Performance row.
# This avoids showing a green "✓" when there is no data or a non-good status.
if summary.peak_streams == 0:
    peak_marker = "⚠"
    peak_class = "warning"
else:
    if status_class == "good":
        peak_marker = "✓"
        peak_class = "good"
    elif status_class == "warning":
        peak_marker = "⚠"
        peak_class = "warning"
    else:
        peak_marker = "✗"
        peak_class = "poor"
return f"""
<div class="verdict-box verdict-{status_class}">
    <h2>📊 Analysis Summary</h2>
    <table class="verdict-table">
        <tr>
            <td>Peak Performance</td>
            <td class="{peak_class}">{peak_marker} {summary.peak_throughput:.0f} {summary.throughput_unit} at {summary.peak_streams} streams</td>
```
```python
# Normalized status fields used for logic/CSS: "good", "warning", "poor"
scaling_status: str = "good"
latency_status: str = "good"

def __post_init__(self) -> None:
    """Normalize status fields based on verdicts if not explicitly set."""
    valid_statuses = {"good", "warning", "poor"}

    if self.scaling_status not in valid_statuses:
        self.scaling_status = self._infer_status_from_verdict_text(
            self.scaling_verdict[0]
```
SweepSummary.__post_init__ never infers scaling_status/latency_status because both fields default to the valid value "good". As a result get_overall_status() will always return "good" and the verdict box will always render as green even when verdicts indicate warnings/regressions. Consider defaulting these fields to None/"auto" (or field(init=False)) and inferring unconditionally unless explicitly provided.
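A sketch of the `Optional` default approach, reusing the `_infer_status_from_verdict_text` name from the diff; the inference body here is an assumption based on the ✓/⚠/✗ marker convention used in this thread, and other `SweepSummary` fields are omitted:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SweepSummary:
    scaling_verdict: Tuple[str, str] = ("", "")
    latency_verdict: Tuple[str, str] = ("", "")
    # None means "not provided", so inference below runs unless a caller
    # explicitly passes a valid status.
    scaling_status: Optional[str] = None
    latency_status: Optional[str] = None

    def __post_init__(self) -> None:
        valid_statuses = {"good", "warning", "poor"}
        if self.scaling_status not in valid_statuses:
            self.scaling_status = self._infer_status_from_verdict_text(self.scaling_verdict[0])
        if self.latency_status not in valid_statuses:
            self.latency_status = self._infer_status_from_verdict_text(self.latency_verdict[0])

    @staticmethod
    def _infer_status_from_verdict_text(text: str) -> str:
        # Map the leading marker to a status; unknown text degrades to "warning".
        if text.startswith("✓"):
            return "good"
        if text.startswith("✗"):
            return "poor"
        return "warning"
```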
```python
# Compute efficiency (actual / ideal where ideal uses normalized single-stream throughput)
base_throughput = throughputs[0] if throughputs else 1
baseline_streams = stream_counts[0] if stream_counts else 1
if base_throughput > 0 and baseline_streams > 0:
    single_stream_throughput = base_throughput / baseline_streams
else:
    single_stream_throughput = 0
efficiencies = [
    (t / (single_stream_throughput * s)) if single_stream_throughput > 0 and s > 0 else 0
    for t, s in zip(throughputs, stream_counts)
]
```
Efficiency computation uses throughputs[0]/stream_counts[0] as the baseline, but data.results order isn't guaranteed (and may not start at the lowest stream count / 1 stream). This can produce incorrect efficiencies and therefore incorrect scaling/optimal-streams verdicts. Consider selecting the baseline run by the minimum stream count (prefer stream_count==1 if present) and using that throughput to derive single_stream_throughput.
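A short sketch of that baseline selection, preferring an explicit 1-stream run and otherwise taking the minimum stream count:

```python
# Pick the baseline run by stream count rather than by list position.
baseline_run = None
if data.results:
    baseline_run = next(
        (r for r in data.results if r.stream_count == 1),
        min(data.results, key=lambda r: r.stream_count),
    )
if baseline_run and baseline_run.stream_count > 0:
    single_stream_throughput = baseline_run.throughput / baseline_run.stream_count
else:
    single_stream_throughput = 0
```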
@prosenjitdhole I've opened a new pull request, #122, to work on those changes. Once the pull request is ready, I'll request review from you.
No description provided.