AORTA-23 : Phase 8 : Added a summary in generated HTML #105

Open
prosenjitdhole wants to merge 5 commits into prosenj_cli_hq_eval_report_phase_7 from prosenj_cli_hq_eval_report_phase_8

Conversation

@prosenjitdhole
Collaborator

No description provided.

@prosenjitdhole prosenjitdhole changed the base branch from main to prosenj_cli_hq_eval_report_phase_7 February 16, 2026 11:17
@prosenjitdhole prosenjitdhole changed the title from "[WIP] AORTA-23 : Phase 8 : Added a summary in generated HTML" to "AORTA-23 : Phase 8 : Added a summary in generated HTML" Feb 18, 2026
@prosenjitdhole prosenjitdhole marked this pull request as ready for review February 18, 2026 14:18
@amd-vivekag amd-vivekag requested a review from Copilot February 19, 2026 09:55
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.



3 outdated comment threads on src/aorta/report/pipelines/hwqueue_pipeline.py
Comment on lines +142 to +146

```python
# Check latency trend (P99 at peak vs P99 at lowest stream count)
sorted_results = sorted(data.results, key=lambda r: r.stream_count)
if sorted_results:
    p99_at_1 = sorted_results[0].latency.p99
    p99_at_peak = data.results[peak_idx].latency.p99 if peak_idx < len(data.results) else p99_at_1
```

Copilot AI Mar 6, 2026


The comment says latency_ratio is "P99 at peak / P99 at 1 stream", but the code uses the lowest available stream count (sorted_results[0]) rather than explicitly using stream_count==1. Either update the wording/variable naming to reflect "lowest stream count" or explicitly select the 1-stream result when present to avoid misleading summaries.
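To make the summary wording and the code agree, one option is to explicitly prefer the 1-stream run when it exists. A minimal sketch of that idea, using a hypothetical `pick_baseline_p99` helper over simplified `(stream_count, p99)` tuples rather than the PR's actual result objects:

```python
def pick_baseline_p99(results):
    """Prefer the stream_count == 1 run; otherwise fall back to the
    lowest available stream count. Records are (stream_count, p99) tuples."""
    sorted_results = sorted(results, key=lambda r: r[0])
    for sc, p99 in sorted_results:
        if sc == 1:
            return p99
    return sorted_results[0][1]

print(pick_baseline_p99([(4, 9.0), (1, 2.5), (2, 4.0)]))  # 2.5 (explicit 1-stream run)
print(pick_baseline_p99([(4, 9.0), (2, 4.0)]))            # 4.0 (lowest stream count)
```

With this shape, the docstring and the behavior state the same rule, so the "P99 at 1 stream" wording in the report is only used when a 1-stream run actually exists.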

Comment on lines +169 to +175
return f"""
<div class="verdict-box verdict-{summary.verdict_class}">
<h2>📊 Comparison Summary: {baseline_label} → {test_label}</h2>
<div class="verdict-headline {summary.verdict_class}">{summary.verdict}</div>
{stats_html}
{summary_table}
</div>

Copilot AI Mar 6, 2026


The new verdict HTML interpolates baseline_label, test_label, and workload names (summary.top_improvement[0] / summary.top_regression[0]) directly into HTML. If these values can contain <, &, quotes, etc. (e.g., derived from filenames or user-provided labels), the generated report becomes vulnerable to HTML injection. Escape dynamic text with html.escape(...) (or a centralized escaping helper) before embedding it into the HTML string.
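A minimal sketch of the escaping fix, using the standard library's `html.escape` on a simplified headline builder (the function name here is hypothetical; the PR's real rendering code builds a larger template):

```python
import html

def render_verdict_headline(baseline_label: str, test_label: str) -> str:
    # Escape user-derived labels before interpolation so <, >, &, and quotes
    # cannot inject markup into the generated report.
    safe_baseline = html.escape(baseline_label)
    safe_test = html.escape(test_label)
    return f"<h2>📊 Comparison Summary: {safe_baseline} → {safe_test}</h2>"

print(render_verdict_headline("run<script>alert(1)</script>", "v2 & v3"))
```

`html.escape` escapes quotes as well by default (`quote=True`), which also covers labels interpolated into attribute values.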

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

Copilot AI review requested due to automatic review settings March 6, 2026 10:45
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.



Comment thread src/aorta/report/pipelines/hwqueue_pipeline.py Outdated
Comment on lines +49 to +53

```python
def __post_init__(self) -> None:
    """Normalize status fields based on verdicts if not explicitly set."""
    valid_statuses = {"good", "warning", "poor"}

    if self.scaling_status not in valid_statuses:
```

Copilot AI Mar 6, 2026


SweepSummary.__post_init__() is intended to infer scaling_status/latency_status from the verdict text when they aren't explicitly provided, but the defaults are set to "good" (which is a valid status). As a result the inference branches never run and get_overall_status() will always return "good" unless a caller manually overrides the status fields. Consider defaulting these fields to None/"" (or always inferring from verdicts and only overriding when an explicit non-empty status is passed).
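One way to realize the suggested fix is to default the status fields to `None` so that inference always runs unless a caller passes an explicit value. A simplified, hypothetical stand-in for the PR's dataclass (field and class names abbreviated here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SweepSummaryFixed:
    scaling_verdict: str
    # None (rather than the valid value "good") means "infer from the verdict".
    scaling_status: Optional[str] = None

    def __post_init__(self) -> None:
        if self.scaling_status is None:
            if self.scaling_verdict.startswith("✗"):
                self.scaling_status = "poor"
            elif self.scaling_verdict.startswith("⚠"):
                self.scaling_status = "warning"
            else:
                self.scaling_status = "good"

print(SweepSummaryFixed("⚠ Poor scaling").scaling_status)  # warning: inference now runs
print(SweepSummaryFixed("✓ Good", "poor").scaling_status)  # poor: explicit override wins
```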

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

Comment on lines +154 to +157

```python
peak_idx = throughputs.index(max(throughputs))
peak_throughput = throughputs[peak_idx]
peak_streams = stream_counts[peak_idx]
peak_efficiency = efficiencies[peak_idx] if peak_idx < len(efficiencies) else 0
```

Copilot AI Mar 6, 2026


peak_idx is derived from the throughputs list (which may come from data.analysis), but later logic uses that index against stream_counts and data.results (e.g., to derive peak_streams and p99_at_peak). If the analysis arrays are missing/misaligned or ordered differently than data.results, this can raise IndexError or compute the wrong peak latency. Safer approach: derive peak_stream_count from stream_counts[peak_idx] (after validating lengths) and then fetch the corresponding run via data.get_result_by_stream_count(...) (or compute peak from data.results directly).
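A sketch of the safer lookup the comment describes: validate that the analysis arrays are aligned, then fetch the peak run by its stream count instead of reusing the index. The records here are simplified dicts, and `peak_result` is a hypothetical helper, not the PR's actual API:

```python
def peak_result(throughputs, stream_counts, results):
    # Reject misaligned analysis arrays up front instead of risking IndexError later.
    if len(throughputs) != len(stream_counts):
        raise ValueError("analysis arrays are misaligned")
    peak_idx = throughputs.index(max(throughputs))
    peak_streams = stream_counts[peak_idx]
    # Look the run up by stream count instead of reusing peak_idx against
    # a results list that may be ordered differently.
    return next(r for r in results if r["stream_count"] == peak_streams)

# results ordered differently from the analysis arrays on purpose:
results = [{"stream_count": 8, "p99": 12.0}, {"stream_count": 1, "p99": 2.0}]
print(peak_result([100.0, 640.0], [1, 8], results)["p99"])  # 12.0, the true peak run
```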

Comment on lines +131 to +138

```python
throughputs = [r.throughput for r in data.results]
stream_counts = [r.stream_count for r in data.results]
# Compute efficiency (actual / ideal where ideal = single_stream * n)
base_throughput = throughputs[0] if throughputs else 1
efficiencies = [
    t / (base_throughput * s) if base_throughput > 0 and s > 0 else 0
    for t, s in zip(throughputs, stream_counts)
]
```

Copilot AI Mar 6, 2026


When computing efficiencies from data.results, base_throughput = throughputs[0] assumes the results are already ordered by increasing stream count and that the first entry represents the baseline. SweepData.from_dict() does not sort results, so efficiency (and downstream optimal-stream selection) can be incorrect if JSON ordering changes. Consider sorting by stream_count (or explicitly selecting the lowest stream count) before choosing the baseline throughput.
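The sorting fix can be sketched as follows, with results reduced to hypothetical `(stream_count, throughput)` tuples:

```python
def efficiencies_from(results):
    # Sort by stream count so the baseline is genuinely the lowest-stream run,
    # regardless of the ordering the JSON happened to use.
    ordered = sorted(results, key=lambda r: r[0])
    base_sc, base_tput = ordered[0]
    single = base_tput / base_sc if base_sc > 0 else 0
    return [
        t / (single * s) if single > 0 and s > 0 else 0
        for s, t in ordered
    ]

# Unsorted input: the baseline must be the 1-stream run (100), not the first entry.
print(efficiencies_from([(4, 320.0), (1, 100.0), (2, 190.0)]))  # [1.0, 0.95, 0.8]
```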

Comment on lines +163 to +168

```python
optimal_idx = 0
for i, (eff, tput, sc) in enumerate(zip(efficiencies, throughputs, stream_counts)):
    if eff >= 0.70 and tput >= 0.80 * peak_throughput:
        optimal_idx = i
optimal_streams = stream_counts[optimal_idx] if stream_counts else peak_streams
```

Copilot AI Mar 6, 2026


optimal_idx is updated based on iteration order over zip(efficiencies, throughputs, stream_counts). If those arrays are not sorted by stream count, the chosen optimal_streams may not be the highest stream count meeting the criteria (as the comment states). Consider selecting candidates that meet the thresholds and then choosing max(stream_count) (or explicitly iterating in sorted stream-count order).

Suggested change:

```python
optimal_idx: Optional[int] = None
optimal_streams = peak_streams
for i, (eff, tput, sc) in enumerate(zip(efficiencies, throughputs, stream_counts)):
    if eff >= 0.70 and tput >= 0.80 * peak_throughput:
        # Choose the qualifying configuration with the highest stream count,
        # independent of the iteration order of the input arrays.
        if optimal_idx is None or sc > optimal_streams:
            optimal_idx = i
            optimal_streams = sc
# If no configuration met the thresholds, fall back to peak_streams
if optimal_idx is None:
    optimal_streams = peak_streams
```

Comment on lines +236 to +241

```python
num_improved = len(workloads_with_improvement - workloads_with_regression)
num_degraded = len(workloads_with_regression - workloads_with_improvement)
# Workloads with both are counted as mixed - we'll count them as degraded for simplicity
num_mixed = len(workloads_with_regression & workloads_with_improvement)
num_degraded += num_mixed
num_unchanged = total - num_improved - num_degraded
```

Copilot AI Mar 6, 2026


Mixed workloads (those appearing in both regressions and improvements) are currently added into num_degraded, which makes the displayed counts (and the overall verdict thresholds) misleading: a workload can have both latency regressions and throughput improvements, but it will be reported as "degraded" only. Consider tracking num_mixed separately (and optionally exposing it in ComparisonSummary/HTML) so the summary matches the underlying data.

Suggested change:

```python
# Disjoint categories:
#   - improved: workloads with improvements but no regressions
#   - degraded: workloads with regressions but no improvements
#   - mixed: workloads that appear in both sets
num_improved = len(workloads_with_improvement - workloads_with_regression)
num_degraded = len(workloads_with_regression - workloads_with_improvement)
num_mixed = len(workloads_with_regression & workloads_with_improvement)
num_unchanged = total - num_improved - num_degraded - num_mixed
```
Copilot AI review requested due to automatic review settings March 6, 2026 10:52
Contributor

Copilot AI commented Mar 6, 2026

@prosenjitdhole I've opened a new pull request, #121, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.



Comment on lines +188 to +192

```python
# Determine scaling verdict
if peak_efficiency >= 0.80:
    scaling_verdict = ("✓ Excellent", "Near-linear scaling")
elif peak_efficiency >= 0.60:
    scaling_verdict = ("✓ Good", "Good scaling with some overhead")
```

Copilot AI Mar 6, 2026


The worst scaling verdict text is labeled "⚠ Poor", but the status-mapping logic treats "⚠" as a warning (and "✗" as poor). If you want this case to render as a poor/failed status, the verdict marker should align with the mapping (e.g., use "✗"), or set scaling_status explicitly for this branch.
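The mismatch can be seen with a small stand-in for the status-mapping logic the comment describes (the mapping itself is restated from the review, not copied from the PR):

```python
def status_from_marker(verdict: str) -> str:
    # Mapping described in the review: "✗" maps to poor, "⚠" to warning, else good.
    if verdict.startswith("✗"):
        return "poor"
    if verdict.startswith("⚠"):
        return "warning"
    return "good"

# With the worst branch labeled "✗ Poor" instead of "⚠ Poor",
# the inferred status matches the intended severity.
print(status_from_marker("✗ Poor"))      # poor
print(status_from_marker("⚠ Marginal"))  # warning
```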

Comment on lines +94 to +100

```python
return f"""
<div class="verdict-box verdict-{status_class}">
    <h2>📊 Analysis Summary</h2>
    <table class="verdict-table">
        <tr>
            <td>Peak Performance</td>
            <td class="good">✓ {summary.peak_throughput:.0f} {summary.throughput_unit} at {summary.peak_streams} streams</td>
```

Copilot AI Mar 6, 2026


The Peak Performance row is always rendered with a green "good" class and a "✓" marker. When generate_sweep_summary returns an "Unknown/No data" summary (e.g., empty results), this will still display as a success ("✓ 0 ops/sec at 0 streams"). Consider deriving this row’s marker/class from the summary status (or suppressing the verdict box when peak_streams == 0).

Suggested change:

```python
# Determine marker and CSS class for the Peak Performance row.
# This avoids showing a green "✓" when there is no data or a non-good status.
if summary.peak_streams == 0:
    peak_marker = "⚠"
    peak_class = "warning"
else:
    if status_class == "good":
        peak_marker = "✓"
        peak_class = "good"
    elif status_class == "warning":
        peak_marker = "⚠"
        peak_class = "warning"
    else:
        peak_marker = "✗"
        peak_class = "poor"
return f"""
<div class="verdict-box verdict-{status_class}">
    <h2>📊 Analysis Summary</h2>
    <table class="verdict-table">
        <tr>
            <td>Peak Performance</td>
            <td class="{peak_class}">{peak_marker} {summary.peak_throughput:.0f} {summary.throughput_unit} at {summary.peak_streams} streams</td>
```

Comment on lines +45 to +55

```python
# Normalized status fields used for logic/CSS: "good", "warning", "poor"
scaling_status: str = "good"
latency_status: str = "good"

def __post_init__(self) -> None:
    """Normalize status fields based on verdicts if not explicitly set."""
    valid_statuses = {"good", "warning", "poor"}

    if self.scaling_status not in valid_statuses:
        self.scaling_status = self._infer_status_from_verdict_text(
            self.scaling_verdict[0]
```

Copilot AI Mar 6, 2026


SweepSummary.__post_init__ never infers scaling_status/latency_status because both fields default to the valid value "good". As a result get_overall_status() will always return "good" and the verdict box will always render as green even when verdicts indicate warnings/regressions. Consider defaulting these fields to None/"auto" (or field(init=False)) and inferring unconditionally unless explicitly provided.
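The `field(init=False)` variant mentioned in the comment would make the status always derived rather than a constructor default. A simplified, hypothetical stand-in:

```python
from dataclasses import dataclass, field

@dataclass
class SweepSummaryAlt:
    scaling_verdict: str
    # init=False: the status is always computed in __post_init__, so it can
    # never sit at a stale "good" default supplied by the constructor.
    scaling_status: str = field(init=False)

    def __post_init__(self) -> None:
        marker = self.scaling_verdict[:1]
        self.scaling_status = {"✗": "poor", "⚠": "warning"}.get(marker, "good")

print(SweepSummaryAlt("✗ Regression").scaling_status)  # poor
```

The trade-off versus an `Optional[str] = None` default is that callers can no longer override the status explicitly; which behavior is wanted depends on how the PR's callers use the field.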

Comment on lines +133 to +143

```python
# Compute efficiency (actual / ideal where ideal uses normalized single-stream throughput)
base_throughput = throughputs[0] if throughputs else 1
baseline_streams = stream_counts[0] if stream_counts else 1
if base_throughput > 0 and baseline_streams > 0:
    single_stream_throughput = base_throughput / baseline_streams
else:
    single_stream_throughput = 0
efficiencies = [
    (t / (single_stream_throughput * s)) if single_stream_throughput > 0 and s > 0 else 0
    for t, s in zip(throughputs, stream_counts)
]
```

Copilot AI Mar 6, 2026


Efficiency computation uses throughputs[0]/stream_counts[0] as the baseline, but data.results order isn't guaranteed (and may not start at the lowest stream count / 1 stream). This can produce incorrect efficiencies and therefore incorrect scaling/optimal-streams verdicts. Consider selecting the baseline run by the minimum stream count (prefer stream_count==1 if present) and using that throughput to derive single_stream_throughput.
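A sketch of selecting the baseline by minimum stream count, preferring an explicit 1-stream run when present (a hypothetical helper over simplified `(stream_count, throughput)` tuples):

```python
def single_stream_throughput(results):
    # Prefer an explicit 1-stream run when present.
    by_sc = {sc: t for sc, t in results}
    if 1 in by_sc:
        return by_sc[1]
    # Otherwise normalize the lowest-stream run down to a per-stream figure.
    sc, t = min(results, key=lambda r: r[0])
    return t / sc if sc > 0 else 0

print(single_stream_throughput([(4, 320.0), (2, 190.0)]))  # 95.0, normalized from the 2-stream run
print(single_stream_throughput([(2, 190.0), (1, 100.0)]))  # 100.0, the explicit 1-stream run
```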

Contributor

Copilot AI commented Mar 6, 2026

@prosenjitdhole I've opened a new pull request, #122, to work on those changes. Once the pull request is ready, I'll request review from you.
