Problem
https://silver-funicular-2qkwekw.pages.github.io/benchmarks.html renders "📊 No benchmark data available yet" even though benchmarks/ contains 600+ TS/Python benchmark pairs.
Root cause: benchmarks/results.json on main has been the 40-byte stub { "benchmarks": [], "timestamp": null } since 2026-04-12. .github/workflows/pages.yml copies that stub into playground/benchmarks/results.json, and benchmarks.html renders the "no data" message when the benchmarks array is empty. Nothing in CI runs benchmarks/run_benchmarks.sh to populate the file.
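The empty-array condition described above can be checked locally with a few lines of Python. This is a minimal sketch, assuming only the two keys visible in the stub (benchmarks and timestamp); the helper name has_benchmark_data is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def has_benchmark_data(path):
    """Return True if the results file holds at least one benchmark entry.

    Hypothetical helper: mirrors the condition under which benchmarks.html
    is described as showing the "no data" message (empty benchmarks array).
    """
    file = Path(path)
    if not file.is_file():
        return False
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # A file that is not valid JSON cannot contain usable results.
        return False
    if not isinstance(data, dict):
        return False
    benchmarks = data.get("benchmarks")
    return isinstance(benchmarks, list) and len(benchmarks) > 0
```

Calling has_benchmark_data("benchmarks/results.json") on the current stub would return False, which matches what the page reports.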
Why PR #154 didn't fix it
PR #154 moved benchmark execution into the autoloop program's Evaluation step, so every iteration was supposed to regenerate results.json and commit it. That fails in practice because the autoloop agent sandbox doesn't have bun and the curl | bash fallback doesn't produce a usable binary there. Every post-#154 iteration evaluates to metric = 0 → rejected → nothing commits → the perf-comparison autoloop is now stuck and can't ratchet at all (see run 24696210026 for the agent's own diagnosis).
Fix
Regenerate benchmarks/results.json during the Pages build — not in the autoloop. Pages already triggers on push to main, so any benchmark change auto-publishes fresh data. No new workflow, no commit-back-to-main, no autoloop-agent plumbing needed.
Change 1 — .github/workflows/pages.yml
Move the Python setup earlier, run the benchmark suite, then copy the regenerated results.json into the playground artifact. Drop the if [ -f ... ] guard since the file will always exist post-step.
       - name: Install dependencies
         run: bun install
       - name: Build library for browser
         run: bun build ./src/index.ts --outdir ./playground/dist --target browser --minify
       - name: Bundle TypeScript compiler for offline playground
         run: cp node_modules/typescript/lib/typescript.js ./playground/dist/typescript.js
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install Python dependencies
+        run: pip install pandas numpy
+
+      - name: Run benchmarks
+        run: bash benchmarks/run_benchmarks.sh
+
       - name: Copy benchmark results to playground
         run: |
           mkdir -p ./playground/benchmarks
-          if [ -f benchmarks/results.json ]; then
-            cp benchmarks/results.json ./playground/benchmarks/results.json
-          fi
+          cp benchmarks/results.json ./playground/benchmarks/results.json
-      - name: Setup Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.12"
-
-      - name: Install Python dependencies
-        run: pip install pandas numpy
-
       - name: Validate Python playground examples
         run: python scripts/validate-python-examples.py playground/
Change 2 — revert PR #154's evaluation changes in .autoloop/programs/perf-comparison/program.md
Restore the pre-#154 metric (file-count based) so the autoloop can ratchet again. Execution belongs in the Pages workflow, not the agent sandbox. Keep the per-iteration checklist changes from #154 that are orthogonal (e.g., any prompt clarifications); just revert the Evaluation section.
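For illustration, a file-count metric of the kind described above could look like the sketch below. The exact pre-#154 metric is not reproduced here; the pairing rule (a .ts file and a .py file with the same stem in the same directory) and the function name count_benchmark_pairs are assumptions.

```python
from pathlib import Path


def count_benchmark_pairs(root):
    """Count benchmarks that have both a TypeScript and a Python file.

    Assumption: a pair is <stem>.ts plus <stem>.py in the same directory.
    The real repository layout may differ.
    """
    root_path = Path(root)
    # Strip the extension so matching files collapse to the same path.
    ts_stems = {p.with_suffix("") for p in root_path.rglob("*.ts")}
    py_stems = {p.with_suffix("") for p in root_path.rglob("*.py")}
    return len(ts_stems & py_stems)
```

A metric like this only reads the filesystem, so it needs neither bun nor an actual benchmark run, which is why it should remain computable inside the agent sandbox.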
Trade-offs
- Pages builds get slower. 600 pairs × (warmup + measured iterations): plausibly 10–30 min per build. Runs only on push to main, so it doesn't block PRs. If it becomes painful, split with a matrix later.
- Broken benchmarks surface late (on merge to main) instead of in-PR. Acceptable because the autoloop still validates that benchmark files are syntactically valid Python/TS before landing.
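As a rough sanity check on the build-time estimate above, the quoted range can be turned into a per-pair time budget. The sketch uses only the figures from the text (600 pairs, 10 to 30 minutes); the function name is hypothetical, and real timings depend on the warmup and iteration counts in run_benchmarks.sh, which are not shown here.

```python
def seconds_per_pair(total_minutes, pairs=600):
    """Average wall-clock seconds available per benchmark pair."""
    return total_minutes * 60 / pairs


low = seconds_per_pair(10)   # lower end of the quoted range
high = seconds_per_pair(30)  # upper end of the quoted range
print(f"{low:.1f}s to {high:.1f}s per TS/Python pair")
```

In other words, the estimate works out to roughly 1 to 3 seconds per pair, covering both the TS and the Python run including warmup.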
Acceptance
- benchmarks.html on the Pages site shows a populated benchmarks table after a merge to main.
- [...] main from CI.