Regenerate benchmarks/results.json during Pages build so the benchmarks page shows real data #158

@mrjf

Problem

https://silver-funicular-2qkwekw.pages.github.io/benchmarks.html renders "📊 No benchmark data available yet" even though benchmarks/ contains 600+ TS/Python benchmark pairs.

Root cause: benchmarks/results.json on main has been the 40-byte stub { "benchmarks": [], "timestamp": null } since 2026-04-12. .github/workflows/pages.yml copies that stub into playground/benchmarks/results.json, and benchmarks.html renders the "no data" message when the benchmarks array is empty. Nothing in CI runs benchmarks/run_benchmarks.sh to populate the file.
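The empty-array condition described above can be checked mechanically, for example as a sanity check before publishing. This is a minimal sketch, assuming only the two keys shown in the stub; the function name is_stub is hypothetical and not part of the repo, and treating a missing "benchmarks" key the same as an empty array is an assumption.

```python
import json


def is_stub(payload: dict) -> bool:
    """Return True when the payload carries no benchmark entries.

    Assumption: a missing "benchmarks" key is treated like an empty array.
    """
    return not payload.get("benchmarks")


# The stub currently on main, as quoted in this issue.
stub = json.loads('{ "benchmarks": [], "timestamp": null }')
print(is_stub(stub))  # True
```

A check like this could fail the build early if the benchmark step ever produces an empty file again.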

Why PR #154 didn't fix it

PR #154 moved benchmark execution into the autoloop program's Evaluation step, so every iteration was supposed to regenerate results.json and commit it. That fails in practice because the autoloop agent sandbox doesn't have bun, and the curl | bash fallback doesn't produce a usable binary there. Every post-#154 iteration therefore evaluates to metric = 0 and is rejected, so nothing is committed. The perf-comparison autoloop is now stuck and can't ratchet at all (see run 24696210026 for the agent's own diagnosis).

Fix

Regenerate benchmarks/results.json during the Pages build — not in the autoloop. Pages already triggers on push to main, so any benchmark change auto-publishes fresh data. No new workflow, no commit-back-to-main, no autoloop-agent plumbing needed.

Change 1 — .github/workflows/pages.yml

Move the Python setup ahead of the copy step, run the benchmark suite, then copy the regenerated results.json into the playground artifact. Drop the if [ -f ... ] guard, since the file always exists once the benchmark step has run.

       - name: Install dependencies
         run: bun install

       - name: Build library for browser
         run: bun build ./src/index.ts --outdir ./playground/dist --target browser --minify

       - name: Bundle TypeScript compiler for offline playground
         run: cp node_modules/typescript/lib/typescript.js ./playground/dist/typescript.js

+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install Python dependencies
+        run: pip install pandas numpy
+
+      - name: Run benchmarks
+        run: bash benchmarks/run_benchmarks.sh
+
       - name: Copy benchmark results to playground
         run: |
           mkdir -p ./playground/benchmarks
-          if [ -f benchmarks/results.json ]; then
-            cp benchmarks/results.json ./playground/benchmarks/results.json
-          fi
+          cp benchmarks/results.json ./playground/benchmarks/results.json

-      - name: Setup Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.12"
-
-      - name: Install Python dependencies
-        run: pip install pandas numpy
-
       - name: Validate Python playground examples
         run: python scripts/validate-python-examples.py playground/

Change 2 — revert PR #154's evaluation changes in .autoloop/programs/perf-comparison/program.md

Restore the pre-#154 metric (file-count based) so the autoloop can ratchet again. Execution belongs in the Pages workflow, not the agent sandbox. Keep the per-iteration checklist changes from #154 that are orthogonal (e.g., any prompt clarifications); just revert the Evaluation section.
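For illustration only, a file-count metric needs nothing beyond the filesystem, which is why it works inside the agent sandbox. The sketch below assumes a benchmark pair is a .ts file and a .py file sharing the same relative path and stem; that layout and the name count_benchmark_pairs are hypothetical. The actual pre-#154 metric is whatever program.md defined and may differ.

```python
import tempfile
from pathlib import Path


def count_benchmark_pairs(root: Path) -> int:
    """Count stems that have both a .ts and a .py file under root.

    Assumption: paired files share the same relative path minus the suffix.
    """
    ts_keys = {p.relative_to(root).with_suffix("") for p in root.rglob("*.ts")}
    py_keys = {p.relative_to(root).with_suffix("") for p in root.rglob("*.py")}
    return len(ts_keys & py_keys)


# Tiny demo in a throwaway directory: one complete pair, one orphan .ts file.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for name in ("sort.ts", "sort.py", "merge.ts"):
        (demo_root / name).touch()
    demo_count = count_benchmark_pairs(demo_root)

print(demo_count)  # 1
```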

Trade-offs

  • Pages builds get slower: roughly 600 pairs × (warmup + measured iterations) is plausibly 10–30 min per build. The build runs only on push to main, so it doesn't block PRs. If it becomes painful, split it with a matrix later.
  • Broken benchmarks surface late (on merge to main) instead of in-PR. Acceptable because the autoloop still validates that benchmark files are syntactically valid Python/TS before landing.
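The 10–30 min estimate in the first trade-off corresponds to about 1 to 3 seconds of total wall time per pair. A quick back-of-the-envelope check follows; the per-pair timings are assumptions chosen to match that range, not measurements, and the helper name is hypothetical.

```python
def estimated_build_minutes(pairs: int, seconds_per_pair: float) -> float:
    """Total sequential benchmark time in minutes.

    seconds_per_pair is an assumed figure covering both runtimes,
    including warmup and measured iterations.
    """
    return pairs * seconds_per_pair / 60


low = estimated_build_minutes(600, 1.0)
high = estimated_build_minutes(600, 3.0)
print(low, high)  # 10.0 30.0
```

If the measured per-pair time turns out to be much higher, the matrix split mentioned above becomes more attractive.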

Acceptance
