githubnext · mrjf · Apr 21, 2026 · Apr 21, 2026 · Apr 21, 2026
diff --git a/.autoloop/programs/perf-comparison/program.md b/.autoloop/programs/perf-comparison/program.md
@@ -24,7 +24,7 @@ This is an open-ended program — it runs continuously, always adding the next b
    - Outputs the same JSON format
 5. **Update `playground/benchmarks.html`** if needed to display the new function's comparison metrics.
 
-The evaluation step (below) runs `benchmarks/run_benchmarks.sh` to execute **every** TS/Python benchmark pair and regenerates `benchmarks/results.json` with the real timing data. That regenerated file is what gets committed on a successful iteration, so when the autoloop branch is merged to `main`, the pages workflow (`.github/workflows/pages.yml`) picks up the real results and `playground/benchmarks.html` renders real comparison data instead of "No benchmark data available yet."
+The autoloop iteration only needs to add the benchmark scripts; it does **not** need to run them or update `benchmarks/results.json`. The pages workflow (`.github/workflows/pages.yml`) executes `benchmarks/run_benchmarks.sh` on every push to `main` and publishes the regenerated `results.json` to the playground site, so real benchmark data appears on `playground/benchmarks.html` once the autoloop branch is merged.
 
 ### Key constraints
 
@@ -50,58 +50,23 @@ Do NOT modify:
 
 ## Evaluation
 
-The evaluation runs `benchmarks/run_benchmarks.sh`, which executes every TS/Python
-benchmark pair and writes real timing data to `benchmarks/results.json`. The metric
-is the number of benchmarks that appear in that regenerated file — i.e. the number
-of function pairs whose benchmarks actually ran to completion and produced valid
-JSON output. This means a benchmark pair is only "counted" if it truly runs, and
-the committed `benchmarks/results.json` always reflects real data that the
-`pages.yml` workflow will copy to the playground on merge to `main`.
-
 ```bash
-set -euo pipefail
-
-# Ensure Python and pandas are available
+# Set up Python environment if needed
 if ! command -v python3 &>/dev/null; then
-  echo "ERROR: python3 is required but not found" >&2
-  exit 1
-fi
-python3 -c "import pandas" 2>/dev/null || pip3 install pandas --quiet
-
-# Ensure Bun is available (install if missing — autoloop runners may not have it).
-# Failure to install Bun is logged but does not abort the script, because we must
-# still emit the final metric line for autoloop to parse.
-if ! command -v bun &>/dev/null; then
-  curl -fsSL https://bun.sh/install | bash || echo "WARN: bun install script failed" >&2
-  export PATH="$HOME/.bun/bin:$PATH"
-fi
-if ! command -v bun &>/dev/null; then
-  echo "ERROR: bun is not available after install attempt; benchmarks will fail" >&2
+  echo "Python3 not found, skipping"
 fi
+pip3 install pandas --quiet 2>/dev/null || true
+
+# Count the number of benchmark pairs (functions with both TS and Python benchmarks)
+ts_benchmarks=$(ls benchmarks/tsb/bench_*.ts 2>/dev/null | wc -l | tr -d ' ')
+py_benchmarks=$(ls benchmarks/pandas/bench_*.py 2>/dev/null | wc -l | tr -d ' ')
 
-# Install JS/TS dependencies so benchmark scripts can import from src/.
-# `|| true` keeps the script alive so the final metric is still emitted; any
-# errors are visible in the autoloop logs for debugging.
-bun install --silent || echo "WARN: bun install failed; benchmarks may fail to import src/" >&2
-
-# Run every benchmark pair and regenerate benchmarks/results.json with real data.
-# This is the file .github/workflows/pages.yml copies into the playground, so
-# committing it here is what makes real benchmark data appear on the pages site
-# once the autoloop branch is merged to main. Output is left visible so
-# per-benchmark failures can be diagnosed from autoloop logs; `|| true` ensures
-# we still reach the metric emission below if the script exits nonzero.
-bash benchmarks/run_benchmarks.sh || echo "WARN: run_benchmarks.sh exited nonzero" >&2
-
-# Metric: number of benchmark entries in the regenerated results.json.
-count=$(python3 -c "
-import json
-try:
-    with open('benchmarks/results.json') as f:
-        data = json.load(f)
-    print(len(data.get('benchmarks', [])))
-except Exception:
-    print(0)
-")
+# The metric is the minimum of the two (both must exist for a complete benchmark)
+if [ "$ts_benchmarks" -lt "$py_benchmarks" ]; then
+  count=$ts_benchmarks
+else
+  count=$py_benchmarks
+fi
 
 echo "{\"benchmarked_functions\": ${count:-0}}"
 ```

diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml
@@ -36,13 +36,6 @@ jobs:
       - name: Bundle TypeScript compiler for offline playground
         run: cp node_modules/typescript/lib/typescript.js ./playground/dist/typescript.js
 
-      - name: Copy benchmark results to playground
-        run: |
-          mkdir -p ./playground/benchmarks
-          if [ -f benchmarks/results.json ]; then
-            cp benchmarks/results.json ./playground/benchmarks/results.json
-          fi
-
       - name: Setup Python
         uses: actions/setup-python@v5
         with:
@@ -51,6 +44,14 @@ jobs:
       - name: Install Python dependencies
         run: pip install pandas numpy
 
+      - name: Run benchmarks
+        run: bash benchmarks/run_benchmarks.sh
+
+      - name: Copy benchmark results to playground
+        run: |
+          mkdir -p ./playground/benchmarks
+          cp benchmarks/results.json ./playground/benchmarks/results.json
+
       - name: Validate Python playground examples
         run: python scripts/validate-python-examples.py playground/