Merged
63 changes: 14 additions & 49 deletions .autoloop/programs/perf-comparison/program.md
@@ -24,7 +24,7 @@ This is an open-ended program — it runs continuously, always adding the next b
 - Outputs the same JSON format
 5. **Update `playground/benchmarks.html`** if needed to display the new function's comparison metrics.

-The evaluation step (below) runs `benchmarks/run_benchmarks.sh` to execute **every** TS/Python benchmark pair and regenerates `benchmarks/results.json` with the real timing data. That regenerated file is what gets committed on a successful iteration, so when the autoloop branch is merged to `main`, the pages workflow (`.github/workflows/pages.yml`) picks up the real results and `playground/benchmarks.html` renders real comparison data instead of "No benchmark data available yet."
+The autoloop iteration only needs to add the benchmark scripts; it does **not** need to run them or update `benchmarks/results.json`. The pages workflow (`.github/workflows/pages.yml`) executes `benchmarks/run_benchmarks.sh` on every push to `main` and publishes the regenerated `results.json` to the playground site, so real benchmark data appears on `playground/benchmarks.html` once the autoloop branch is merged.

 ### Key constraints

@@ -50,58 +50,23 @@ Do NOT modify:

 ## Evaluation

-The evaluation runs `benchmarks/run_benchmarks.sh`, which executes every TS/Python
-benchmark pair and writes real timing data to `benchmarks/results.json`. The metric
-is the number of benchmarks that appear in that regenerated file — i.e. the number
-of function pairs whose benchmarks actually ran to completion and produced valid
-JSON output. This means a benchmark pair is only "counted" if it truly runs, and
-the committed `benchmarks/results.json` always reflects real data that the
-`pages.yml` workflow will copy to the playground on merge to `main`.
-
 ```bash
-set -euo pipefail
-
-# Ensure Python and pandas are available
+# Set up Python environment if needed
 if ! command -v python3 &>/dev/null; then
-  echo "ERROR: python3 is required but not found" >&2
-  exit 1
-fi
-python3 -c "import pandas" 2>/dev/null || pip3 install pandas --quiet
-
-# Ensure Bun is available (install if missing — autoloop runners may not have it).
-# Failure to install Bun is logged but does not abort the script, because we must
-# still emit the final metric line for autoloop to parse.
-if ! command -v bun &>/dev/null; then
-  curl -fsSL https://bun.sh/install | bash || echo "WARN: bun install script failed" >&2
-  export PATH="$HOME/.bun/bin:$PATH"
-fi
-if ! command -v bun &>/dev/null; then
-  echo "ERROR: bun is not available after install attempt; benchmarks will fail" >&2
+  echo "Python3 not found, skipping"
 fi
+pip3 install pandas --quiet 2>/dev/null || true
+
+# Count the number of benchmark pairs (functions with both TS and Python benchmarks)
+ts_benchmarks=$(ls benchmarks/tsb/bench_*.ts 2>/dev/null | wc -l | tr -d ' ')
+py_benchmarks=$(ls benchmarks/pandas/bench_*.py 2>/dev/null | wc -l | tr -d ' ')

-# Install JS/TS dependencies so benchmark scripts can import from src/.
-# `|| true` keeps the script alive so the final metric is still emitted; any
-# errors are visible in the autoloop logs for debugging.
-bun install --silent || echo "WARN: bun install failed; benchmarks may fail to import src/" >&2
-
-# Run every benchmark pair and regenerate benchmarks/results.json with real data.
-# This is the file .github/workflows/pages.yml copies into the playground, so
-# committing it here is what makes real benchmark data appear on the pages site
-# once the autoloop branch is merged to main. Output is left visible so
-# per-benchmark failures can be diagnosed from autoloop logs; `|| true` ensures
-# we still reach the metric emission below if the script exits nonzero.
-bash benchmarks/run_benchmarks.sh || echo "WARN: run_benchmarks.sh exited nonzero" >&2
-
-# Metric: number of benchmark entries in the regenerated results.json.
-count=$(python3 -c "
-import json
-try:
-    with open('benchmarks/results.json') as f:
-        data = json.load(f)
-    print(len(data.get('benchmarks', [])))
-except Exception:
-    print(0)
-")
+# The metric is the minimum of the two (both must exist for a complete benchmark)
+if [ "$ts_benchmarks" -lt "$py_benchmarks" ]; then
+  count=$ts_benchmarks
+else
+  count=$py_benchmarks
+fi

 echo "{\"benchmarked_functions\": ${count:-0}}"
 ```
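Review note: the new metric is the smaller of the two file counts, so it does not check that each TS benchmark has a same-named Python counterpart. The sketch below reproduces that metric in Python for local checking, using the `benchmarks/tsb/bench_*.ts` and `benchmarks/pandas/bench_*.py` globs from the script. The function name `count_benchmark_pairs` and the extra `matched_pairs` field are illustrative assumptions and are not part of this PR.

```python
from pathlib import Path


def count_benchmark_pairs(root: str = ".") -> dict:
    """Mirror the shell metric: min(number of TS files, number of Python files).

    `matched_pairs` is a hypothetical stricter count (names present on both
    sides); the script in this PR does not compute it.
    """
    base = Path(root) / "benchmarks"
    # Collect file stems such as "bench_sum" from each side.
    ts_names = {p.stem for p in (base / "tsb").glob("bench_*.ts")}
    py_names = {p.stem for p in (base / "pandas").glob("bench_*.py")}
    return {
        "benchmarked_functions": min(len(ts_names), len(py_names)),
        "matched_pairs": len(ts_names & py_names),
    }
```

If the two numbers differ, at least one benchmark on the smaller side has no same-named counterpart, which the file-count metric would not reveal.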
15 changes: 8 additions & 7 deletions .github/workflows/pages.yml
@@ -36,13 +36,6 @@ jobs:
       - name: Bundle TypeScript compiler for offline playground
         run: cp node_modules/typescript/lib/typescript.js ./playground/dist/typescript.js

-      - name: Copy benchmark results to playground
-        run: |
-          mkdir -p ./playground/benchmarks
-          if [ -f benchmarks/results.json ]; then
-            cp benchmarks/results.json ./playground/benchmarks/results.json
-          fi
-
       - name: Setup Python
         uses: actions/setup-python@v5
         with:
@@ -51,6 +44,14 @@ jobs:
       - name: Install Python dependencies
         run: pip install pandas numpy

+      - name: Run benchmarks
+        run: bash benchmarks/run_benchmarks.sh
+
+      - name: Copy benchmark results to playground
+        run: |
+          mkdir -p ./playground/benchmarks
+          cp benchmarks/results.json ./playground/benchmarks/results.json
+
       - name: Validate Python playground examples
         run: python scripts/validate-python-examples.py playground/
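Review note: with the `if [ -f ... ]` guard removed, the copy step fails when `benchmarks/results.json` is missing, but a file that exists with no entries would still be published. A minimal sketch of a check that could run before the copy, assuming the file keeps a top-level `benchmarks` list (the shape read by the metric snippet removed from `program.md`). The function name `count_results` is hypothetical and not part of this PR.

```python
import json
from pathlib import Path


def count_results(path: str = "benchmarks/results.json") -> int:
    """Return the number of entries under the top-level "benchmarks" key.

    Returns 0 when the file is missing, is not valid JSON, or does not have
    the assumed {"benchmarks": [...]} shape.
    """
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return 0
    if not isinstance(data, dict):
        return 0
    entries = data.get("benchmarks", [])
    return len(entries) if isinstance(entries, list) else 0
```

A workflow step could call this and exit nonzero on a zero count, so an empty results file stops the deploy instead of reaching the playground.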
