
Add GitHub Actions workflow for benchmark load tests with baseline comparison #801

Closed
Copilot wants to merge 10 commits into main from copilot/add-load-test-github-action


Copilot AI commented Nov 21, 2025

Automates load testing on PRs touching websocket server or benchmark code. Posts formatted results with baseline comparison as PR comments.

Implementation

  • Workflow trigger: PR changes to websocket_server/, tools/benchmarks/, or workflow file; manual via workflow_dispatch
  • CI-optimized execution: 50 and 100 concurrent users, 45 s per run (vs. the full 50, 100, 300, 500 set)
  • Baseline comparison: Reads baseline from tools/benchmarks/results.json and shows deltas with color indicators (🟢 improvement, 🔴 regression, ⚪ no change)
  • Result formatting: Markdown table with CPU/memory/network/message metrics + collapsible JSON details
  • Error handling: JSON validation via Node.js, graceful failure messaging, artifact upload on all outcomes
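
The delta indicators described above could be produced roughly like this. This is a minimal sketch, not the workflow's actual code: the helper name `formatDelta`, its parameters, and the rounding rule are assumptions, and it treats lower values as better, which fits CPU, memory, and dropped connections.

```javascript
// Hypothetical helper (not taken from the PR): formats the change against a
// baseline value with the legend's indicators. Assumes lower is better.
function formatDelta(current, baseline, digits = 2, suffix = '') {
  // Round first so tiny floating-point noise does not show up as a change.
  const rounded = Number((current - baseline).toFixed(digits));
  if (rounded === 0) return `(⚪0${suffix})`;
  const icon = rounded < 0 ? '🟢' : '🔴';
  const sign = rounded > 0 ? '+' : '-';
  return `(${icon}${sign}${Math.abs(rounded).toFixed(digits)}${suffix})`;
}

// Values taken from the example table: 11.50% against a 10.47% baseline.
console.log(`${(11.5).toFixed(2)}% ${formatDelta(11.5, 10.47, 2, '%')}`);
```

For metrics where higher is better, the icon choice would need to be inverted; the thread does not define any such metric.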

Example Output

The workflow produces PR comments with baseline comparison including network and message metrics:

## 🚀 Benchmark Load Test Results

Comparison with baseline from `tools/benchmarks/results.json`.
Legend: 🟢 improvement, 🔴 regression, ⚪ no change

| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Bytes Sent | Bytes Received | Messages Sent | Messages Received | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|------------|----------------|---------------|-------------------|--------------------|
| 50 | 11.50% (🔴+1.03%) | 50.00% (🔴+2.30%) | 250.00 (🔴+7.23) | 270.00 (🔴+6.61) | 908,445 (🔴+759,123) | 70,074,578 (🔴+46,579,353) | 5,323 | 109,385 | 0 (⚪0) |
| 100 | 18.50% (🟢-1.37%) | 52.00% (🟢-3.60%) | 350.00 (🟢-13.73) | 365.00 (🟢-5.77) | 1,800,000 (🔴+1,500,981) | 140,000,000 (🔴+43,389,016) | 10,500 | 218,000 | 0 (⚪0) |

<details>
<summary>📊 50 concurrent users</summary>
...detailed JSON...
</details>

Network byte metrics show diffs compared to baseline, while message counts are new metrics not in the baseline. Large numbers are formatted with thousand separators for readability.
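
One way to get the thousand separators is `Number.prototype.toLocaleString` with a fixed locale. The helper names below are made up for illustration, and whether the workflow formats numbers this way is an assumption.

```javascript
// Illustrative helpers; the names are hypothetical.
function formatCount(n) {
  // A fixed locale keeps the output stable regardless of the runner's settings.
  return n.toLocaleString('en-US');
}

function formatSignedCount(n) {
  const sign = n > 0 ? '+' : n < 0 ? '-' : '';
  return `${sign}${formatCount(Math.abs(n))}`;
}

console.log(formatCount(70074578));     // 70,074,578
console.log(formatSignedCount(759123)); // +759,123
```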

The workflow runs the existing tools/benchmarks/runBenchmarks.mjs script, extracts the JSON from its output with sed, and validates it with Node.js JSON.parse.
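
The sed expression is not shown in this thread, so the sketch below does the extraction in JavaScript instead and only illustrates the validate-before-use idea. The function name and the slicing rule (first opening bracket to last closing bracket) are assumptions.

```javascript
// Hypothetical sketch: pull a JSON document out of mixed log output and
// validate it with JSON.parse. Not the workflow's actual extraction logic.
function parseBenchmarkOutput(stdout) {
  const start = stdout.search(/[\[{]/);
  const end = Math.max(stdout.lastIndexOf('}'), stdout.lastIndexOf(']'));
  if (start === -1 || end < start) {
    return { ok: false, error: 'no JSON found in output' };
  }
  try {
    return { ok: true, data: JSON.parse(stdout.slice(start, end + 1)) };
  } catch (err) {
    // Report the failure instead of throwing so the caller can still post a
    // failure message and upload artifacts.
    return { ok: false, error: `invalid JSON: ${err.message}` };
  }
}

const sample = 'Starting benchmark\n{"concurrency": 50, "cpu": {"average": 31.09, "peak": 132}}\nDone';
console.log(parseBenchmarkOutput(sample).ok);
```

Log lines that themselves contain brackets would defeat this simple slicing, in which case the function reports invalid JSON rather than returning partial data.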

Original prompt

On benchmarks, I want to have a github action that automatically runs the load test as described in the readme and prints the result. If it runs on a pull request I want to post a comment with the result



Copilot AI and others added 5 commits November 21, 2025 08:01
Co-authored-by: juliusknorr <3404133+juliusknorr@users.noreply.github.com>
@github-actions
Contributor

🚀 Benchmark Load Test Results

These results show the performance characteristics of the websocket server under load.

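| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|--------------------|
| 50 | 31.09% | 132.00% | 216.55 | 223.71 | 0 |
| 100 | 63.76% | 145.00% | 418.70 | 498.18 | 0 |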

Details

📊 50 concurrent users
{
  "concurrency": 50,
  "cpu": {
    "average": 31.091304347826078,
    "peak": 132
  },
  "memory": {
    "averageRssMb": 216.54976222826087,
    "peakRssMb": 223.71484375
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 50,
    "activeSenders": 15,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 457958,
    "bytesReceived": 28560682,
    "messagesSent": 2697,
    "messagesReceived": 72316,
    "averageJoinDelayMs": 109.18,
    "droppedConnections": 0
  }
}
📊 100 concurrent users
{
  "concurrency": 100,
  "cpu": {
    "average": 63.76382978723404,
    "peak": 145
  },
  "memory": {
    "averageRssMb": 418.6961436170213,
    "peakRssMb": 498.17578125
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 100,
    "activeSenders": 30,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 917154,
    "bytesReceived": 103672940,
    "messagesSent": 5374,
    "messagesReceived": 182705,
    "averageJoinDelayMs": 239.39,
    "droppedConnections": 0
  }
}

Note: These benchmarks run with reduced load (50, 100 users) for CI efficiency. For full benchmarks, see the README.

Copilot AI changed the title [WIP] Add GitHub action for automated load testing Add GitHub Actions workflow for benchmark load tests Nov 21, 2025
Copilot AI requested a review from juliusknorr November 21, 2025 08:10
@juliusknorr
Member

@copilot I'd like to store a baseline of the output in the repo and when posting the pull request comment show the diffed values.

Co-authored-by: juliusknorr <3404133+juliusknorr@users.noreply.github.com>
Author

Copilot AI commented Nov 21, 2025

@copilot I'd like to store a baseline of the output in the repo and when posting the pull request comment show the diffed values.

Implemented in commit a042309. The workflow now:

  1. Reads baseline from tools/benchmarks/results.json
  2. Compares current results with baseline
  3. Shows deltas with color indicators:
    • 🟢 improvements (lower CPU/memory, fewer dropped connections)
    • 🔴 regressions (higher CPU/memory, more dropped connections)
    • ⚪ no change

Example output:

| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|--------------------|
| 50 | 11.50% (🔴+1.03%) | 50.00% (🔴+2.30%) | 250.00 (🔴+7.23) | 270.00 (🔴+6.61) | 0 (⚪0) |
| 100 | 18.50% (🟢-1.37%) | 52.00% (🟢-3.60%) | 350.00 (🟢-13.73) | 365.00 (🟢-5.77) | 0 (⚪0) |

Falls back to regular display if baseline is not found.
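
The compare-or-fall-back step might look like the following sketch. It assumes the baseline is an array of entries keyed by `concurrency` with the same `cpu` and `memory` fields as the current results; the real layout of tools/benchmarks/results.json is not shown in this thread, and the function name is hypothetical.

```javascript
// Hypothetical sketch: compute raw deltas against a matching baseline entry,
// or return null so the caller renders plain values without a diff.
function diffAgainstBaseline(result, baseline) {
  const base = Array.isArray(baseline)
    ? baseline.find((entry) => entry.concurrency === result.concurrency)
    : undefined;
  if (!base) return null; // no baseline available: fall back to regular display
  return {
    avgCpu: result.cpu.average - base.cpu.average,
    peakCpu: result.cpu.peak - base.cpu.peak,
    avgMemoryMb: result.memory.averageRssMb - base.memory.averageRssMb,
    peakMemoryMb: result.memory.peakRssMb - base.memory.peakRssMb,
  };
}

// Numbers chosen to match the example row for 50 users.
const current = { concurrency: 50, cpu: { average: 11.5, peak: 50 }, memory: { averageRssMb: 250, peakRssMb: 270 } };
const baseline = [{ concurrency: 50, cpu: { average: 10.47, peak: 47.7 }, memory: { averageRssMb: 242.77, peakRssMb: 263.39 } }];
console.log(diffAgainstBaseline(current, baseline));
console.log(diffAgainstBaseline(current, null)); // null
```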

Co-authored-by: juliusknorr <3404133+juliusknorr@users.noreply.github.com>
Copilot AI changed the title Add GitHub Actions workflow for benchmark load tests Add GitHub Actions workflow for benchmark load tests with baseline comparison Nov 21, 2025
@github-actions
Contributor

🚀 Benchmark Load Test Results

Comparison with baseline from tools/benchmarks/results.json.
Legend: 🟢 improvement, 🔴 regression, ⚪ no change

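| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|--------------------|
| 50 | 0.00% (🟢-10.47%) | 0.00% (🟢-47.70%) | 224.00 (🟢-18.77) | 245.34 (🟢-18.05) | 0 (⚪0) |
| 100 | 0.00% (🟢-19.87%) | 0.00% (🟢-55.60%) | 327.62 (🟢-36.11) | 369.86 (🟢-0.91) | 0 (⚪0) |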

Details

📊 50 concurrent users
{
  "concurrency": 50,
  "cpu": {
    "average": 0,
    "peak": 0
  },
  "memory": {
    "averageRssMb": 223.9958444148936,
    "peakRssMb": 245.3359375
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 50,
    "activeSenders": 15,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 457512,
    "bytesReceived": 34522513,
    "messagesSent": 2694,
    "messagesReceived": 104054,
    "averageJoinDelayMs": 142.28,
    "droppedConnections": 0
  }
}
📊 100 concurrent users
{
  "concurrency": 100,
  "cpu": {
    "average": 0,
    "peak": 0
  },
  "memory": {
    "averageRssMb": 327.6195977393617,
    "peakRssMb": 369.85546875
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 100,
    "activeSenders": 30,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 915782,
    "bytesReceived": 103712476,
    "messagesSent": 5366,
    "messagesReceived": 250862,
    "averageJoinDelayMs": 317.19,
    "droppedConnections": 0
  }
}

Note: These benchmarks run with reduced load (50, 100 users) for CI efficiency. For full benchmarks, see the README.

@github-actions
Contributor

🚀 Benchmark Load Test Results

Comparison with baseline from tools/benchmarks/results.json.
Legend: 🟢 improvement, 🔴 regression, ⚪ no change

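| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|--------------------|
| 50 | 0.00% (🟢-10.47%) | 0.00% (🟢-47.70%) | 167.21 (🟢-75.56) | 179.20 (🟢-84.19) | 0 (⚪0) |
| 100 | 0.00% (🟢-19.87%) | 0.00% (🟢-55.60%) | 281.16 (🟢-82.57) | 307.07 (🟢-63.70) | 0 (⚪0) |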

Details

📊 50 concurrent users
{
  "concurrency": 50,
  "cpu": {
    "average": 0,
    "peak": 0
  },
  "memory": {
    "averageRssMb": 167.21017323369566,
    "peakRssMb": 179.203125
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 50,
    "activeSenders": 15,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 456453,
    "bytesReceived": 25743126,
    "messagesSent": 2688,
    "messagesReceived": 68888,
    "averageJoinDelayMs": 259.12,
    "droppedConnections": 0
  }
}
📊 100 concurrent users
{
  "concurrency": 100,
  "cpu": {
    "average": 0,
    "peak": 0
  },
  "memory": {
    "averageRssMb": 281.1586277173913,
    "peakRssMb": 307.07421875
  },
  "network": {
    "bytesIn": 0,
    "bytesOut": 0
  },
  "loadSummary": {
    "serverUrl": "http://127.0.0.1:3002",
    "roomId": "benchmark-room",
    "concurrency": 100,
    "activeSenders": 30,
    "activeRatio": 0.3,
    "durationSeconds": 45,
    "updateRate": 3,
    "bytesSent": 908445,
    "bytesReceived": 70074578,
    "messagesSent": 5323,
    "messagesReceived": 109385,
    "averageJoinDelayMs": 607.58,
    "droppedConnections": 0
  }
}

Note: These benchmarks run with reduced load (50, 100 users) for CI efficiency. For full benchmarks, see the README.

@juliusknorr
Member

@copilot Those values should also be included in the table and diff

"bytesSent": 908445,
"bytesReceived": 70074578,
"messagesSent": 5323,
"messagesReceived": 109385,

Co-authored-by: juliusknorr <3404133+juliusknorr@users.noreply.github.com>
Author

Copilot AI commented Nov 21, 2025

@copilot Those values should also be included in the table and diff

"bytesSent": 908445,...

Implemented in commit de03398. Added four new columns to the results table:

  • Bytes Sent - with diff vs baseline ingressBytes
  • Bytes Received - with diff vs baseline egressBytes
  • Messages Sent - formatted with thousand separators
  • Messages Received - formatted with thousand separators

Example output:

| Concurrent Users | Avg CPU | Peak CPU | Avg Memory (MB) | Peak Memory (MB) | Bytes Sent | Bytes Received | Messages Sent | Messages Received | Dropped Connections |
|------------------|---------|----------|-----------------|------------------|------------|----------------|---------------|-------------------|--------------------|
| 50 | 11.50% (🔴+1.03%) | 50.00% (🔴+2.30%) | 250.00 (🔴+7.23) | 270.00 (🔴+6.61) | 908,445 (🔴+759,123) | 70,074,578 (🔴+46,579,353) | 5,323 | 109,385 | 0 (⚪0) |

Network bytes show diffs compared to baseline, while message counts are new metrics not in the baseline.
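
As a rough sketch of that split, byte counters get a delta against the baseline fields named above (`ingressBytes`, `egressBytes`), while message counters carry no delta. The function name and the flat baseline entry shape are assumptions, and the baseline numbers in the usage example are back-computed from the example row, not read from results.json.

```javascript
// Hypothetical sketch: which network columns get a diff and which do not.
function networkColumns(loadSummary, baselineEntry) {
  const diff = (current, base) =>
    typeof base === 'number' ? current - base : null;
  return {
    bytesSent: {
      value: loadSummary.bytesSent,
      delta: diff(loadSummary.bytesSent, baselineEntry?.ingressBytes),
    },
    bytesReceived: {
      value: loadSummary.bytesReceived,
      delta: diff(loadSummary.bytesReceived, baselineEntry?.egressBytes),
    },
    // Message counts are not in the baseline, so there is nothing to diff.
    messagesSent: { value: loadSummary.messagesSent, delta: null },
    messagesReceived: { value: loadSummary.messagesReceived, delta: null },
  };
}

const columns = networkColumns(
  { bytesSent: 908445, bytesReceived: 70074578, messagesSent: 5323, messagesReceived: 109385 },
  { ingressBytes: 149322, egressBytes: 23495225 } // back-computed, illustrative only
);
console.log(columns.bytesSent.delta, columns.bytesReceived.delta); // 759123 46579353
```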

Co-authored-by: juliusknorr <3404133+juliusknorr@users.noreply.github.com>
@juliusknorr
Member

Closing, as I was just experimenting. I don't think the baseline comparison is useful yet, though we could think about other metrics to compare.
