Add Performance Benchmarks for Collection Throughput #213

@leostar0412

Description

Problem

No benchmarks exist for collection throughput, API rate-limit compliance, or large-dataset import performance. The system may process thousands of GitHub commits, Slack messages, or Discord events per run, yet the CI pipeline has no regression check for performance. Degradation would surface only as increased production latency or, worse, as silent timeouts that cause partial data collection. The 90% coverage gate verifies correctness but not performance characteristics.

Acceptance Criteria

  • Add a benchmarks/ directory with at least two benchmark scenarios: (a) GitHub collector processing N mock commits; (b) a service-layer bulk insert of N records
  • Use pytest-benchmark or a timing harness that emits machine-readable results (JSON)
  • Establish baseline numbers in a checked-in reference file or CI artifact
  • Add a CI step (manual or nightly is fine; it need not run on every PR) that runs the benchmarks and flags any regression of more than 25% against the baseline
  • Document how to run benchmarks locally in CONTRIBUTING.md
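
The regression check in the criteria above could look roughly like the sketch below. It assumes both the baseline and the current run are pytest-benchmark JSON reports, which carry a top-level "benchmarks" list whose entries have "fullname" and "stats" with a "mean" in seconds. The function names (mean_times, find_regressions, check_files) and the choice of mean runtime as the compared metric are illustrative assumptions, not something this issue prescribes.

```python
from __future__ import annotations

import json
from pathlib import Path


def mean_times(report: dict) -> dict[str, float]:
    """Map each benchmark's fullname to its mean runtime in seconds."""
    return {b["fullname"]: b["stats"]["mean"] for b in report["benchmarks"]}


def find_regressions(
    baseline: dict, current: dict, threshold: float = 0.25
) -> list[tuple[str, float]]:
    """Return (name, slowdown) pairs where the slowdown exceeds the threshold.

    A slowdown of 0.30 means the current mean is 30% above the baseline mean.
    Benchmarks missing from the baseline are skipped rather than flagged.
    """
    base = mean_times(baseline)
    regressions = []
    for name, mean in mean_times(current).items():
        if name not in base or base[name] <= 0:
            continue
        slowdown = mean / base[name] - 1.0
        if slowdown > threshold:
            regressions.append((name, slowdown))
    return regressions


def check_files(baseline_path: str, current_path: str, threshold: float = 0.25) -> int:
    """Compare two report files; return 1 if any regression is found, else 0."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    regressions = find_regressions(baseline, current, threshold)
    for name, slowdown in regressions:
        print(f"REGRESSION {name}: {slowdown:.0%} slower than baseline")
    return 1 if regressions else 0
```

A CI step could call check_files with the checked-in reference file and the fresh report, then use the return value as the process exit code so the job fails on a regression.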

Implementation Notes

Start with the highest-volume collector (likely github_activity_tracker). Mock the API responses and measure three things: records per second through the service layer, memory high-water mark, and database write throughput. Use pytest-benchmark with --benchmark-json output. The mock fixtures from the existing test suite can be extended for this purpose. Put the benchmarks behind a dedicated pytest marker (@pytest.mark.benchmark) so they can be excluded from normal CI runs and don't slow them down.
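
As a starting point for the bulk-insert scenario, here is a minimal stdlib timing harness that reports records per second and a memory high-water mark as JSON-serializable data. It is a sketch under stated assumptions: the in-memory SQLite table, the commits schema, and the helpers make_mock_commits and bulk_insert are hypothetical stand-ins for the project's real service layer and fixtures. Note that tracemalloc only tracks Python-level allocations, so the peak figure excludes memory allocated inside SQLite itself.

```python
import json
import sqlite3
import time
import tracemalloc


def make_mock_commits(n):
    """Build n fake commit rows (hypothetical shape: sha, author, message)."""
    return [(f"{i:040x}", f"author{i % 10}", f"commit message {i}") for i in range(n)]


def bulk_insert(conn, records):
    """Stand-in for the service-layer bulk insert: one transaction, one batch."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO commits (sha, author, message) VALUES (?, ?, ?)", records
        )


def run_bulk_insert_benchmark(n=10_000):
    """Time a bulk insert of n mock records and return a JSON-ready dict."""
    records = make_mock_commits(n)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE commits (sha TEXT PRIMARY KEY, author TEXT, message TEXT)"
    )

    tracemalloc.start()
    start = time.perf_counter()
    bulk_insert(conn, records)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()

    stored = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
    conn.close()
    return {
        "scenario": "bulk_insert",
        "records": stored,
        "seconds": elapsed,
        "records_per_second": stored / elapsed if elapsed > 0 else None,
        "peak_python_bytes": peak,
    }
```

Calling json.dumps(run_bulk_insert_benchmark(10_000), indent=2) yields a machine-readable result that can be saved as a CI artifact. A single timed run is noisy, so a pytest-benchmark version of the same scenario, which repeats the measurement and reports statistics, would be the more robust choice once the dependency is in place.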
