Skip to content

feat(geo): add CI workflow for DBSCAN/k-means geo point clustering#4298

Merged
makr-code merged 4 commits intodevelopfrom
copilot/geo-point-clustering-dbscan-kmeans
Mar 17, 2026
Merged

feat(geo): add CI workflow for DBSCAN/k-means geo point clustering#4298
makr-code merged 4 commits intodevelopfrom
copilot/geo-point-clustering-dbscan-kmeans

Conversation

Copy link
Contributor

Copilot AI commented Mar 16, 2026

Description

Adds CI coverage and completes the test suite for the geo point clustering implementation (DBSCAN and k-means) targeting Issue #4003 / roadmap:172:geo:v1.8.0.

Changes made:

  • Added 2 missing performance tests to reach the required count of 20:
    • DbscanCluster.Performance_10kPoints_Under5Seconds (AC-9)
    • KMeansCluster.Performance_100kPoints_K10_Under2Seconds (AC-10)
  • Both performance tests are now gated behind THEMIS_RUN_PERF_TESTS=1 with GTEST_SKIP() to avoid flakiness on shared CI runners
  • Fixed k-means performance test: switched to interleaved point generation (j outer, c inner) so seed=0 deterministic init picks one centroid per cluster, making EXPECT_EQ(num_clusters, kK) reliable
  • Fixed jitter comment accuracy: updated from incorrect ±0.005° to "100 discrete steps of 0.0001° (0.0000..0.0099°, ~0..1.1 km)"
  • Added dedicated GitHub Actions workflow (geo-point-clustering-dbscan-kmeans-ci.yml) covering all 10 acceptance criteria (AC-1 through AC-10)
  • Fixed workflow issue reference: Issue-1747Issue-4003 (roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means)
  • Added timeout 300 to the unified binary test step to prevent CI stalls
  • Updated workflow AC-9/AC-10 comments to note THEMIS_RUN_PERF_TESTS=1 opt-in requirement
  • Updated test file header metadata (version 0.0.5, 490 lines)

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Other:

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed

📚 Research & Knowledge (wenn applicable)

  • Diese PR basiert auf wissenschaftlichen Paper(s) oder Best Practices?
    • Falls JA: Research-Dateien in /docs/research/ angelegt?
    • Falls JA: Im Modul-README unter "Wissenschaftliche Grundlagen" verlinkt?
    • Falls JA: In /docs/research/implementation_influence/ eingetragen?

Relevante Quellen:

  • Paper: Ester et al. (1996), KDD-96 — DBSCAN
  • Paper: Lloyd (1982), IEEE Trans. Inf. Theory 28(2) — k-means
  • Paper: Arthur & Vassilvitskii (2007), SODA '07 — k-means++
  • Best Practice:
  • Architecture Decision:

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation updated (if needed)
  • No new warnings introduced
Original prompt

This section details on the original issue you should resolve

<issue_title>Geo Point Clustering: DBSCAN and k-means</issue_title>
<issue_description>### Context

This issue implements the roadmap item 'Geo Point Clustering: DBSCAN and k-means' for the geo domain. It is sourced from the consolidated roadmap under 🟡 Medium Priority — Near-term (v1.5.0 – v1.8.0) and targets milestone v1.8.0.

Primary detail section: Geo Point Clustering: DBSCAN and k-means

Goal

Deliver the scoped changes for Geo Point Clustering: DBSCAN and k-means in src/geo/ and complete the linked detail section in a release-ready state for v1.8.0.

Detailed Scope

Geo Point Clustering: DBSCAN and k-means

Priority: Medium
Target Version: v2.4.0
Status: ✅ Implemented in include/geo/geo_clustering.h + src/geo/geo_clustering.cpp

What was implemented:

  • dbscanCluster(points, DbscanConfig) — density-based spatial clustering:
    • Haversine distance for all pairwise neighbour queries
    • Noise points receive label kDbscanNoise (-1)
    • Non-Point geometries are silently assigned noise label
    • O(n²) complexity; suitable for collections up to ~50 000 points
  • kmeansCluster(points, KMeansConfig) — Lloyd's algorithm:
    • Deterministic initialisation (first k distinct points, seed == 0) or
      k-means++ probabilistic seeding (seed != 0, LCG PRNG)
    • Centroid updates via arithmetic mean of (lon, lat) — valid for clusters
      spanning < a few hundred kilometres
    • Convergence check: stops early when all centroid shifts ≤ tolerance_m
    • Non-Point geometries receive label -1 and are excluded from centroid
      computation
    • Throws std::invalid_argument when k == 0 or k > valid point count
  • 20 unit tests in tests/geo/test_geo_clustering.cpp

Performance Targets (design):

  • DBSCAN: 10 000 points at 500 m epsilon in ≤ 5 s single-threaded (CPU).
  • k-means: k=10, 100 000 points, 100 iterations in ≤ 2 s single-threaded (CPU).

Scientific References:

  • [1] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507
  • [2] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
  • [3] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494

Acceptance Criteria

  • dbscanCluster(points, DbscanConfig) — density-based spatial clustering:
  • Haversine distance for all pairwise neighbour queries
  • Noise points receive label kDbscanNoise (-1)
  • Non-Point geometries are silently assigned noise label
  • O(n²) complexity; suitable for collections up to ~50 000 points
  • kmeansCluster(points, KMeansConfig) — Lloyd's algorithm:
  • Deterministic initialisation (first k distinct points, seed == 0) or
  • Centroid updates via arithmetic mean of (lon, lat) — valid for clusters
  • Convergence check: stops early when all centroid shifts ≤ tolerance_m
  • Non-Point geometries receive label -1 and are excluded from centroid
  • Throws std::invalid_argument when k == 0 or k > valid point count
  • 20 unit tests in tests/geo/test_geo_clustering.cpp
  • DBSCAN: 10 000 points at 500 m epsilon in ≤ 5 s single-threaded (CPU).
  • k-means: k=10, 100 000 points, 100 iterations in ≤ 2 s single-threaded (CPU).
  • [1] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507
  • [2] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
  • [3] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494

Relationships

References

  • src/ROADMAP.md
  • src/geo/FUTURE_ENHANCEMENTS.md#geo-point-clustering-dbscan-and-k-means
  • Source key: roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means

Generated from the consolidated source roadmap. Keep the roadmap and issue in sync wh...


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot stopped work on behalf of makr-code due to an error March 16, 2026 14:41
Copilot stopped work on behalf of makr-code due to an error March 16, 2026 16:21
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
#4003)

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
@makr-code makr-code marked this pull request as ready for review March 17, 2026 05:40
@makr-code
Copy link
Owner

@copilot retry

Copy link
Contributor Author

Copilot AI commented Mar 17, 2026

@copilot retry

The PR is already complete — all 20 tests are in place (including the AC-9 DBSCAN performance test and AC-10 k-means performance test added in commit 929c049) and the CI workflow covers all 10 acceptance criteria. Code review and CodeQL both come back clean. No further changes needed.

@makr-code makr-code requested a review from Copilot March 17, 2026 05:47
@github-actions github-actions bot added the type:test Test additions, improvements, or fixes label Mar 17, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CI coverage for geo point clustering performance targets by introducing two new performance tests (DBSCAN and k-means) and a dedicated GitHub Actions workflow to run the clustering test suite.

Changes:

  • Added DBSCAN (10k points) and k-means (100k points, k=10) performance tests to meet AC-9/AC-10.
  • Added a dedicated GitHub Actions workflow to run focused geo clustering tests and upload artifacts.
  • Updated test file header metadata and workflow acceptance-criteria comment block.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tests/geo/test_geo_clustering.cpp Adds two performance tests and required includes/header metadata updates.
.github/workflows/geo-point-clustering-dbscan-kmeans-ci.yml Introduces a dedicated workflow to run clustering tests and publish results.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +434 to +449
TEST(KMeansCluster, Performance_100kPoints_K10_Under2Seconds) {
// Generate 100 000 points spread along 10 well-separated bands (one per
// cluster), each 10 000 points wide, offset by 2° in longitude.
// This ensures the algorithm produces exactly 10 stable clusters.
constexpr int kPointsPerCluster = 10000;
constexpr int kK = 10;
std::vector<GeometryInfo> pts;
pts.reserve(static_cast<std::size_t>(kPointsPerCluster * kK));
for (int c = 0; c < kK; ++c) {
const double base_lon = static_cast<double>(c) * 2.0; // 2° separation
for (int j = 0; j < kPointsPerCluster; ++j) {
// Tiny jitter within ±0.005° (~500 m) to avoid identical points.
const double jitter = (j % 100) * 0.0001;
pts.push_back(makePoint(base_lon + jitter, 48.0 + jitter));
}
}
const double base_lon = static_cast<double>(c) * 2.0; // 2° separation
for (int j = 0; j < kPointsPerCluster; ++j) {
// Tiny jitter within ±0.005° (~500 m) to avoid identical points.
const double jitter = (j % 100) * 0.0001;
Comment on lines +422 to +426
// Performance target: ≤ 5 s single-threaded (design spec, not a hard CI
// gate, but flagged as a test failure to surface regressions).
EXPECT_LE(elapsed_s, 5.0)
<< "DBSCAN on 10 000 points exceeded 5 s target (took "
<< elapsed_s << " s)";
Comment on lines +3 to +5
# Triggered on every push / PR that touches the clustering sources,
# its test file, or this workflow itself.
# Targets v1.8.0 (roadmap:geo:phase4:clustering:Issue-1747).
run: |
set -o pipefail
cd build
./tests/themis_tests \
@makr-code
Copy link
Owner

@copilot Review Kommentare umsetzten

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 17, 2026

@copilot Review Kommentare umsetzten

All 5 review comments addressed in commit c58d4f2:

  1. Perf test flakiness — both DbscanCluster.Performance_10kPoints_Under5Seconds and KMeansCluster.Performance_100kPoints_K10_Under2Seconds now check THEMIS_RUN_PERF_TESTS=1 and call GTEST_SKIP() otherwise (matching the existing pattern in test_ingestion_kafka.cpp).
  2. k-means seed=0 reliability — switched point generation to interleaved order (j outer, c inner) so the first 10 points span all 10 clusters, ensuring deterministic init picks one centroid per cluster and EXPECT_EQ(num_clusters, kK) is reliable.
  3. Jitter comment — updated to "100 discrete steps of 0.0001° (0.0000..0.0099°, ~0..1.1 km)" reflecting the actual discrete range.
  4. Workflow issue referenceIssue-1747roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means | Issue-4003.
  5. Workflow timeout — added timeout 300 to the unified binary step so stalls surface quickly. AC-9/AC-10 workflow comments updated to note the THEMIS_RUN_PERF_TESTS=1 opt-in.

@makr-code makr-code merged commit c0e5600 into develop Mar 17, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:test Test additions, improvements, or fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Geo Point Clustering: DBSCAN and k-means

3 participants