feat(geo): add CI workflow for DBSCAN/k-means geo point clustering by Copilot · Pull Request #4298 · makr-code/ThemisDB

Copilot · 2026-03-16T13:30:34Z

Description

Adds CI coverage and completes the test suite for the geo point clustering implementation (DBSCAN and k-means) targeting Issue #4003 / roadmap:172:geo:v1.8.0.

Changes made:

Added 2 missing performance tests to reach the required count of 20:
- DbscanCluster.Performance_10kPoints_Under5Seconds (AC-9)
- KMeansCluster.Performance_100kPoints_K10_Under2Seconds (AC-10)
Both performance tests are now gated behind THEMIS_RUN_PERF_TESTS=1 with GTEST_SKIP() to avoid flakiness on shared CI runners
Fixed k-means performance test: switched to interleaved point generation (j outer, c inner) so seed=0 deterministic init picks one centroid per cluster, making EXPECT_EQ(num_clusters, kK) reliable
Fixed jitter comment accuracy: updated from incorrect ±0.005° to "100 discrete steps of 0.0001° (0.0000..0.0099°, ~0..1.1 km)"
Added dedicated GitHub Actions workflow (geo-point-clustering-dbscan-kmeans-ci.yml) covering all 10 acceptance criteria (AC-1 through AC-10)
Fixed workflow issue reference: Issue-1747 → Issue-4003 (roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means)
Added timeout 300 to the unified binary test step to prevent CI stalls
Updated workflow AC-9/AC-10 comments to note THEMIS_RUN_PERF_TESTS=1 opt-in requirement
Updated test file header metadata (version 0.0.5, 490 lines)

Type of Change

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed

📚 Research & Knowledge (wenn applicable)

Diese PR basiert auf wissenschaftlichen Paper(s) oder Best Practices?
- Falls JA: Research-Dateien in /docs/research/ angelegt?
- Falls JA: Im Modul-README unter "Wissenschaftliche Grundlagen" verlinkt?
- Falls JA: In /docs/research/implementation_influence/ eingetragen?

Relevante Quellen:

Paper: Ester et al. (1996), KDD-96 — DBSCAN
Paper: Lloyd (1982), IEEE Trans. Inf. Theory 28(2) — k-means
Paper: Arthur & Vassilvitskii (2007), SODA '07 — k-means++
Best Practice:
Architecture Decision:

Checklist

Code follows project style guidelines
Self-review completed
Documentation updated (if needed)
No new warnings introduced

Original prompt

This section details on the original issue you should resolve

<issue_title>Geo Point Clustering: DBSCAN and k-means</issue_title>
<issue_description>### Context

This issue implements the roadmap item 'Geo Point Clustering: DBSCAN and k-means' for the geo domain. It is sourced from the consolidated roadmap under 🟡 Medium Priority — Near-term (v1.5.0 – v1.8.0) and targets milestone v1.8.0.

Primary detail section: Geo Point Clustering: DBSCAN and k-means

Goal

Deliver the scoped changes for Geo Point Clustering: DBSCAN and k-means in src/geo/ and complete the linked detail section in a release-ready state for v1.8.0.

Detailed Scope

Geo Point Clustering: DBSCAN and k-means

Priority: Medium
Target Version: v2.4.0
Status: ✅ Implemented in include/geo/geo_clustering.h + src/geo/geo_clustering.cpp

What was implemented:

dbscanCluster(points, DbscanConfig) — density-based spatial clustering:
- Haversine distance for all pairwise neighbour queries
- Noise points receive label kDbscanNoise (-1)
- Non-Point geometries are silently assigned noise label
- O(n²) complexity; suitable for collections up to ~50 000 points
kmeansCluster(points, KMeansConfig) — Lloyd's algorithm:
- Deterministic initialisation (first k distinct points, seed == 0) or
  k-means++ probabilistic seeding (seed != 0, LCG PRNG)
- Centroid updates via arithmetic mean of (lon, lat) — valid for clusters
  spanning < a few hundred kilometres
- Convergence check: stops early when all centroid shifts ≤ tolerance_m
- Non-Point geometries receive label -1 and are excluded from centroid
  computation
- Throws std::invalid_argument when k == 0 or k > valid point count
20 unit tests in tests/geo/test_geo_clustering.cpp

Performance Targets (design):

DBSCAN: 10 000 points at 500 m epsilon in ≤ 5 s single-threaded (CPU).
k-means: k=10, 100 000 points, 100 iterations in ≤ 2 s single-threaded (CPU).

Scientific References:

[1] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507
[2] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
[3] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494

Acceptance Criteria

Relationships

Roadmap row: Optimize PDF generation with compression and complete metadata #172 (🟡 Medium Priority — Near-term (v1.5.0 – v1.8.0))
Depends on: none identified during generation.
Part of: consolidated roadmap delivery tracking.

References

src/ROADMAP.md
src/geo/FUTURE_ENHANCEMENTS.md#geo-point-clustering-dbscan-and-k-means
Source key: roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means

Generated from the consolidated source roadmap. Keep the roadmap and issue in sync wh...

Fixes Geo Point Clustering: DBSCAN and k-means #4003

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

#4003) Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

makr-code · 2026-03-17T05:40:40Z

@copilot retry

Copilot · 2026-03-17T05:42:12Z

@copilot retry

The PR is already complete — all 20 tests are in place (including the AC-9 DBSCAN performance test and AC-10 k-means performance test added in commit 929c049) and the CI workflow covers all 10 acceptance criteria. Code review and CodeQL both come back clean. No further changes needed.

Copilot

Pull request overview

Adds CI coverage for geo point clustering performance targets by introducing two new performance tests (DBSCAN and k-means) and a dedicated GitHub Actions workflow to run the clustering test suite.

Changes:

Added DBSCAN (10k points) and k-means (100k points, k=10) performance tests to meet AC-9/AC-10.
Added a dedicated GitHub Actions workflow to run focused geo clustering tests and upload artifacts.
Updated test file header metadata and workflow acceptance-criteria comment block.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
tests/geo/test_geo_clustering.cpp	Adds two performance tests and required includes/header metadata updates.
.github/workflows/geo-point-clustering-dbscan-kmeans-ci.yml	Introduces a dedicated workflow to run clustering tests and publish results.

You can also share your feedback on Copilot code review. Take the survey.

tests/geo/test_geo_clustering.cpp

+TEST(KMeansCluster, Performance_100kPoints_K10_Under2Seconds) {
+    // Generate 100 000 points spread along 10 well-separated bands (one per
+    // cluster), each 10 000 points wide, offset by 2° in longitude.
+    // This ensures the algorithm produces exactly 10 stable clusters.
+    constexpr int kPointsPerCluster = 10000;
+    constexpr int kK                = 10;
+    std::vector<GeometryInfo> pts;
+    pts.reserve(static_cast<std::size_t>(kPointsPerCluster * kK));
+    for (int c = 0; c < kK; ++c) {
+        const double base_lon = static_cast<double>(c) * 2.0; // 2° separation
+        for (int j = 0; j < kPointsPerCluster; ++j) {
+            // Tiny jitter within ±0.005° (~500 m) to avoid identical points.
+            const double jitter = (j % 100) * 0.0001;
+            pts.push_back(makePoint(base_lon + jitter, 48.0 + jitter));
+        }
+    }


tests/geo/test_geo_clustering.cpp

+        const double base_lon = static_cast<double>(c) * 2.0; // 2° separation
+        for (int j = 0; j < kPointsPerCluster; ++j) {
+            // Tiny jitter within ±0.005° (~500 m) to avoid identical points.
+            const double jitter = (j % 100) * 0.0001;


tests/geo/test_geo_clustering.cpp

+    // Performance target: ≤ 5 s single-threaded (design spec, not a hard CI
+    // gate, but flagged as a test failure to surface regressions).
+    EXPECT_LE(elapsed_s, 5.0)
+        << "DBSCAN on 10 000 points exceeded 5 s target (took "
+        << elapsed_s << " s)";


.github/workflows/geo-point-clustering-dbscan-kmeans-ci.yml

+# Triggered on every push / PR that touches the clustering sources,
+# its test file, or this workflow itself.
+# Targets v1.8.0 (roadmap:geo:phase4:clustering:Issue-1747).


.github/workflows/geo-point-clustering-dbscan-kmeans-ci.yml

+        run: |
+          set -o pipefail
+          cd build
+          ./tests/themis_tests \


makr-code · 2026-03-17T06:01:22Z

@copilot Review Kommentare umsetzten

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

Copilot · 2026-03-17T06:12:19Z

@copilot Review Kommentare umsetzten

All 5 review comments addressed in commit c58d4f2:

Perf test flakiness — both DbscanCluster.Performance_10kPoints_Under5Seconds and KMeansCluster.Performance_100kPoints_K10_Under2Seconds now check THEMIS_RUN_PERF_TESTS=1 and call GTEST_SKIP() otherwise (matching the existing pattern in test_ingestion_kafka.cpp).
k-means seed=0 reliability — switched point generation to interleaved order (j outer, c inner) so the first 10 points span all 10 clusters, ensuring deterministic init picks one centroid per cluster and EXPECT_EQ(num_clusters, kK) is reliable.
Jitter comment — updated to "100 discrete steps of 0.0001° (0.0000..0.0099°, ~0..1.1 km)" reflecting the actual discrete range.
Workflow issue reference — Issue-1747 → roadmap:172:geo:v1.8.0:geo-point-clustering-dbscan-and-k-means | Issue-4003.
Workflow timeout — added timeout 300 to the unified binary step so stalls surface quickly. AC-9/AC-10 workflow comments updated to note the THEMIS_RUN_PERF_TESTS=1 opt-in.

Initial plan

a0841ca

Copilot AI assigned Copilot and makr-code Mar 16, 2026

Copilot stopped work on behalf of makr-code due to an error March 16, 2026 14:41
too many agent jobs in progress

Copilot stopped work on behalf of makr-code due to an error March 16, 2026 16:21
too many agent jobs in progress

Copilot started work on behalf of makr-code March 16, 2026 16:41 View session

feat(geo): add CI workflow for DBSCAN/k-means geo point clustering

b6e969c

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] [v1.8.0] Implement geo point clustering with DBSCAN and k-means~~ feat(geo): add CI workflow for DBSCAN/k-means geo point clustering Mar 16, 2026

Copilot AI requested a review from makr-code March 16, 2026 16:51

Copilot finished work on behalf of makr-code March 16, 2026 16:51

Copilot started work on behalf of makr-code March 16, 2026 17:15 View session

Copilot started work on behalf of makr-code March 16, 2026 17:32 View session

test(geo): add 2 performance tests to reach required count of 20 (Issue

929c049

#4003) Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

Copilot started work on behalf of makr-code March 16, 2026 19:25 View session

Copilot started work on behalf of makr-code March 16, 2026 20:21 View session

Copilot started work on behalf of makr-code March 16, 2026 21:39 View session

makr-code marked this pull request as ready for review March 17, 2026 05:40

Copilot started work on behalf of makr-code March 17, 2026 05:40 View session

Copilot finished work on behalf of makr-code March 17, 2026 05:42

makr-code requested a review from Copilot March 17, 2026 05:47

github-actions bot added the type:test Test additions, improvements, or fixes label Mar 17, 2026

Copilot AI reviewed Mar 17, 2026

View reviewed changes

Copilot started reviewing on behalf of makr-code March 17, 2026 05:53 View session

Copilot started work on behalf of makr-code March 17, 2026 06:01 View session

fix(geo): address review feedback on perf tests and CI workflow

c58d4f2

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

Copilot finished work on behalf of makr-code March 17, 2026 06:13

makr-code merged commit c0e5600 into develop Mar 17, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(geo): add CI workflow for DBSCAN/k-means geo point clustering#4298

feat(geo): add CI workflow for DBSCAN/k-means geo point clustering#4298
makr-code merged 4 commits intodevelopfrom
copilot/geo-point-clustering-dbscan-kmeans

Copilot AI commented Mar 16, 2026 •

edited

Loading

Uh oh!

makr-code commented Mar 17, 2026

Uh oh!

Copilot AI commented Mar 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

makr-code commented Mar 17, 2026

Uh oh!

Copilot AI commented Mar 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Copilot AI commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Testing

📚 Research & Knowledge (wenn applicable)

Checklist

Goal

Detailed Scope

Geo Point Clustering: DBSCAN and k-means

Acceptance Criteria

Relationships

References

Uh oh!

makr-code commented Mar 17, 2026

Uh oh!

Copilot AI commented Mar 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

makr-code commented Mar 17, 2026

Uh oh!

Copilot AI commented Mar 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Mar 16, 2026 •

edited

Loading