feat: add periodic boundary condition queries for float trees#297
feat: add periodic boundary condition queries for float trees#297helmutcarter wants to merge 5 commits intosdd:masterfrom
Conversation
…performance penalty
|
Hi @helmutcarter. Thanks for this PR - the feature request / issue on which it is based has been open for a number of years and I know there are certainly others that are keen to see this functionality! I've been travelling quite a bit these past couple of weeks so apologies for the delay in looking at this. It is quite a large PR so I'll need a bit of time to review. At first glance it looks as though there would be no impact to other existing users since the new PBC-related code does not touch the underlying data structures and is contained within the new query methods, which is good, but I'd also like to spend a bit of time this week to see if I can improve the performance. Thanks again! |
|
Hi @sdd thanks for taking a look! I was thinking more about the API this weekend and I was wondering if it would be better to have a periodic flag in existing functions, instead of creating new periodic versions of them. I don't know much about software design or best practices here, so please let me know what you think. Regarding performance, this is definitely a naive first-pass implementation. I had a few thoughts on improving performance, but I'm curious to see what you come up with! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #297 +/- ##
==========================================
+ Coverage 94.88% 95.56% +0.68%
==========================================
Files 54 54
Lines 5705 7032 +1327
Branches 5705 7032 +1327
==========================================
+ Hits 5413 6720 +1307
- Misses 274 289 +15
- Partials 18 23 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| .map(|(item, distance)| NearestNeighbour { distance, item }) | ||
| .collect(); | ||
|
|
||
| results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap()); |
There was a problem hiding this comment.
We can get better performance here by only sorting the results as far as qty. We don't care if the items beyond qty are sorted - only that their distance is > the closest quantity items. Also the unstable sort variants are more appropriate here as we don't care about preserving order of items with the same distance from the raw unsorted results.
You can do this by using a variant of select_nth_unstable. This ensures that the item in position n in the result is in it's final sorted position.
Older versions of select_nth_unstable had the side-effect that the "closer" side of the result array was also sorted after select_nth_unstable exited. I'm not 100% sure that this is still the case with the new ipnsort algorithm that Rust's standard library uses, so we may need to sort the "closer" side again after applying select_nth_unstable.
Something like this should be an improvement:
results.select_nth_unstable_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
results.truncate(qty);
// this step may not be needed if select_nth_unstable_by
// has the side-effect of sorting the closer side
results.sort_unstable_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
results| { | ||
| if axis == K { | ||
| for candidate in self.nearest_n::<D>(wrapped_query, qty) { | ||
| best_by_item |
There was a problem hiding this comment.
I suspect that this HashMap based approach is causing some performance issues here.
An alternative we could try here would be:
- perform all 3^K queries but keep them as-is, ie simple lists of
NearestNeighbourobjects - concatenate the lists and then sort unstable by item and then dist, so we get all results from repeated items adjacent to each other, with multiple results for the same item in ascending distance order
- walk across the vec, copying later items on top of earlier items to eliminate the more distant images for the same item. Something like this (I've not tested this, or even tried to compile it!):
results.sort_unstable_by_key(|a|a.item); // EDIT: this closure needs to use sort_by, not sort_by_key, and to cmp by distance when a.item == b.item
let mut from_idx = 1;
let mut to_idx = 1;
let curr_item = result[0].item;
while(from_idx < result.len()) {
if result[to_idx].item != curr_item {
result[from_idx] = result[to_idx];
curr_item = result[to_idx].item;
to_idx += 1;
}
from_idx += 1;
}
results.truncate(to_idx);Then try the same as the previous comment:
results.select_nth_unstable_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
results.truncate(qty);
// this step may not be needed if select_nth_unstable_by
// has the side-effect of sorting the closer side
results.sort_unstable_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
resultsI don't know for certain but my hunch is that this would be faster than the hashmap-based approach. Certainly worth a try.
Resolves #34 by adding periodic boundary condition support for float-tree queries on
KdTreeandImmutableKdTree.New periodic APIs:
nearest_one_periodicnearest_n_periodicwithin_periodicwithin_unsorted_periodicnearest_n_within_periodicIt also adds:
examples/READMEanddocs/Implementation
This is a correctness-first implementation. It is pretty slow.
Periodic queries work by evaluating wrapped query images, reusing the existing non-periodic query logic, then
merging duplicate hits and keeping the minimum distance per item.
Contract
Current assumptions:
Performance
The performance penalty is large, and grows quickly with increasing number of dimensions. With the current naive approach,
periodic query complexity is O(3^K). I think I know how to reduce this to O(2^K), but a pruning-aware periodic search may
obviate the need for optimizing the number of queries of periodic images.
Benchmark: