Optimize inverse index mapping in groups.py (Fixes Issue #3387)#5252
Optimize inverse index mapping in groups.py (Fixes Issue #3387)#5252Ayush-Agarwal-G1THUB wants to merge 3 commits intoMDAnalysis:developfrom
Conversation
There was a problem hiding this comment.
Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5252 +/- ##
===========================================
+ Coverage 93.83% 93.85% +0.01%
===========================================
Files 180 180
Lines 22473 22471 -2
Branches 3189 3188 -1
===========================================
+ Hits 21088 21090 +2
+ Misses 923 920 -3
+ Partials 462 461 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hello! To any reviewers reading this This is my first contribution at this scale, so I’d really appreciate feedback on both the implementation and whether this PR follows the contributor guidelines. Thank you! |
Fixes #3387
Summary
This PR replaces the O(n²) inverse index reconstruction logic in groups.py with an O(n) implementation written in Cython.
The previous implementation used np.where inside a loop over unique indices, leading to quadratic complexity. The approach in this PR uses a dictionary-based lookup implemented in lib/_cutil.pyx, where the indices of the unique values in
indicesarray are added to a map. Thenself.ixarray is iterated over only once and the map is consulted to get the index from the uniqe value array.This means that the
self.ixarray is traversed in a single pass, making it linear time complexity.Changes made in this Pull Request:
LLM / AI generated code disclosure
LLMs or other AI-powered tools (beyond simple IDE use cases) were used in this contribution: no
Benchmark
For an array of size 1,000,000 with ~5000 unique values, following speedup was observed :
Python: ~0.41s
Cython: ~0.11s
Speedup: ~3.64 times
(these values are from local testing)
PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5252.org.readthedocs.build/en/5252/