perf: optimize bloom filter using double hashing and zlib#36
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThe change replaces cryptographic hash functions ( Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
chuck/tasks/memory_index/task.py (1)
13-14: Avoid zero-step double hashing to prevent degenerate single-bit probing.When
h2 % bit_count == 0, allhash_countprobes collapse to one position for that key. A non-zero normalized step keeps probe spread consistent.♻️ Suggested change
def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]: raw = value.encode() h1 = zlib.crc32(raw) h2 = zlib.adler32(raw) - return [((h1 + i * h2) % bit_count) for i in range(hash_count)] + step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 + return [((h1 + i * step) % bit_count) for i in range(hash_count)]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/memory_index/task.py` around lines 13 - 14, The double-hash probe currently computes h2 = zlib.adler32(raw) and uses i * h2 directly, which can produce a zero step when h2 % bit_count == 0 and collapse all probes; change the normalization so the step is always non-zero (e.g., compute a step = (h2 % (bit_count - 1)) + 1 or, at minimum, if (h2 % bit_count) == 0 set step = 1) and then generate probes using ((h1 + i * step) % bit_count) for i in range(hash_count) to ensure probing spreads across multiple positions (update the code around h1, h2, bit_count, hash_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Line 79: Add a single trailing newline at the end of the file to satisfy
pre-commit's end-of-file-fixer; locate the file that ends with the lone closing
parenthesis ")" (the end of the block/function in task.py) and ensure the file
ends with a newline character so CI stops reporting the EOF-fixer failure.
- Around line 12-14: The change to the hashing probe order (using crc32/adler32
via the code in chuck/tasks/memory_index/task.py that computes h1/h2 and returns
the probe list) alters probe distribution and will break the strict equality
check in chuck/regression.py against data/memory_index/regression.json; fix by
either regenerating the expected metrics in data/memory_index/regression.json to
match the new hashing outputs produced by the memory_index task, or modify
chuck/regression.py to compare probabilistic or derived fields with a numeric
tolerance (instead of strict equality) for the affected fields produced by the
probe/hash logic.
---
Nitpick comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 13-14: The double-hash probe currently computes h2 =
zlib.adler32(raw) and uses i * h2 directly, which can produce a zero step when
h2 % bit_count == 0 and collapse all probes; change the normalization so the
step is always non-zero (e.g., compute a step = (h2 % (bit_count - 1)) + 1 or,
at minimum, if (h2 % bit_count) == 0 set step = 1) and then generate probes
using ((h1 + i * step) % bit_count) for i in range(hash_count) to ensure probing
spreads across multiple positions (update the code around h1, h2, bit_count,
hash_count).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 82ad2d49-e080-4769-97c9-bdeaef0a2e3d
📒 Files selected for processing (1)
chuck/tasks/memory_index/task.py
ee3f190 to
2e273af
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 16-18: Validate and reject invalid parameters before computing
probe positions: check that bit_count > 0 and hash_count > 0 in the function
that computes the probe list (the code using bit_count, hash_count, h1, h2 and
producing the list comprehension), and if either is invalid raise a ValueError
(or return a clear error) so you don't perform modulo by zero or return an empty
list; update callers (e.g., solve) to handle the error if necessary and ensure
the step computation and list comprehension only run when bit_count > 0 and
hash_count > 0.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 96835de0-18cd-45f3-820a-2bfa7a5d454b
📒 Files selected for processing (1)
chuck/tasks/memory_index/task.py
| step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 | ||
|
|
||
| return [((h1 + i * step) % bit_count) for i in range(hash_count)] |
There was a problem hiding this comment.
Validate bit_count and hash_count before generating probe positions.
bit_count <= 0 can break modulo math, and hash_count <= 0 yields an empty probe list, which makes all(...) true for every query in solve and inflates positives.
Proposed fix
def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
"""Computes probe positions for a Bloom filter using double hashing."""
+ if bit_count <= 0:
+ raise ValueError("bit_count must be > 0")
+ if hash_count <= 0:
+ raise ValueError("hash_count must be > 0")
+
raw = value.encode()
h1 = zlib.crc32(raw)
h2 = zlib.adler32(raw)
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 | |
| return [((h1 + i * step) % bit_count) for i in range(hash_count)] | |
| def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]: | |
| """Computes probe positions for a Bloom filter using double hashing.""" | |
| if bit_count <= 0: | |
| raise ValueError("bit_count must be > 0") | |
| if hash_count <= 0: | |
| raise ValueError("hash_count must be > 0") | |
| raw = value.encode() | |
| h1 = zlib.crc32(raw) | |
| h2 = zlib.adler32(raw) | |
| step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 | |
| return [((h1 + i * step) % bit_count) for i in range(hash_count)] |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chuck/tasks/memory_index/task.py` around lines 16 - 18, Validate and reject
invalid parameters before computing probe positions: check that bit_count > 0
and hash_count > 0 in the function that computes the probe list (the code using
bit_count, hash_count, h1, h2 and producing the list comprehension), and if
either is invalid raise a ValueError (or return a clear error) so you don't
perform modulo by zero or return an empty list; update callers (e.g., solve) to
handle the error if necessary and ensure the step computation and list
comprehension only run when bit_count > 0 and hash_count > 0.
|
Please go through the contributors guideline properly. Specifically, https://github.com/iiitl/chuck/blob/main/CONTRIBUTING.md#ab-comparison-workflow |
|
A/B Comparison Results (as per contributing guidelines):
memory_index: 2.09x speedup (52.23% faster)
Comparison command used: |
| return [((digest_a + index * digest_b) % bit_count) for index in range(hash_count)] | ||
| """Computes probe positions for a Bloom filter using double hashing.""" | ||
| raw = value.encode() | ||
| h1 = zlib.crc32(raw) |
There was a problem hiding this comment.
Why have you decided to move with CRC32 as a hash function?
There was a problem hiding this comment.
I chose CRC32 because we don't need cryptographic security for a Bloom filter, just fast and uniform bit distribution. It’s hardware-accelerated in Python and much lighter than SHA-256 which is why we got that 2.09x speed.
There was a problem hiding this comment.
But do you see there is slight decrease in reliability? Could you confirm that the false positives will plateau or not.
There was a problem hiding this comment.
Yeah i have thought about this and i realized the tiny drop in reliability is just because CRC32 has slightly more collisions than SHA-256, but it's still way above our 85% floor. The false positives will definitely plateau because the error rate is tied to the bit-to-item ratio, which we have kept at 16x. Once the bitset reaches that density, the rate stabilizes and stays predictable rather than drifting.
There was a problem hiding this comment.
from where did you got the floor of 85%, a loss of 15% reliability is not a good option that too for a 2x speedup.
There was a problem hiding this comment.
Oh am sorry i did not convey it properly in the earlier message,the reliability did not drop by 15%. The actual benchmark showed a tiny delta of only -0.000504 (the overall reliability is still around 99.7%).I just mentioned 85% because reliability_floor=0.85 is hardcoded at the bottom of the task.py file,so basically we got 2.09x without hurting the accuracy at all.
There was a problem hiding this comment.
Sorry but I am less satisfied with that, I am ok with merging it but as it would need more detailed testing as the distribution through the CRC32 and Adler32 tends to form clusters so I would be rating it as easy.
|
|
||
| step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 | ||
|
|
||
| return [((h1 + i * step) % bit_count) for i in range(hash_count)] |
There was a problem hiding this comment.
Is this double hashing suited for bloom filter?
There was a problem hiding this comment.
Yeah, I feel double hashing is a standard way to build Bloom filters. I used it because it lets us simulate as many hash functions as we need using just two base hashes, which basically cuts the workload down from O(k) to O(1) which makes it very fast. It also makes the code run way faster since we aren't doing 8 or 16 different hash calculations for every single item.
|
Please add Fixes #Issue-Number in your PR description so that the issue gets linked to the PR. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
chuck/tasks/memory_index/task.py (1)
12-18:⚠️ Potential issue | 🟠 MajorValidate Bloom filter parameters before probe generation.
bit_count <= 0can crash modulo math, andhash_count <= 0yields an empty probe list that makesall(...)insolvealways true for probes.Proposed fix
def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]: - + if bit_count <= 0: + raise ValueError("bit_count must be > 0") + if hash_count <= 0: + raise ValueError("hash_count must be > 0") + raw = value.encode() h1 = zlib.crc32(raw) h2 = zlib.adler32(raw) step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 return [((h1 + i * step) % bit_count) for i in range(hash_count)]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/memory_index/task.py` around lines 12 - 18, The probe-generation code doesn't validate inputs and can crash or produce an empty probe list; add validation at the start of the probe-generation function (the function that encodes value and computes h1, h2, step and returns the list of probes) to ensure bit_count > 0 and hash_count > 0 (and that both are integers), and raise a clear ValueError if they are invalid; keep the existing step logic (1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1) and then compute and return the probes as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 12-18: The probe-generation code doesn't validate inputs and can
crash or produce an empty probe list; add validation at the start of the
probe-generation function (the function that encodes value and computes h1, h2,
step and returns the list of probes) to ensure bit_count > 0 and hash_count > 0
(and that both are integers), and raise a clear ValueError if they are invalid;
keep the existing step logic (1 + (h2 % (bit_count - 1)) if bit_count > 1 else
1) and then compute and return the probes as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ad2b0ba0-9e35-4e24-be45-9e762770de62
📒 Files selected for processing (1)
chuck/tasks/memory_index/task.py
|
Thanks for the PR! |

Optimized the Bloom Filter by swapping slow cryptographic hashes (Blake2b/SHA256) for hardware-accelerated CRC32 and Adler32.
Key improvements:
Logic verified with pytest; false positive rates remain within the reliability floor.

Summary by CodeRabbit
Fixes #7