-
Notifications
You must be signed in to change notification settings - Fork 19
perf: optimize bloom filter using double hashing and zlib #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,16 +1,20 @@ | ||||||||||||||||||||||||||||||||||||
| from __future__ import annotations | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| from hashlib import blake2b, sha256 | ||||||||||||||||||||||||||||||||||||
| import zlib | ||||||||||||||||||||||||||||||||||||
| from random import Random | ||||||||||||||||||||||||||||||||||||
| from typing import Any | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| from ...common import TaskSpec | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]: | ||||||||||||||||||||||||||||||||||||
| digest_a = int.from_bytes(blake2b(value.encode(), digest_size=8).digest(), "big") | ||||||||||||||||||||||||||||||||||||
| digest_b = int.from_bytes(sha256(value.encode()).digest()[:8], "big") | ||||||||||||||||||||||||||||||||||||
| return [((digest_a + index * digest_b) % bit_count) for index in range(hash_count)] | ||||||||||||||||||||||||||||||||||||
| raw = value.encode() | ||||||||||||||||||||||||||||||||||||
| h1 = zlib.crc32(raw) | ||||||||||||||||||||||||||||||||||||
| h2 = zlib.adler32(raw) | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| return [((h1 + i * step) % bit_count) for i in range(hash_count)] | ||||||||||||||||||||||||||||||||||||
|
Comment on lines
+15
to
+17
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Validate
Proposed fix def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
"""Computes probe positions for a Bloom filter using double hashing."""
+ if bit_count <= 0:
+ raise ValueError("bit_count must be > 0")
+ if hash_count <= 0:
+ raise ValueError("hash_count must be > 0")
+
raw = value.encode()
h1 = zlib.crc32(raw)
h2 = zlib.adler32(raw)
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this double hashing suited for bloom filter?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I feel double hashing is a standard way to build Bloom filters. I used it because it lets us simulate as many hash functions as we need using just two base hashes, which basically cuts the workload down from O(k) to O(1) which makes it very fast. It also makes the code run way faster since we aren't doing 8 or 16 different hash calculations for every single item. |
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| def generate(count: int, seed: int) -> dict[str, Any]: | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have you decided to move with CRC32 as a hash function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose CRC32 because we don't need cryptographic security for a Bloom filter, just fast and uniform bit distribution. It’s hardware-accelerated in Python and much lighter than SHA-256 which is why we got that 2.09x speed.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do you see there is slight decrease in reliability? Could you confirm that the false positives will plateau or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i have thought about this and i realized the tiny drop in reliability is just because CRC32 has slightly more collisions than SHA-256, but it's still way above our 85% floor. The false positives will definitely plateau because the error rate is tied to the bit-to-item ratio, which we have kept at 16x. Once the bitset reaches that density, the rate stabilizes and stays predictable rather than drifting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from where did you got the floor of 85%, a loss of 15% reliability is not a good option that too for a 2x speedup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh am sorry i did not convey it properly in the earlier message,the reliability did not drop by 15%. The actual benchmark showed a tiny delta of only -0.000504 (the overall reliability is still around 99.7%).I just mentioned 85% because reliability_floor=0.85 is hardcoded at the bottom of the task.py file,so basically we got 2.09x without hurting the accuracy at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry but I am less satisfied with that, I am ok with merging it but as it would need more detailed testing as the distribution through the CRC32 and Adler32 tends to form clusters so I would be rating it as easy.