Skip to content

perf: optimize bloom filter using double hashing and zlib#36

Merged
Aaryan-Dadu merged 3 commits into
iiitl:devfrom
Phantom0299:perf/bloom-filter-optimization
Apr 12, 2026
Merged

perf: optimize bloom filter using double hashing and zlib#36
Aaryan-Dadu merged 3 commits into
iiitl:devfrom
Phantom0299:perf/bloom-filter-optimization

Conversation

@Phantom0299
Copy link
Copy Markdown

@Phantom0299 Phantom0299 commented Apr 11, 2026

Optimized the Bloom Filter by swapping slow cryptographic hashes (Blake2b/SHA256) for hardware-accelerated CRC32 and Adler32.

Key improvements:

  • Implemented Double Hashing ($h_i(x) = (h_1(x) + i \cdot h_2(x)) \pmod m$) to maintain accuracy while significantly reducing CPU overhead.
  • Minimized redundant work by encoding input strings only once per item.
  • Verified results via benchmark: processed 10,000 items in ~0.094s (approx. 106k ops/sec).

Logic verified with pytest; false positive rates remain within the reliability floor.
Screenshot 2026-04-11 234355

Summary by CodeRabbit

  • Refactor
    • Swapped the internal hashing mechanism used by in-memory indexing for a faster checksum-based approach, reducing CPU during index operations.
    • Preserved all public interfaces, outputs, and task behavior so user-facing functionality and compatibility remain unchanged.

Fixes #7

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The change replaces cryptographic hash functions (blake2b, sha256) with non-cryptographic checksums (zlib.crc32, zlib.adler32) in the Bloom-filter probe-position computation, deriving multiple probe positions from two 32-bit checksums via linear combination modulo bit count.

Changes

Cohort / File(s) Summary
Bloom Filter Hash Replacement
chuck/tasks/memory_index/task.py
Replaced cryptographic digests with zlib.crc32 and zlib.adler32; _bloom_hashes now encodes input once, derives two 32-bit hashes, computes a dynamic step (1 + (h2 % (bit_count - 1)) when bit_count > 1), and yields hash_count probe positions via ((h1 + i * step) % bit_count). Imports adjusted; generate/solve control flow and signatures unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: replacing cryptographic hashes with zlib-based non-cryptographic hashing and implementing double hashing for the Bloom filter to improve performance.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
chuck/tasks/memory_index/task.py (1)

13-14: Avoid zero-step double hashing to prevent degenerate single-bit probing.

When h2 % bit_count == 0, all hash_count probes collapse to one position for that key. A non-zero normalized step keeps probe spread consistent.

♻️ Suggested change
 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)
-    return [((h1 + i * h2) % bit_count) for i in range(hash_count)]
+    step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
+    return [((h1 + i * step) % bit_count) for i in range(hash_count)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 13 - 14, The double-hash probe
currently computes h2 = zlib.adler32(raw) and uses i * h2 directly, which can
produce a zero step when h2 % bit_count == 0 and collapse all probes; change the
normalization so the step is always non-zero (e.g., compute a step = (h2 %
(bit_count - 1)) + 1 or, at minimum, if (h2 % bit_count) == 0 set step = 1) and
then generate probes using ((h1 + i * step) % bit_count) for i in
range(hash_count) to ensure probing spreads across multiple positions (update
the code around h1, h2, bit_count, hash_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Line 79: Add a single trailing newline at the end of the file to satisfy
pre-commit's end-of-file-fixer; locate the file that ends with the lone closing
parenthesis ")" (the end of the block/function in task.py) and ensure the file
ends with a newline character so CI stops reporting the EOF-fixer failure.
- Around line 12-14: The change to the hashing probe order (using crc32/adler32
via the code in chuck/tasks/memory_index/task.py that computes h1/h2 and returns
the probe list) alters probe distribution and will break the strict equality
check in chuck/regression.py against data/memory_index/regression.json; fix by
either regenerating the expected metrics in data/memory_index/regression.json to
match the new hashing outputs produced by the memory_index task, or modify
chuck/regression.py to compare probabilistic or derived fields with a numeric
tolerance (instead of strict equality) for the affected fields produced by the
probe/hash logic.

---

Nitpick comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 13-14: The double-hash probe currently computes h2 =
zlib.adler32(raw) and uses i * h2 directly, which can produce a zero step when
h2 % bit_count == 0 and collapse all probes; change the normalization so the
step is always non-zero (e.g., compute a step = (h2 % (bit_count - 1)) + 1 or,
at minimum, if (h2 % bit_count) == 0 set step = 1) and then generate probes
using ((h1 + i * step) % bit_count) for i in range(hash_count) to ensure probing
spreads across multiple positions (update the code around h1, h2, bit_count,
hash_count).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82ad2d49-e080-4769-97c9-bdeaef0a2e3d

📥 Commits

Reviewing files that changed from the base of the PR and between a302e5f and a0f1bff.

📒 Files selected for processing (1)
  • chuck/tasks/memory_index/task.py

Comment thread chuck/tasks/memory_index/task.py Outdated
Comment thread chuck/tasks/memory_index/task.py Outdated
@Phantom0299 Phantom0299 force-pushed the perf/bloom-filter-optimization branch from ee3f190 to 2e273af Compare April 11, 2026 18:55
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 16-18: Validate and reject invalid parameters before computing
probe positions: check that bit_count > 0 and hash_count > 0 in the function
that computes the probe list (the code using bit_count, hash_count, h1, h2 and
producing the list comprehension), and if either is invalid raise a ValueError
(or return a clear error) so you don't perform modulo by zero or return an empty
list; update callers (e.g., solve) to handle the error if necessary and ensure
the step computation and list comprehension only run when bit_count > 0 and
hash_count > 0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 96835de0-18cd-45f3-820a-2bfa7a5d454b

📥 Commits

Reviewing files that changed from the base of the PR and between 57cb7cd and 2e273af.

📒 Files selected for processing (1)
  • chuck/tasks/memory_index/task.py

Comment on lines +16 to +18
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

return [((h1 + i * step) % bit_count) for i in range(hash_count)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate bit_count and hash_count before generating probe positions.

bit_count <= 0 can break modulo math, and hash_count <= 0 yields an empty probe list, which makes all(...) true for every query in solve and inflates positives.

Proposed fix
 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
     """Computes probe positions for a Bloom filter using double hashing."""
+    if bit_count <= 0:
+        raise ValueError("bit_count must be > 0")
+    if hash_count <= 0:
+        raise ValueError("hash_count must be > 0")
+
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)

     step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

     return [((h1 + i * step) % bit_count) for i in range(hash_count)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]
def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
"""Computes probe positions for a Bloom filter using double hashing."""
if bit_count <= 0:
raise ValueError("bit_count must be > 0")
if hash_count <= 0:
raise ValueError("hash_count must be > 0")
raw = value.encode()
h1 = zlib.crc32(raw)
h2 = zlib.adler32(raw)
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 16 - 18, Validate and reject
invalid parameters before computing probe positions: check that bit_count > 0
and hash_count > 0 in the function that computes the probe list (the code using
bit_count, hash_count, h1, h2 and producing the list comprehension), and if
either is invalid raise a ValueError (or return a clear error) so you don't
perform modulo by zero or return an empty list; update callers (e.g., solve) to
handle the error if necessary and ensure the step computation and list
comprehension only run when bit_count > 0 and hash_count > 0.

@Aaryan-Dadu
Copy link
Copy Markdown
Member

Please go through the contributors guideline properly. Specifically, https://github.com/iiitl/chuck/blob/main/CONTRIBUTING.md#ab-comparison-workflow

@Phantom0299
Copy link
Copy Markdown
Author

A/B Comparison Results (as per contributing guidelines):

  • old snapshot: original SHA256/Blake2b implementation
  • new snapshot: optimized CRC32/Adler32 with double hashing

memory_index: 2.09x speedup (52.23% faster)

  • before: 0.050243s
  • after: 0.024003s
  • reliability delta: -0.000504 (within acceptable floor)

All other tasks unaffected.
image

Comparison command used:
python -m chuck compare --old data/reports/snapshots/old.json --new data/reports/snapshots/current.json

return [((digest_a + index * digest_b) % bit_count) for index in range(hash_count)]
"""Computes probe positions for a Bloom filter using double hashing."""
raw = value.encode()
h1 = zlib.crc32(raw)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have you decided to move with CRC32 as a hash function?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose CRC32 because we don't need cryptographic security for a Bloom filter, just fast and uniform bit distribution. It’s hardware-accelerated in Python and much lighter than SHA-256 which is why we got that 2.09x speed.

Copy link
Copy Markdown
Member

@Aaryan-Dadu Aaryan-Dadu Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you see there is slight decrease in reliability? Could you confirm that the false positives will plateau or not.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i have thought about this and i realized the tiny drop in reliability is just because CRC32 has slightly more collisions than SHA-256, but it's still way above our 85% floor. The false positives will definitely plateau because the error rate is tied to the bit-to-item ratio, which we have kept at 16x. Once the bitset reaches that density, the rate stabilizes and stays predictable rather than drifting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from where did you got the floor of 85%, a loss of 15% reliability is not a good option that too for a 2x speedup.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh am sorry i did not convey it properly in the earlier message,the reliability did not drop by 15%. The actual benchmark showed a tiny delta of only -0.000504 (the overall reliability is still around 99.7%).I just mentioned 85% because reliability_floor=0.85 is hardcoded at the bottom of the task.py file,so basically we got 2.09x without hurting the accuracy at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I am less satisfied with that, I am ok with merging it but as it would need more detailed testing as the distribution through the CRC32 and Adler32 tends to form clusters so I would be rating it as easy.


step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

return [((h1 + i * step) % bit_count) for i in range(hash_count)]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this double hashing suited for bloom filter?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I feel double hashing is a standard way to build Bloom filters. I used it because it lets us simulate as many hash functions as we need using just two base hashes, which basically cuts the workload down from O(k) to O(1) which makes it very fast. It also makes the code run way faster since we aren't doing 8 or 16 different hash calculations for every single item.

Comment thread chuck/tasks/memory_index/task.py Outdated
@Aaryan-Dadu
Copy link
Copy Markdown
Member

Please add Fixes #Issue-Number in your PR description so that the issue gets linked to the PR.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
chuck/tasks/memory_index/task.py (1)

12-18: ⚠️ Potential issue | 🟠 Major

Validate Bloom filter parameters before probe generation.

bit_count <= 0 can crash modulo math, and hash_count <= 0 yields an empty probe list that makes all(...) in solve always true for probes.

Proposed fix
 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
-    
+    if bit_count <= 0:
+        raise ValueError("bit_count must be > 0")
+    if hash_count <= 0:
+        raise ValueError("hash_count must be > 0")
+
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)

     step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

     return [((h1 + i * step) % bit_count) for i in range(hash_count)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 12 - 18, The probe-generation
code doesn't validate inputs and can crash or produce an empty probe list; add
validation at the start of the probe-generation function (the function that
encodes value and computes h1, h2, step and returns the list of probes) to
ensure bit_count > 0 and hash_count > 0 (and that both are integers), and raise
a clear ValueError if they are invalid; keep the existing step logic (1 + (h2 %
(bit_count - 1)) if bit_count > 1 else 1) and then compute and return the probes
as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 12-18: The probe-generation code doesn't validate inputs and can
crash or produce an empty probe list; add validation at the start of the
probe-generation function (the function that encodes value and computes h1, h2,
step and returns the list of probes) to ensure bit_count > 0 and hash_count > 0
(and that both are integers), and raise a clear ValueError if they are invalid;
keep the existing step logic (1 + (h2 % (bit_count - 1)) if bit_count > 1 else
1) and then compute and return the probes as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad2b0ba0-9e35-4e24-be45-9e762770de62

📥 Commits

Reviewing files that changed from the base of the PR and between 2e273af and bd87290.

📒 Files selected for processing (1)
  • chuck/tasks/memory_index/task.py

@Aaryan-Dadu
Copy link
Copy Markdown
Member

Thanks for the PR!

@Aaryan-Dadu Aaryan-Dadu changed the base branch from main to dev April 12, 2026 05:26
@Aaryan-Dadu Aaryan-Dadu merged commit 1c5384e into iiitl:dev Apr 12, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace per-item sha256 + blake2b in memory_index bloom filter

2 participants