perf: optimize bloom filter using double hashing and zlib by Phantom0299 · Pull Request #36 · iiitl/chuck

Phantom0299 · 2026-04-11T18:18:38Z

Optimized the Bloom Filter by swapping slow cryptographic hashes (Blake2b/SHA256) for hardware-accelerated CRC32 and Adler32.

Key improvements:

Implemented Double Hashing ($h_i(x) = (h_1(x) + i \cdot h_2(x)) \pmod m$) to maintain accuracy while significantly reducing CPU overhead.
Minimized redundant work by encoding input strings only once per item.
Verified results via benchmark: processed 10,000 items in ~0.094s (approx. 106k ops/sec).

Logic verified with pytest; false positive rates remain within the reliability floor.

Summary by CodeRabbit

Refactor
- Swapped the internal hashing mechanism used by in-memory indexing for a faster checksum-based approach, reducing CPU during index operations.
- Preserved all public interfaces, outputs, and task behavior so user-facing functionality and compatibility remain unchanged.

Fixes #7

coderabbitai · 2026-04-11T18:18:58Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

The change replaces cryptographic hash functions (blake2b, sha256) with non-cryptographic checksums (zlib.crc32, zlib.adler32) in the Bloom-filter probe-position computation, deriving multiple probe positions from two 32-bit checksums via linear combination modulo bit count.

Changes

Cohort / File(s)	Summary
Bloom Filter Hash Replacement `chuck/tasks/memory_index/task.py`	Replaced cryptographic digests with `zlib.crc32` and `zlib.adler32`; `_bloom_hashes` now encodes input once, derives two 32-bit hashes, computes a dynamic step (`1 + (h2 % (bit_count - 1))` when `bit_count > 1`), and yields `hash_count` probe positions via `((h1 + i * step) % bit_count)`. Imports adjusted; `generate`/`solve` control flow and signatures unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: replacing cryptographic hashes with zlib-based non-cryptographic hashing and implementing double hashing for the Bloom filter to improve performance.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

chuck/tasks/memory_index/task.py (1)

13-14: Avoid zero-step double hashing to prevent degenerate single-bit probing.

When h2 % bit_count == 0, all hash_count probes collapse to one position for that key. A non-zero normalized step keeps probe spread consistent.

♻️ Suggested change

 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)
-    return [((h1 + i * h2) % bit_count) for i in range(hash_count)]
+    step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
+    return [((h1 + i * step) % bit_count) for i in range(hash_count)]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 13 - 14, The double-hash probe
currently computes h2 = zlib.adler32(raw) and uses i * h2 directly, which can
produce a zero step when h2 % bit_count == 0 and collapse all probes; change the
normalization so the step is always non-zero (e.g., compute a step = (h2 %
(bit_count - 1)) + 1 or, at minimum, if (h2 % bit_count) == 0 set step = 1) and
then generate probes using ((h1 + i * step) % bit_count) for i in
range(hash_count) to ensure probing spreads across multiple positions (update
the code around h1, h2, bit_count, hash_count).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Line 79: Add a single trailing newline at the end of the file to satisfy
pre-commit's end-of-file-fixer; locate the file that ends with the lone closing
parenthesis ")" (the end of the block/function in task.py) and ensure the file
ends with a newline character so CI stops reporting the EOF-fixer failure.
- Around line 12-14: The change to the hashing probe order (using crc32/adler32
via the code in chuck/tasks/memory_index/task.py that computes h1/h2 and returns
the probe list) alters probe distribution and will break the strict equality
check in chuck/regression.py against data/memory_index/regression.json; fix by
either regenerating the expected metrics in data/memory_index/regression.json to
match the new hashing outputs produced by the memory_index task, or modify
chuck/regression.py to compare probabilistic or derived fields with a numeric
tolerance (instead of strict equality) for the affected fields produced by the
probe/hash logic.

---

Nitpick comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 13-14: The double-hash probe currently computes h2 =
zlib.adler32(raw) and uses i * h2 directly, which can produce a zero step when
h2 % bit_count == 0 and collapse all probes; change the normalization so the
step is always non-zero (e.g., compute a step = (h2 % (bit_count - 1)) + 1 or,
at minimum, if (h2 % bit_count) == 0 set step = 1) and then generate probes
using ((h1 + i * step) % bit_count) for i in range(hash_count) to ensure probing
spreads across multiple positions (update the code around h1, h2, bit_count,
hash_count).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82ad2d49-e080-4769-97c9-bdeaef0a2e3d

📥 Commits

Reviewing files that changed from the base of the PR and between a302e5f and a0f1bff.

📒 Files selected for processing (1)

chuck/tasks/memory_index/task.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 16-18: Validate and reject invalid parameters before computing
probe positions: check that bit_count > 0 and hash_count > 0 in the function
that computes the probe list (the code using bit_count, hash_count, h1, h2 and
producing the list comprehension), and if either is invalid raise a ValueError
(or return a clear error) so you don't perform modulo by zero or return an empty
list; update callers (e.g., solve) to handle the error if necessary and ensure
the step computation and list comprehension only run when bit_count > 0 and
hash_count > 0.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 96835de0-18cd-45f3-820a-2bfa7a5d454b

📥 Commits

Reviewing files that changed from the base of the PR and between 57cb7cd and 2e273af.

📒 Files selected for processing (1)

chuck/tasks/memory_index/task.py

coderabbitai · 2026-04-11T18:58:09Z

+    step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
+
+    return [((h1 + i * step) % bit_count) for i in range(hash_count)]


⚠️ Potential issue | 🟠 Major

Validate bit_count and hash_count before generating probe positions.

bit_count <= 0 can break modulo math, and hash_count <= 0 yields an empty probe list, which makes all(...) true for every query in solve and inflates positives.

Proposed fix

def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]: """Computes probe positions for a Bloom filter using double hashing.""" + if bit_count <= 0: + raise ValueError("bit_count must be > 0") + if hash_count <= 0: + raise ValueError("hash_count must be > 0") + raw = value.encode() h1 = zlib.crc32(raw) h2 = zlib.adler32(raw) step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1 return [((h1 + i * step) % bit_count) for i in range(hash_count)]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

return [((h1 + i * step) % bit_count) for i in range(hash_count)]

def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:

"""Computes probe positions for a Bloom filter using double hashing."""

if bit_count <= 0:

raise ValueError("bit_count must be > 0")

if hash_count <= 0:

raise ValueError("hash_count must be > 0")

raw = value.encode()

h1 = zlib.crc32(raw)

h2 = zlib.adler32(raw)

step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

return [((h1 + i * step) % bit_count) for i in range(hash_count)]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/memory_index/task.py` around lines 16 - 18, Validate and reject invalid parameters before computing probe positions: check that bit_count > 0 and hash_count > 0 in the function that computes the probe list (the code using bit_count, hash_count, h1, h2 and producing the list comprehension), and if either is invalid raise a ValueError (or return a clear error) so you don't perform modulo by zero or return an empty list; update callers (e.g., solve) to handle the error if necessary and ensure the step computation and list comprehension only run when bit_count > 0 and hash_count > 0.

Aaryan-Dadu · 2026-04-11T19:37:06Z

Please go through the contributors guideline properly. Specifically, https://github.com/iiitl/chuck/blob/main/CONTRIBUTING.md#ab-comparison-workflow

Phantom0299 · 2026-04-11T20:09:40Z

A/B Comparison Results (as per contributing guidelines):

old snapshot: original SHA256/Blake2b implementation
new snapshot: optimized CRC32/Adler32 with double hashing

memory_index: 2.09x speedup (52.23% faster)

before: 0.050243s
after: 0.024003s
reliability delta: -0.000504 (within acceptable floor)

All other tasks unaffected.

Comparison command used:
python -m chuck compare --old data/reports/snapshots/old.json --new data/reports/snapshots/current.json

Aaryan-Dadu · 2026-04-11T20:21:18Z

-    return [((digest_a + index * digest_b) % bit_count) for index in range(hash_count)]
+    """Computes probe positions for a Bloom filter using double hashing."""
+    raw = value.encode()
+    h1 = zlib.crc32(raw)


Why have you decided to move with CRC32 as a hash function?

I chose CRC32 because we don't need cryptographic security for a Bloom filter, just fast and uniform bit distribution. It’s hardware-accelerated in Python and much lighter than SHA-256 which is why we got that 2.09x speed.

But do you see there is slight decrease in reliability? Could you confirm that the false positives will plateau or not.

Yeah i have thought about this and i realized the tiny drop in reliability is just because CRC32 has slightly more collisions than SHA-256, but it's still way above our 85% floor. The false positives will definitely plateau because the error rate is tied to the bit-to-item ratio, which we have kept at 16x. Once the bitset reaches that density, the rate stabilizes and stays predictable rather than drifting.

from where did you got the floor of 85%, a loss of 15% reliability is not a good option that too for a 2x speedup.

Oh am sorry i did not convey it properly in the earlier message,the reliability did not drop by 15%. The actual benchmark showed a tiny delta of only -0.000504 (the overall reliability is still around 99.7%).I just mentioned 85% because reliability_floor=0.85 is hardcoded at the bottom of the task.py file,so basically we got 2.09x without hurting the accuracy at all.

Sorry but I am less satisfied with that, I am ok with merging it but as it would need more detailed testing as the distribution through the CRC32 and Adler32 tends to form clusters so I would be rating it as easy.

Aaryan-Dadu · 2026-04-11T20:24:28Z

+
+    step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
+
+    return [((h1 + i * step) % bit_count) for i in range(hash_count)]


Is this double hashing suited for bloom filter?

Yeah, I feel double hashing is a standard way to build Bloom filters. I used it because it lets us simulate as many hash functions as we need using just two base hashes, which basically cuts the workload down from O(k) to O(1) which makes it very fast. It also makes the code run way faster since we aren't doing 8 or 16 different hash calculations for every single item.

Aaryan-Dadu · 2026-04-11T21:07:14Z

Please add Fixes #Issue-Number in your PR description so that the issue gets linked to the PR.

coderabbitai

♻️ Duplicate comments (1)

chuck/tasks/memory_index/task.py (1)

12-18: ⚠️ Potential issue | 🟠 Major

Validate Bloom filter parameters before probe generation.

bit_count <= 0 can crash modulo math, and hash_count <= 0 yields an empty probe list that makes all(...) in solve always true for probes.

Proposed fix

 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
-    
+    if bit_count <= 0:
+        raise ValueError("bit_count must be > 0")
+    if hash_count <= 0:
+        raise ValueError("hash_count must be > 0")
+
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)

     step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

     return [((h1 + i * step) % bit_count) for i in range(hash_count)]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 12 - 18, The probe-generation
code doesn't validate inputs and can crash or produce an empty probe list; add
validation at the start of the probe-generation function (the function that
encodes value and computes h1, h2, step and returns the list of probes) to
ensure bit_count > 0 and hash_count > 0 (and that both are integers), and raise
a clear ValueError if they are invalid; keep the existing step logic (1 + (h2 %
(bit_count - 1)) if bit_count > 1 else 1) and then compute and return the probes
as before.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@chuck/tasks/memory_index/task.py`:
- Around line 12-18: The probe-generation code doesn't validate inputs and can
crash or produce an empty probe list; add validation at the start of the
probe-generation function (the function that encodes value and computes h1, h2,
step and returns the list of probes) to ensure bit_count > 0 and hash_count > 0
(and that both are integers), and raise a clear ValueError if they are invalid;
keep the existing step logic (1 + (h2 % (bit_count - 1)) if bit_count > 1 else
1) and then compute and return the probes as before.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad2b0ba0-9e35-4e24-be45-9e762770de62

📥 Commits

Reviewing files that changed from the base of the PR and between 2e273af and bd87290.

📒 Files selected for processing (1)

chuck/tasks/memory_index/task.py

Aaryan-Dadu · 2026-04-12T05:26:11Z

Thanks for the PR!

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

Comment thread chuck/tasks/memory_index/task.py Outdated

Comment thread chuck/tasks/memory_index/task.py Outdated

perf: optimize bloom filter using double hashing and zlib

2e273af

Phantom0299 force-pushed the perf/bloom-filter-optimization branch from ee3f190 to 2e273af Compare April 11, 2026 18:55

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

Aaryan-Dadu requested changes Apr 11, 2026

View reviewed changes

Phantom0299 added 2 commits April 12, 2026 02:52

style: remove docstrings

bd87290

style: fix trailing whitespace errors from docstring removal

0641e1c

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

mini-walkerx Bot mentioned this pull request Apr 11, 2026

Replace per-item sha256 + blake2b in memory_index bloom filter #7

Closed

Aaryan-Dadu changed the base branch from main to dev April 12, 2026 05:26

Aaryan-Dadu added the accepted-30 label Apr 12, 2026

Aaryan-Dadu merged commit 1c5384e into iiitl:dev Apr 12, 2026
2 checks passed

		step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

		return [((h1 + i * step) % bit_count) for i in range(hash_count)]

Conversation

Phantom0299 commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Aaryan-Dadu commented Apr 11, 2026

Uh oh!

Phantom0299 commented Apr 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Aaryan-Dadu Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Aaryan-Dadu commented Apr 11, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Aaryan-Dadu commented Apr 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Phantom0299 commented Apr 11, 2026 •

edited

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

Aaryan-Dadu Apr 11, 2026 •

edited

Loading