Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions chuck/tasks/memory_index/task.py
Original file line number Diff line number Diff line change
@@ -1,16 +1,20 @@
from __future__ import annotations

from hashlib import blake2b, sha256
import zlib
from random import Random
from typing import Any

from ...common import TaskSpec


def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
digest_a = int.from_bytes(blake2b(value.encode(), digest_size=8).digest(), "big")
digest_b = int.from_bytes(sha256(value.encode()).digest()[:8], "big")
return [((digest_a + index * digest_b) % bit_count) for index in range(hash_count)]
raw = value.encode()
h1 = zlib.crc32(raw)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have you decided to move with CRC32 as a hash function?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose CRC32 because we don't need cryptographic security for a Bloom filter, just fast and uniform bit distribution. It’s hardware-accelerated in Python and much lighter than SHA-256 which is why we got that 2.09x speed.

Copy link
Copy Markdown
Member

@Aaryan-Dadu Aaryan-Dadu Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you see there is slight decrease in reliability? Could you confirm that the false positives will plateau or not.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i have thought about this and i realized the tiny drop in reliability is just because CRC32 has slightly more collisions than SHA-256, but it's still way above our 85% floor. The false positives will definitely plateau because the error rate is tied to the bit-to-item ratio, which we have kept at 16x. Once the bitset reaches that density, the rate stabilizes and stays predictable rather than drifting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from where did you got the floor of 85%, a loss of 15% reliability is not a good option that too for a 2x speedup.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh am sorry i did not convey it properly in the earlier message,the reliability did not drop by 15%. The actual benchmark showed a tiny delta of only -0.000504 (the overall reliability is still around 99.7%).I just mentioned 85% because reliability_floor=0.85 is hardcoded at the bottom of the task.py file,so basically we got 2.09x without hurting the accuracy at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I am less satisfied with that, I am ok with merging it but as it would need more detailed testing as the distribution through the CRC32 and Adler32 tends to form clusters so I would be rating it as easy.

h2 = zlib.adler32(raw)

step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

return [((h1 + i * step) % bit_count) for i in range(hash_count)]
Comment on lines +15 to +17
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate bit_count and hash_count before generating probe positions.

bit_count <= 0 can break modulo math, and hash_count <= 0 yields an empty probe list, which makes all(...) true for every query in solve and inflates positives.

Proposed fix
 def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
     """Computes probe positions for a Bloom filter using double hashing."""
+    if bit_count <= 0:
+        raise ValueError("bit_count must be > 0")
+    if hash_count <= 0:
+        raise ValueError("hash_count must be > 0")
+
     raw = value.encode()
     h1 = zlib.crc32(raw)
     h2 = zlib.adler32(raw)

     step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1

     return [((h1 + i * step) % bit_count) for i in range(hash_count)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]
def _bloom_hashes(value: str, bit_count: int, hash_count: int) -> list[int]:
"""Computes probe positions for a Bloom filter using double hashing."""
if bit_count <= 0:
raise ValueError("bit_count must be > 0")
if hash_count <= 0:
raise ValueError("hash_count must be > 0")
raw = value.encode()
h1 = zlib.crc32(raw)
h2 = zlib.adler32(raw)
step = 1 + (h2 % (bit_count - 1)) if bit_count > 1 else 1
return [((h1 + i * step) % bit_count) for i in range(hash_count)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/memory_index/task.py` around lines 16 - 18, Validate and reject
invalid parameters before computing probe positions: check that bit_count > 0
and hash_count > 0 in the function that computes the probe list (the code using
bit_count, hash_count, h1, h2 and producing the list comprehension), and if
either is invalid raise a ValueError (or return a clear error) so you don't
perform modulo by zero or return an empty list; update callers (e.g., solve) to
handle the error if necessary and ensure the step computation and list
comprehension only run when bit_count > 0 and hash_count > 0.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this double hashing suited for bloom filter?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I feel double hashing is a standard way to build Bloom filters. I used it because it lets us simulate as many hash functions as we need using just two base hashes, which basically cuts the workload down from O(k) to O(1) which makes it very fast. It also makes the code run way faster since we aren't doing 8 or 16 different hash calculations for every single item.



def generate(count: int, seed: int) -> dict[str, Any]:
Expand Down
Loading