Skip to content

Bad collisions for frozenset hashes #150134

@tim-one

Description

@tim-one

Bug report

Bug description:

This came up long ago when I was working on hardening the tuple hash, but got dropped on the floor.

It recently came up again in the context of frozendict haahing, which just copied much of the frozenset hashing code:

#149841

The problem is that shuffle_bits() only propagates changes from low-order bits to higher. Any function using only left shifts and multiplies suffers from this.

But hash codes may differ only in high-order bits to begin with. Avalanche is horrid in the absence of a stronger bit shuffler, which also propagates higher-to-lower.

It so happens that hash codes of low-precision floats do suffer from this.

>>> len(set(hash(i/256) for i in range(256))
256

So they're all different, but the variation is only in the topmsot bits:

>>> for i in range(10):
...     print(hex(hash(i/256)))
...
0x0
0x20000000000000
0x40000000000000
0x60000000000000
0x80000000000000
0xa0000000000000
0xc0000000000000
0xe0000000000000
0x100000000000000
0x120000000000000
>>>

So frozenset struggles with them. First block of output from the code at the end, which analyzes the hashes of frozensets of all ways of taking 3 elements from those 256 floats (of course it doesn't matter if the same values are spelled as Fraction and/or Decimal instead).

number of possible hash codes: 18_446_744_073_709_551_616

 number of hash codes total: 2_763_520
number of unique hash codes: 2_048
       number of collisions: 2_761_472

mean expected collisions: 0.00
            and its sdev: 0.00
                       z: +6069500754.14

worst pileups:
    2_020
    2_015
    2_005
    2_005
    2_002
    2_001
    1_999
    1_993
    1_988
    1_980

The z score is crazy high, but "pileups" matter much more than that. That's the number of distinct inputs that map to the same hash code. It's the length of collision chains the drive catastrop;ic slowdowns, not really the number of collisions.

The second block of output shows what happens instead if we just xor together the hashes of 1-tuples containing the same values:

number of possible hash codes: 18_446_744_073_709_551_616

 number of hash codes total: 2_763_520
number of unique hash codes: 2_753_525
       number of collisions: 9_995

mean expected collisions: 0.00
            and its sdev: 0.00
                       z: +21968232.90

worst pileups:
    4
    4
    3
    3
    3
    3
    3
    3
    3
    3

The z score is still very high, but max pileup has become quite minor. Move to 4-element subsets and max pileup for frzenset zooms to over 100 thousand, while the xor of 1-tuple hashes peaks at a modest 13.

Tuple hashing doesn't do full-blown xxHash. In the full version, there's a long stretch at the end to work harder at avalanche. The per-element version tuple uses skips that. There's some useful left-to-right propagation via its rotate operation, but for 1-element tuples that's a "one and done" thing. High-order bits can't affect low-order ones further away than the one-tiee rotate count, Its effect in this context is nevertheless dramatic.

So what to do? If the problem is lack of left-to-right propagation, introduce some. Everyone else does ;-)

  • Roll our own. Even one ju8icious right shift could help a lot. Rohate (as in xxHash) even more (rotate propagates in both directions).

  • Look at adopting the tail end of xxHaah.

  • The C++ Boost library's combine_hashes() also propagates in both directions.

  • As does the "tempering" code before the Mersenne Twister delivers a result. The underlying generator is very slow at propagating bit changes across the state space, so they tacked on a "heavy avalanching sequence" to disguise that weakness. But even so it can very famously fall into "zeroland regions" (if most of state space bits happen to become zero, it cah take millions of iterations to break free of that).

As for frozendict, it doesn't appear to need any of this. The current PR uses a much stronger scheme based on copy/pasting two rounds of tuple's xxHash code, one round for the key hash and another to fold in the value hash. That's bound to be stronger than just (as in the attached code) xor'ing the hashes of 1-tuples, and it's almost certainly a complete waste of cycles to go on to call shuffle_bits() now. The propagation it does is weak (and zero of left-to-rght) compared to xxHash's core.

At a higher level, the spread of copy/pasted "mystery code" across modules is concerning. I'm old enough to know where all the bodies are buried, but the next generation isn't. If we have to duplicate code (and typically without enough comments to trace the code back to its original source), let's at least stick it in a new include file? Then there's just one place to look for comprehensive explanations and cautions - and no possibility of getting out of sync by accident. This kind of code is the opposite of "obvious" 😉.

Driving code:

from test import support
from collections import Counter
import math

def fint(n):
    return f"{n:_}"

def analyze(hashcounts, nbins=1<<support.NHASHBITS):
    print("number of possible hash codes:", fint(nbins))
    assert all(hashcounts.values()) # counts of 0 not allowed
    print()
    numtotal = hashcounts.total()
    numseen = len(hashcounts)
    ncollision = numtotal - numseen
    print(" number of hash codes total:", fint(numtotal))
    print("number of unique hash codes:", fint(numseen))
    print("       number of collisions:", fint(ncollision))
    print()
    mean, sdev = support.collision_stats(nbins, numtotal)
    print("mean expected collisions:", format(mean, ".2f"))
    print("            and its sdev:", format(sdev, ".2f"))
    diff = ncollision - mean
    z = diff / sdev
    print("                       z:", format(z, "+.2f"))
    print()
    print("worst pileups:")
    for h, c in hashcounts.most_common(10):
        print("   ", fint(c))

N, K = 8, 3
from itertools import combinations
def create(N=N, K=K):
    c = Counter()
    pow2 = 1 << N
    small = tuple(i / pow2 for i in range(pow2))
    for t in combinations(small, K):
        c[hash(frozenset(t))] += 1
    assert c.total() == math.comb(pow2, K)
    return c

def create2(N=N, K=K):
    c = Counter()
    pow2 = 1 << N
    small = tuple(i / pow2 for i in range(pow2))
    for t in combinations(small, K):
        h = 0
        for e in t:
            h ^= hash(tuple([e]))
        c[h] += 1
    assert c.total() == math.comb(pow2, K)
    return c

analyze(create())
analyze(create2())

CPython versions tested on:

CPython main branch, 3.14, 3.13

Operating systems tested on:

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    interpreter-core(Objects, Python, Grammar, and Parser dirs)performancePerformance or resource usagetype-bugAn unexpected behavior, bug, or error
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions