A strong 64-bit hash function in Java
Consider a hash function and some random bunch of strings or objects that we apply that
hash function to. In some situations, it would be advantageous if we could guarantee to all
intents and purposes that each string or object will have a unique hash code. In that situation,
we would not need to store the actual keys in a hash table, because the hash code alone would
in effect identify that object.
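As a taste of what this buys us, the sketch below keys a map on a 64-bit hash alone, never storing the original string. The `strongHash64` function here is purely an illustrative stand-in (a simple polynomial hash over `long`s), not an actually strong hash; the point is only the shape of the technique.

```java
import java.util.HashMap;
import java.util.Map;

public class HashKeyedStore {
    // Illustrative stand-in only: a simple 64-bit polynomial hash.
    // A genuinely strong 64-bit hash is the subject of this article.
    static long strongHash64(String s) {
        long h = 1125899906842597L;  // arbitrary odd starting value
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // The map is keyed on the 64-bit hash alone: the original
        // string key is never stored in the table.
        Map<Long, Integer> wordCounts = new HashMap<>();
        wordCounts.put(strongHash64("apple"), 3);
        wordCounts.put(strongHash64("banana"), 5);
        System.out.println(wordCounts.get(strongHash64("apple")));  // 3
    }
}
```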
On this page we look at what we'll call a strong hash function¹. A strong hash function
is one that produces hash codes with good dispersal. In other words, given a random set
of objects to hash, there is a high chance that our hash function will produce corresponding
hash codes that are well dispersed over the possible range of hash codes. Hence:
- whatever the size of the hash table, the number of collisions will be close to the number that
we would "theoretically" expect;
- for a given hash code width (i.e. the number of bits in the hash code), we can predict
how likely it is for the above-mentioned goal to be met, i.e. for each object in a typical
random selection to be given a unique hash code.
Notice that Java's hash table implementation (and hence implementations of hashCode())
doesn't require this goal to be met. Java maps and sets, rather than storing just the hash code,
store the actual key object. This means that implementations of hashCode()
generally only need to be "fairly good". It isn't the end of the world if two key objects have the
same hash code, because the keys themselves are also compared in deciding whether a match has been found.
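This fallback is easy to demonstrate: even if every key returns the same hashCode(), a HashMap still keeps the entries apart because it consults equals() after matching hash codes. A minimal example:

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeys {
    // A key type whose hashCode() deliberately collides for all instances.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
        @Override public int hashCode() { return 42; }  // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        // Both entries survive: HashMap falls back on equals() to tell them apart.
        System.out.println(map.size());              // 2
        System.out.println(map.get(new Key("a")));   // first
    }
}
```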
The limitations of 32-bit hash codes
Unfortunately, with 32-bit hash codes as returned by the Java hashCode() function, doing away with
storing the keys will not usually be an option: even with the best 32-bit hash function in the world, the chance
of a duplicate is non-negligible with fairly modest numbers of objects. Table 1 shows, for differently
sized collections of objects, the approximate percentage chance of two or more of those objects
resolving to the same hash code given a strong hash function:
| Number of objects | Chance of a hash code collision |
|-------------------|---------------------------------|
| 512               | 0.0031%                         |
| 16,384            | 3.08%                           |
| 32,768            | 11.8%                           |
| 65,536            | 39.3%                           |
Table 1: For n random objects with a good 32-bit hash function,
approximate chance of at least 2 of the objects having the same hash code.
For very small numbers of items added to a hash table keyed only on a 32-bit hash code,
the risk of a collision (or chance of needing to take special action) may be acceptable.
For example, with 512 items in the table, we'd expect about a 1 in 32,000 chance of a collision
(0.0031%). On the other hand, adding 16,384 items means there's a 3.08% or about 1 in 32 chance
of two different objects sharing the same hash code. For many applications, this is a modest
number of items and too high a probability.
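These percentages come from the standard "birthday problem" approximation: for n items drawn from 2^b possible hash codes, the chance of at least one collision is approximately 1 - e^(-n²/2^(b+1)). A short sketch reproducing the 32-bit figures quoted above:

```java
import java.util.Locale;

public class BirthdayBound {
    // Approximate chance that n random values drawn from 2^bits possible
    // hash codes contain at least one duplicate: p ≈ 1 - exp(-n^2 / 2^(bits+1)).
    static double collisionChance(long n, int bits) {
        double pairs = (double) n * n / 2.0;
        return 1.0 - Math.exp(-pairs / Math.pow(2.0, bits));
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "%.4f%%%n", 100 * collisionChance(512, 32));
        System.out.printf(Locale.ROOT, "%.2f%%%n", 100 * collisionChance(16_384, 32));
        System.out.printf(Locale.ROOT, "%.1f%%%n", 100 * collisionChance(65_536, 32));
    }
}
```

Running this prints 0.0031%, 3.08% and 39.3% respectively, matching the table.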
Why 64-bit hash codes?
The pattern of percentages shown in Table 1 essentially holds for hash codes of greater
widths too. Or put another way, with an n-bit hash code, there's a roughly 39.3% chance
of a collision with 2^(n/2) items, an 11.8% chance with 2^(n/2 - 1) items,
a 3.08% chance with 2^(n/2 - 2) items etc. Or to give a concrete example, using a 56-bit hash
we reach the 0.0031% chance of a collision after adding 2^(56/2 - 7) = 2^21 =
2,097,152 items. Or put another way, with a good 56-bit hash code, we can safely generate hashes for
a few hundred thousand objects and expect those hash codes to be unique.
Rather than use 56-bit hash codes, we may as well use the next size up that is convenient
to manipulate: 64 bits. After all, if we used 56-bit hash codes in Java or practically any programming
language, we'd probably end up storing them in 64-bit longs anyway. With 64-bit hash codes,
we can store 2^(32 - 7) (over 30 million) items in our hash table before there is a 0.0031%
chance of a collision. With one million items and a perfect 64-bit hash function,
the chance of getting a collision is 1 in 3.7×10^7, or roughly half as likely as
winning the UK National Lottery jackpot.
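These 64-bit figures can be checked with the same birthday approximation (for small probabilities, p ≈ n²/2^(b+1)):

```java
import java.util.Locale;

public class SixtyFourBitOdds {
    public static void main(String[] args) {
        // 2^(32 - 7) = 2^25 items before the collision chance reaches ~0.0031%.
        long items = 1L << 25;
        System.out.println(items);  // 33554432: "over 30 million"

        // For one million items and a perfect 64-bit hash, the collision
        // chance is about n^2 / 2^65, i.e. roughly 1 in 3.7e7.
        double n = 1_000_000.0;
        double p = n * n / Math.pow(2.0, 65);
        System.out.printf(Locale.ROOT, "1 in %.1e%n", 1.0 / p);
    }
}
```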
Implementation of a 64-bit strong hash code
On the next page, we look at an example implementation of a 64-bit hash code.
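That implementation isn't reproduced here, but as a flavour of what a strong (though not secure) 64-bit hash can look like, here is a sketch of one well-known example, 64-bit FNV-1a. Note this is offered only as an illustration; it is not necessarily the implementation discussed on the next page.

```java
import java.nio.charset.StandardCharsets;

public class Fnv1a64 {
    // Standard 64-bit FNV-1a constants.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // 64-bit FNV-1a over the UTF-8 bytes of a string.
    static long hash(String s) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);  // XOR in the next byte
            h *= FNV_PRIME;    // multiply by the FNV prime (wraps mod 2^64)
        }
        return h;
    }

    public static void main(String[] args) {
        // The empty string hashes to the offset basis itself.
        System.out.printf("%016x%n", hash(""));              // cbf29ce484222325
        System.out.println(hash("hello") != hash("world"));  // true
    }
}
```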
1. It should be noted that by "strong", we don't necessarily mean secure. In other words,
it doesn't matter to us if the hash function is reversible, i.e. if, for a given hash code,
it is feasible to compute a string/object that has that hash code. By contrast, so-called secure hash functions
such as the SHA series aim to make such reversal unfeasible. A secure hash function is also a strong hash
function, but not necessarily vice versa.