Consider a hash function and some random bunch of strings or objects that we apply that hash function to. In some situations, it would be advantageous if we could guarantee, to all intents and purposes, that each string or object will have a unique hash code. In that case, we would not need to store the actual keys in a hash table, because the hash code alone would in effect identify the object.

On this page we look at what we'll call a **strong hash function**^{1}. A strong hash function
is one that produces hash codes with good **dispersal**. In other words, given a random set
of objects to hash, there is a high chance that our hash function will produce corresponding
hash codes that are well dispersed over the possible range of hash codes. Hence:

- whatever the size of the hash table, the number of collisions will be close to the number that we would "theoretically" expect;
- for a given hash code **width** (i.e. the number of bits in the hash code), we can predict how likely it is that the above goal will be met: that is, for a typical random selection of objects, that each object will be given a unique hash code.

Notice that Java's hash table implementation, and hence implementations of `hashCode()`,
don't require this goal to be met. Java maps and sets, rather than storing just the hash code,
store the actual key object. This means that implementations of `hashCode()`
generally only need to be "fairly good". It isn't the end of the world if two key objects have the
same hash code, because the keys themselves are also compared when deciding whether a match has been found.
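To illustrate, here is a minimal sketch (the `Key` class is a made-up example, not from any library) in which two distinct keys deliberately share a hash code. A `HashMap` still keeps them separate, because it falls back on `equals()` to compare the stored key objects:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key class with a deliberately terrible hashCode():
// every instance hashes to the same value.
final class Key {
    final String name;
    Key(String name) { this.name = name; }
    @Override public int hashCode() { return 42; }  // all keys collide
    @Override public boolean equals(Object o) {
        return o instanceof Key && ((Key) o).name.equals(name);
    }
}

public class CollidingKeys {
    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        // Both keys share hash code 42, yet the map keeps them distinct,
        // because it compares the actual key objects with equals().
        System.out.println(map.get(new Key("a")));  // first
        System.out.println(map.get(new Key("b")));  // second
        System.out.println(map.size());             // 2
    }
}
```

The map degenerates to a linked list (or tree, in modern JVMs) within the single bucket, so lookups get slower, but correctness is preserved.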

Unfortunately, with 32-bit hash codes as returned by Java's `hashCode()` method, doing away with
storing the keys will not usually be an option: even with the best 32-bit hash function in the world, the chance
of a duplicate is non-negligible with fairly modest numbers of objects. Table 1 shows, for differently
sized collections of objects, the approximate percentage chance of two or more of those objects
resolving to the same hash code, given a strong hash function:

| Number of objects | Chance of a hash code collision |
|---|---|
| 65,536 (2^{16}) | 39.3% |
| 32,768 (2^{15}) | 11.8% |
| 16,384 (2^{14}) | 3.08% |
| 8,192 (2^{13}) | 0.78% |
| 4,096 (2^{12}) | 0.20% |
| 2,048 (2^{11}) | 0.049% |
| 1,024 (2^{10}) | 0.012% |
| 512 (2^{9}) | 0.0031% |

For very small numbers of items added to a hash table keyed only on a 32-bit hash code, the risk of a collision (or chance of needing to take special action) may be acceptable. For example, with 512 items in the table, we'd expect about a 1 in 32,000 chance of a collision (0.0031%). On the other hand, adding 16,384 items means there's a 3.08% or about 1 in 32 chance of two different objects sharing the same hash code. For many applications, this is a modest number of items and too high a probability.
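The figures in Table 1 come from the standard "birthday problem" approximation: with *k* objects and 2^{w} possible hash codes, the chance of at least one collision is approximately 1 − e^{−k(k−1)/2^{w+1}}. As a rough sketch (the class and method names here are ours, purely for illustration), this reproduces a couple of rows of the table:

```java
// Sketch: birthday-bound approximation of the collision probability
// for a strong w-bit hash function applied to k random objects.
public class BirthdayBound {
    // p ≈ 1 - exp(-k(k-1) / (2 * 2^bits))
    static double collisionChance(long items, int bits) {
        double buckets = Math.pow(2, bits);
        double exponent = -((double) items * (items - 1)) / (2.0 * buckets);
        return 1.0 - Math.exp(exponent);
    }

    public static void main(String[] args) {
        // Reproduce two rows of Table 1 for 32-bit hash codes.
        System.out.printf("%d items: %.2f%%%n",
                1 << 16, 100 * collisionChance(1 << 16, 32));  // ~39.35%
        System.out.printf("%d items: %.2f%%%n",
                1 << 14, 100 * collisionChance(1 << 14, 32));  // ~3.08%
    }
}
```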

The pattern of percentages shown in Table 1 essentially holds for hash codes of greater
widths too. Put another way, with an *n*-bit hash code, there's roughly a 39.3% chance
of a collision with 2^{n/2} items, an 11.8% chance with 2^{n/2 - 1} items,
a 3.08% chance with 2^{n/2 - 2} items, and so on. To give a concrete example, using a 56-bit hash,
we reach the 0.0031% chance of a collision after adding 2^{56/2 - 7} = 2^{21} =
2,097,152 items. In other words, with a good 56-bit hash code, we can safely generate hashes for
a couple of million objects and expect those hash codes to be unique.

Rather than use 56-bit hash codes, we may as well use the next size up that is convenient
to manipulate: 64 bits. After all, if we used 56-bit hash codes in Java or practically any programming
language, we'd probably end up storing them in 64-bit `long`s anyway. With 64-bit hash codes,
we can store 2^{32 - 7}, or over 30 million, items in our hash table before there is a 0.0031%
chance of a collision. With one million items and a perfect 64-bit hash function,
the chance of getting a collision is about 1 in 3.7x10^{7}, or roughly half as likely as
winning the UK National Lottery jackpot.
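When the probability is tiny, the birthday approximation simplifies to p ≈ k(k−1)/2N, where N is the number of possible hash codes. That is enough to check the one-in-3.7x10^{7} figure (the class here is again just an illustrative sketch of ours):

```java
// Sketch: for very small p, the birthday bound reduces to p ≈ k(k-1)/(2N).
public class SixtyFourBitOdds {
    static double collisionChance(double items, double buckets) {
        return items * (items - 1) / (2.0 * buckets);
    }

    public static void main(String[] args) {
        // One million items hashed into 2^64 possible codes.
        double p = collisionChance(1_000_000, Math.pow(2, 64));
        System.out.printf("Chance with 1,000,000 items: 1 in %.1e%n", 1 / p);
        // prints roughly 1 in 3.7e+07
    }
}
```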

On the next page, we look at an example implementation of a 64-bit hash code.

1. It should be noted that by "strong", we don't necessarily mean *secure*. In other words,
it doesn't matter to us if the hash function is reversible, i.e. if, for a given hash code,
it is feasible to compute a string/object that has that hash code. In contrast, so-called **secure hash functions**
such as the SHA series aim to make such reversal infeasible. A secure hash function is also a strong hash
function, but not necessarily vice versa.