On the previous page, we explained the potential benefit of a 64-bit hash code: if the hash code is "strong" enough, then different objects are practically guaranteed to have unique hash codes for reasonably-sized collections of objects. This means that we can create hash tables and other related structures that can store only the hash codes of their keys, as opposed to the standard Java implementation which must store the key.

But how do we actually implement a strong hash function? Our goal is to create a hash function which,
given two objects, will tend to assign each a different hash code, and where those hash codes will be
*dispersed*: the relationship between an object and its hash code appears to all intents and purposes "random".
From this point of view, there is a close relationship between a hash function and a typical
software random number generator (RNG): the latter attemps to take a number and
from it produce the next number in the series with a seemingly "random" (but actually
predictable) relationship between the two. We can therefore see a hash function as an RNG
that starts with a fixed seed and at each step of the way, **produces the next "random" number in
the series and combines it with the next byte or field from the data to be hashed**.

One way of combining the data with an RNG is to use a
linear congruential generator (LCG).
Recall that in an LCG, each successive "random" number is usually produced by multiplying the
previous number by a fixed, carefully chosen *multiplier* and then adding a constant.
On each iteration, instead of adding the same constant, we can instead combine a constant dependent on
the next byte or item from the data to be hashed. (We change the constant, rather than the multiplier,
bceause the quality of an LCG is highly dependent on the multiplier.)

An implementation of this scheme suggested in Numerical Recipes (Section 7.6) uses a table of 256
random values indexed by successive bytes in the data, and recommends a multiplier suitable for an LCG
with a modulus of 2^{64} (which is effectively what we have when we mulitply using a 64-bit
`long`). A possible implementation in Java runs along these lines:

public class DataHasher { private static final long[] byteTable = createLookupTable(); private static final long HSTART = 0xBB40E64DA205B064L; private static final long HMULT = 7664345821815920749L; public static long hash(byte[] data) { long h = HSTART; final long hmult = HMULT; final long[] ht = byteTable; for (int len = data.length, i = 0; i < len; i++) { h = (h * hmult) ^ ht[data[i] & 0xff]; } return h; } }

As mentioned, the value of *HMULT* is found to be good in practice with 64-bit LCGs.
It has roughly half its bits set and is "virtually" prime (it is composed of three prime factors).
The value of *HSTART* isn't motivated by the authors, and I believe that essentially
any value would do. But what about the *byteTable*?

Strong hash code implementation: continued

Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.