How the String hash function works (and implications for other hash functions)
In our introduction to hash codes, we mentioned the
principal design aim of a generic hash code. By generic, we
mean a hash code that will cope with fairly "random typical" input and distribute
the corresponding hash codes fairly randomly over the range of integers (4 bytes
in the case of Java). That way, whatever the size of our hash table, we assume that
the keys will be distributed reasonably evenly among the buckets. For those that
don't want to delve too deeply into the technical details, we gave some basic
guidelines for writing a hash function.
On this page, we will delve a little more deeply into the technical details
of hash codes. As an example, we'll look specifically at the hashCode() method
used by the Java String class.
Randomness, mathematically speaking
There's actually quite a lot of overlap between the field of
random number generators
and the field of hash functions. For now, we will start by borrowing
the concept of a randomness. We will say that a number (such as a hash code) is random
if all its bits have an equal and independent chance of being either 0 or 1. Another
general concept we'll borrow is that some initial randomness has to come from somewhere.
A random number generator, is generally designed to take some unpredictable seed–
such as the number of nanoseconds since the computer was switched on–
as its initial source of randomness, and then from that generate a further
(approximately) random sequence.
Where does randomness come from?
Now let's get back to hashing the characters of a string and let's look at some typical
character data. We'll consider generating a series of random strings, where each successive
character has a particular change of being within a particular range:
- 50% of the characters are digits from 0-9; if the character has a digit,
each has an equal chance of appearing (so overall, 5% of characters will be
the digit "0", 5% will be the digit "1" etc);
- otherwise, the character is a letter;
- when the character is a letter, each letter 'A' to 'Z' has an equal
change of appearing, but on each letter there is a 50% chance that that
letter will be lower rather than upper case.
Plot of (bit number, probability of that bit being set) for
an example character distribution
This distribution of characters is essentially made up for illustration
purposes. But the point is that any set of strings for a particular purpose will
generally have some characteristic distribution. (There may well be interactions between
successive characters too: for example in English, u is a relatively rare
letter, but after a q becomes extremely likely as the next letter. We won't
worry too much about this here.) Now, given this character distribution, we can calcualte
the probability of each bit being set for a random character in a string. The
graph above right shows this probability distribution.
It is clear from the graph that the bits of the characters are not
randomly distributed. If they were, we'd expect to see a flat line at 50%. Instead,
the first three or four bits "hover" around and below the 50% mark– i.e. are
"more or less random"– but higher bits are heavily biased towards particular
values. For a given character, there is a better than 70% chance that bit
4 will be set, and a 75% chance that bit 5 will be set. This lack of randomness
happens almost by design– the ASCII standard was built with certain
correspondences in mind, such as bit 5 being set to mark a lower case character.
So what can we do create more-or-less random hash codes? Well, we can take
those bits that are more or less random (the low bits) and, as we build
up the hash code, try and make
them interact so that they "spread the randomness out" over the integer.
On the next page we look in more detail at spreading
the random bits in a hash function.