Advanced use of hash codes in Java: doing away with the keys
On the previous page, we discussed some hash code
statistics: specifically the number of collisions that we expected if we took
n hash codes from a given range of possible values. We mentioned, for example, that
if we take in the order of 100,000 random 32-bit hash codes, we'd expect in the order of 2 collisions,
with 8 collisions if we take about 200,000, 32 if we take 400,000 etc. Recall that we're talking
purely about the number of pairs1 of objects with identical hash codes, not about the
actual number of collisions in the hash table (i.e. the number of buckets that
would have more than one item), which we expect to be much greater. A couple of interesting
possibilities emerge at this point:
- We could use the hash code as the key to the hash table instead of the
actual object. This would mean, of course, that there'd be a risk that we'd ask for the object
mapped to a given key, and actually get the wrong value (i.e. one mapped to another
object with the same hash code). But the risk would be calculable, and
might be small enough for some applications.
- If the number of collisions was too great for the number of objects we want to key on, we
could use a wider hash code. For example, using 64-bit hash codes would mean we
could take in the order of a billion random objects before expecting to
get a hash code collision (we'd expect an average of 2 collisions with 233 = around 8 billion objects).
So why just store hash codes rather than the keys if we know it's not completely accurate?
Well, partly for speed (comparing two numbers is quicker than comparing two
strings) but mainly to reduce memory usage in the case where our keys are relatively
large. Potentially we can also reduce the number of objects created to store
the structure, if we're prepared to write our own map implementation: if our keys are
primitive types (ints or doubles), we can actually store them in a
single array. Of course, storing only the hash codes would also mean that we couldn't iterate
over the keys as we can with a regular Java hash map or hash set. So the technique would
only be suitable for retrieving the mapped values, or for determining if a
given key is in the map.
On the next pages, we consider keying on hash code
and using using hash codes with
a BitSet to significantly reduce memory usage for the purpose of duplicate elimination.
1. We're of course making an approximation: we're not distinguishing between the case
of n colliding pairs, and the case where more than 2 objects have an identical hash code.
Our calculations are based on the assumption that each collision involves a different pair of colliding objects.
For most practical cases, this turns out to be a good approximation.