This section looks at a structure often called a Bloom filter1.
A Bloom filter is an implementation of a set which allows a certain probability
of false positives when determining if a given object is a member of the set,
in return for a reduction in the amount of memory required for the set.
It is essentially an extension of the idea of using hash codes as an index to a BitSet for the purpose of duplicate
It effectively works as follows:
- we allocate m bits to represent the set data;
- we write a hash function that, instead of a single hash code,
produces k hash codes for a given object;
- to add an object to the set, we derive bit indexes from all k hash codes
and set those bits;
- to determine if an object is in the set, we again calculate the corresponding
hash codes and bit indexes, and say that it is present if and only if
all corresponding bits are set.
The margin of error (or false positive rate) thus comes from the fact that
as we add more and more objects to the set, we increase the likelihood of "accidentally"
setting a combination of bits that corresponds to an object that isn't actually
in the set. In an example of using a bloom filter to represent a set of strings,
we find that allocating 8 bits per item gives us a false positive rate
of around 4.9% with 2 hash codes and 2.3% with 4 hash codes2. Within certain
parameters, we can achieve a certain false positive rate either by
increasing the number of bits per item (hence overall memory used) or by
increasing the number of hash codes (and hence CPU usage required per lookup).
Bloom filters are useful for "weeding out" items. For example, when placed on top
of a database query, they can be used to avoid database "hits" for items that we
know in advance (because our Bloom filter reports that they are not present) are
not in the database. Because of the "false positive" phenomenon, some acceptable
percentage of unnecessary database lookups will still occur, but the number of
unnecessary lookups will be reduced. This tradeoff is worthwhile because accessing an
area of memory locally is usually much less resource-intensive than performing
a database query, which may involve disk accesses and potentially a connection
to a remote server.
Now we have introduced the concept, we look at an example Bloom filter implementation
in Java. Later on, we also look at the false
positive rate of our Bloom filter, and thus how to tune it for our purposes.
1. Bloom filters take their name from Burton Bloom, who appears to be have been
the first to formally describe such structures in his paper Space/Time Trade-offs
in Hash Coding with Allowable Errors (1970). The name "Bloom filter" appears
to be a later coining, since the original paper does not give any
specific name for the structure other than "Method 2" of two "hash-coding methods
with allowable errors". ("Method 1" was in effect a hash set that stores only
the hash codes of items.)
2. Theoretical bounds on positive rates can be calculated given a particular size of filter, number of
items and number of hash codes. As I'll discuss below, because of the complexity
of factors involved— including the nature of the data and the actual quality
of the hash codes in a real life situation— I find it simpler and more useful to give
data from a real-life simulation.