This section looks at a structure often called a Bloom filter1. A Bloom filter is an implementation of a set which allows a certain probability of false positives when determining if a given object is a member of the set, in return for a reduction in the amount of memory required for the set. It is essentially an extension of the idea of using hash codes as an index to a BitSet for the purpose of duplicate elimination.
It effectively works as follows:
The margin of error (or false positive rate) thus comes from the fact that as we add more and more objects to the set, we increase the likelihood of "accidentally" setting a combination of bits that corresponds to an object that isn't actually in the set. In an example of using a bloom filter to represent a set of strings, we find that allocating 8 bits per item gives us a false positive rate of around 4.9% with 2 hash codes and 2.3% with 4 hash codes2. Within certain parameters, we can achieve a certain false positive rate either by increasing the number of bits per item (hence overall memory used) or by increasing the number of hash codes (and hence CPU usage required per lookup).
Bloom filters are useful for "weeding out" items. For example, when placed on top of a database query, they can be used to avoid database "hits" for items that we know in advance (because our Bloom filter reports that they are not present) are not in the database. Because of the "false positive" phenomenon, some acceptable percentage of unnecessary database lookups will still occur, but the number of unnecessary lookups will be reduced. This tradeoff is worthwhile because accessing an area of memory locally is usually much less resource-intensive than performing a database query, which may involve disk accesses and potentially a connection to a remote server.
Now we have introduced the concept, we look at an example Bloom filter implementation in Java. Later on, we also look at the false positive rate of our Bloom filter, and thus how to tune it for our purposes.
1. Bloom filters take their name from Burton Bloom, who appears to be have been
the first to formally describe such structures in his paper Space/Time Trade-offs
in Hash Coding with Allowable Errors (1970). The name "Bloom filter" appears
to be a later coining, since the original paper does not give any
specific name for the structure other than "Method 2" of two "hash-coding methods
with allowable errors". ("Method 1" was in effect a hash set that stores only
the hash codes of items.)
Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved. If you have any feedback on the Java collections tutorials in this section or about the content of this site in general, please leave a message on the Javamex forum.