This section looks at a structure often called a **Bloom filter**^{1}.
A Bloom filter is an implementation of a set which allows a certain probability
of **false positives** when determining if a given object is a member of the set,
in return for a reduction in the amount of memory required for the set.
It is essentially an extension of the idea of using hash codes as an index to a BitSet for the purpose of duplicate
elimination.

It effectively works as follows:

- we allocate
to represent the set data;*m*bits - we write a hash function that, instead of a single hash code,
produces
for a given object;*k*hash codes - to add an object to the set, we derive
**bit indexes from all**and set those bits;*k*hash codes - to determine if an object is in the set, we again calculate the corresponding
hash codes and bit indexes, and say that it is present if and only if
.*all*corresponding bits are set

The margin of error (or *false positive rate*) thus comes from the fact that
as we add more and more objects to the set, we increase the likelihood of "accidentally"
setting a combination of bits that corresponds to an object that isn't actually
in the set. In an example of using a bloom filter to represent a set of strings,
we find that allocating 8 bits per item gives us a false positive rate
of around 4.9% with 2 hash codes and 2.3% with 4 hash codes^{2}. Within certain
parameters, we can achieve a certain false positive rate either by
increasing the number of bits per item (hence overall memory used) or by
increasing the number of hash codes (and hence CPU usage required per lookup).

Bloom filters are useful for "weeding out" items. For example, when placed on top of a database query, they can be used to avoid database "hits" for items that we know in advance (because our Bloom filter reports that they are not present) are not in the database. Because of the "false positive" phenomenon, some acceptable percentage of unnecessary database lookups will still occur, but the number of unnecessary lookups will be reduced. This tradeoff is worthwhile because accessing an area of memory locally is usually much less resource-intensive than performing a database query, which may involve disk accesses and potentially a connection to a remote server.

Now we have introduced the concept, we look at an example Bloom filter implementation in Java. Later on, we also look at the false positive rate of our Bloom filter, and thus how to tune it for our purposes.

1. Bloom filters take their name from Burton Bloom, who appears to be have been
the first to formally describe such structures in his paper *Space/Time Trade-offs
in Hash Coding with Allowable Errors* (1970). The name "Bloom filter" appears
to be a later coining, since the original paper does not give any
specific name for the structure other than "Method 2" of two "hash-coding methods
with allowable errors". ("Method 1" was in effect a hash set that stores only
the hash codes of items.)

2. Theoretical bounds on positive rates can be calculated given a particular size of filter, number of
items and number of hash codes. As I'll discuss below, because of the complexity
of factors involved— including the nature of the data and the actual quality
of the hash codes in a real life situation— I find it simpler and more useful to give
data from a real-life simulation.

Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.