Bloom Filters: the false positive rate (analysis)
On the previous page we looked at the behaviour of the
false positive rate
of our Java Bloom filter implementation.
Ultimately, this behaviour is a combination of several factors:
not only the "theoretical" behaviour assuming truly random selection of objects
and perfect hash codes, but also, to a small extent, the actual distribution of possible object values
and the quality of our hash function. So I think that looking at some "real life" data
is important. However, for completeness, I will look briefly here at how we can
approximate the behaviour of the Bloom filter from a more mathematical point of view.
It turns out that the exact mathematics of how Bloom filters perform is rather
complex. Anyone with a deep interest in this subject is encouraged to consult
Bose et al. (2007), On the false-positive rate of Bloom filters. For our purposes,
we will make do with two simple approximations, which turn out to be good enough
in most cases.
Mathematical approximation of the false positive rate
There are two simple mathematical approximations to the false positive rate.
The logic in each case is similar and both approximations give similar results,
differing subtly in the assumptions they make.
The first approximation appears in Bloom's original 1970 paper.
Consider a Bloom filter with k hash codes and an underlying
bit set of m bits. This means that each time we add an item, we expect to
set k bits, or proportionally, k/m of the bits. So after the first item
is inserted, we assume the proportion of bits still not set is 1 - k/m.
After the second item is inserted, we assume that proportion will be (1 - k/m)(1 - k/m).
After the third item, (1 - k/m)(1 - k/m)(1 - k/m), etc., so that
after n items are inserted, we expect the proportion of bits still to
be clear to be (1 - k/m)^n.
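That estimate is easy to check empirically. Below is a minimal Java simulation sketch (the class name, method name and parameter values are my own choices for illustration, not part of the implementation discussed earlier) that sets k pseudo-randomly chosen bits per insertion and compares the observed proportion of clear bits with (1 - k/m)^n:

```java
import java.util.BitSet;
import java.util.Random;

public class ClearBitsCheck {

    // Simulate n insertions into an m-bit filter, each setting k
    // pseudo-randomly chosen bit positions, and return the observed
    // proportion of bits still clear afterwards.
    public static double simulateClearProportion(int m, int k, int n, long seed) {
        BitSet bits = new BitSet(m);
        Random rnd = new Random(seed);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                bits.set(rnd.nextInt(m));
            }
        }
        return 1.0 - (double) bits.cardinality() / m;
    }

    public static void main(String[] args) {
        int m = 1 << 20;   // 1,048,576 bits
        int k = 8;         // hash codes per item
        int n = 50_000;    // items inserted
        double observed  = simulateClearProportion(m, k, n, 42L);
        double predicted = Math.pow(1.0 - (double) k / m, n);
        System.out.printf("observed=%.4f predicted=%.4f%n", observed, predicted);
    }
}
```

With these parameters the observed and predicted proportions typically agree to within a fraction of a percent, which is why the approximation is serviceable despite the simplifications discussed next.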
This is an oversimplification on two grounds. It ignores the fact that very occasionally
two or more of the k bits set for a given item will coincide, so that we won't
actually change the state of k bits. (Clearly, this chance becomes vanishingly small
when m is much greater than k, which will generally be the case.)
Perhaps more significantly, it
assumes that bits are set independently of one another, whereas in reality,
the more bits that are set, the more chance of a bit "colliding" with one already set
and not altering the proportion of bits set. But for Bloom filters that aren't
"too full", it's a good enough approximation. Then, given a random item that we
happen to know isn't actually in the set, the chance that our Bloom filter will
erroneously report that it is present is equal to the likelihood that all corresponding
k bits for that item are set. In other words, if (1 - k/m)^n
is the proportion of bits that are clear, then for each individual bit to be
set it must be in the remaining 1 - (1 - k/m)^n proportion of bits.
Thus the overall likelihood of all k bits being set is assumed to be this
proportion multiplied by itself k times:
approx f. p. rate = (1 - (1 - k/m)^n)^k
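This formula is straightforward to translate into code. Here is a small Java sketch (the class name, method name and example parameters are mine, chosen for illustration) that computes the approximation for a given filter size, number of hash codes and number of insertions:

```java
public class BloomFilterMath {

    /**
     * Bloom's 1970 approximation to the false positive rate of a
     * Bloom filter with m bits and k hash codes after n insertions:
     * (1 - (1 - k/m)^n)^k.
     */
    public static double approxFalsePositiveRate(int m, int k, int n) {
        double clearProportion = Math.pow(1.0 - (double) k / m, n);
        return Math.pow(1.0 - clearProportion, k);
    }

    public static void main(String[] args) {
        // Example: a 1 MB (2^23 bit) filter with 8 hash codes
        // holding 500,000 items; prints roughly 4e-4.
        System.out.println(approxFalsePositiveRate(1 << 23, 8, 500_000));
    }
}
```

Plugging in a few values like this is a quick way to see how the rate climbs as the filter fills up: halving the number of items in the example above cuts the approximate false positive rate by well over an order of magnitude.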
The second approximation, and the one that appears to be most common in
descriptions of Bloom filters, follows a slightly different line of logic but arrives
at a similar result.