Search this site:
How Deflater works
On this page we look briefly at how the compression algorithm which underlies Java's Deflater works. As mentioned in the introduction, Deflater is actually a wrapper around the zlib library, which implements the deflate algorithm.
The deflate algorithm is essentially a combination of two mechanisms:
The compressor works within a "window" of data. It looks at the upcoming data to be compressed, and looks to see if it already encountered a matching sequence in the data recently processed. For example, consider the following sequence, where the character in bold is the character currently being processed, and those to the left have already been processed:
A B A B C D A B A B C D A...
The compressor sees that it has a sequence "AB" to compress, and that that sequence occurred 2 characters ago. So it can encode a reference looking something like (2, 2) to mean "same sequence as two characters ago, length 2 characters" (of course, that in turn was also probably encoded as a repetition of the earlier AB sequence). However, if it is configured to search deeper, it might also find that "ABC"– or even "ABCD"– occurred 6 characters ago. As you can see, how exactly the compressor matches upcoming input against previous sequences is something that is up to the Deflater (in fact the zlib) implementation, and it turns out that it can be configured to some extent. Specifically, we can configure a Deflater's compression level to indicate how far the deflater looks for matching sequences, resulting in a tradeoff between CPU and compression ratio.
If a character occurs which the compressor decides doesn't match any previous sequence (possibly because it doesn't look far enough), then it is simply encoded as itself.
As you can see from this description, the deflate algorithm will generally favour streams with repeating sequences of bytes. (Note that we've used letters for the sake of illustration, but all byte values are treated equally by the compressor: it doesn't specifically favour those representing printable characters.)
After applying dictionary compression, Huffman encoding is applied to the output. On the next page, we look at Huffman encoding in the DEFLATE algorithm.
Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved.