Problems with ZIP files (in Java)

ZIP is a widely accepted, useful file format. However, it does have a couple of quirks that you should be aware of. Both relate in some way to internationalisation.

Encoding of filenames

Many operating systems now allow non-ASCII characters in filenames. If you're not too familiar with character sets and character encoding, ASCII characters are essentially unaccented letters, numbers and a few symbols (plus a few "unprintable" control codes that we'll ignore for now). As bytes, they're usually all encoded in the range 0-127. Or put less technically, they're the characters "available on the average 1980s micro sold in the US".

In the early days of operating systems such as DOS or various home computing platforms, using anything other than these characters (such as accented characters, other alphabets, etc.) was non-standardised and not well supported. Putting things such as accented characters and even spaces in file names was just "something that you didn't do", and was often not actually possible. So formats such as ZIP tended not to worry too much about the issue.

Unfortunately, nowadays it generally is possible to put any character in file names and in strings in general, and there are various different standards for dealing with this (common standards include Unicode encoded with UTF-8, ISO-8859-1 which is a one-byte-per-character encoding for various European languages, etc.). For ZIP files, there has traditionally been no standard encoding that tools agree on, and no reliable way to indicate which encoding you've used. So different tools will generally pick some encoding arbitrarily. If the tool that created the ZIP file used the same filename encoding as the tool that reads it, then all is generally merry. If not, beware the dragons.
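To make the "dragons" concrete, here is a minimal sketch of what happens when a filename is written in one encoding and read back in another. The class and method names (FilenameMojibake, misread) are made up for illustration, and the choice of UTF-8 for writing and ISO-8859-1 for reading is just one plausible mismatch, not what any particular ZIP tool does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates a name stored as UTF-8 bytes
// and then decoded by a reader that assumes ISO-8859-1.
class FilenameMojibake {

    static String misread(String name) {
        // What the writing tool stores in the archive
        byte[] stored = name.getBytes(StandardCharsets.UTF_8);
        // What a reading tool sees if it assumes a different encoding
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The e-acute is written as a Unicode escape so that the source file stays pure ASCII
        String original = "caf\u00e9.txt";
        String mangled = misread(original);
        // The single accented letter was stored as two bytes, so it comes back as two characters
        System.out.println(original.length() + " characters became " + mangled.length());
    }
}
```

Pure ASCII names survive this round trip unchanged, which is why the problem often goes unnoticed until the first accented filename turns up.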

Java currently expects UTF-8 encoding. This generally means that if you create a ZIP file in Windows, filenames with non-ASCII characters will be mangled.

The issue has been raised as Java bug ID 4244499, with a proposal to add a ZipFile constructor to take the character encoding (assuming the caller knows it). At the moment, a possible solution is to use the Arcmexer library, which allows you to set the file name encoding. This also has the advantage of being able to read encrypted ZIP files.
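On Java 7 and later, the java.util.zip classes also accept a Charset for entry names: ZipFile, ZipInputStream and ZipOutputStream each have a constructor taking one. The sketch below assumes you already know (or can guess) the encoding the archive was created with; the class and helper names are invented for the example, and ISO-8859-1 merely stands in for whatever legacy encoding the creating tool used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: round-trips an entry name through an in-memory ZIP.
class ZipNameCharsetDemo {

    // Builds a one-entry ZIP whose entry name is encoded with nameCharset.
    static byte[] writeZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the first entry name, decoding it with nameCharset.
    static String firstEntryName(byte[] zipBytes, Charset nameCharset) {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset assumed = StandardCharsets.ISO_8859_1; // assumption: encoding of the creating tool
        byte[] archive = writeZip("caf\u00e9.txt", assumed);
        // Reading with the same charset recovers the original name
        System.out.println("caf\u00e9.txt".equals(firstEntryName(archive, assumed)));
    }
}
```

For a ZIP file on disk, the equivalent is new ZipFile(file, charset). None of this detects the encoding for you: the caller still has to know it.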

My advice is generally:

Don't use accents or other non-ASCII characters in filenames[1]! You really don't need them that much.

One historical problem is that filenames were never really intended to be a human-readable "title", but increasingly, that's how many "average users" are treating them. For the time being, there's no elegant solution to this problem, and the best we can really do is live with filenames with missing diacritics or occasional spelling changes to fit ASCII.
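If you do want to force filenames into ASCII before zipping them, one rough approach is to strip the diacritics programmatically. The sketch below uses java.text.Normalizer to decompose accented letters and then drops the combining marks; the class and method names are invented for the example, and replacing whatever non-ASCII characters remain with an underscore is an arbitrary choice rather than any kind of standard.

```java
import java.text.Normalizer;

// Hypothetical helper class: folds a filename down to plain ASCII.
class AsciiFilenames {

    static String toAsciiName(String name) {
        // Split accented letters into base letter + combining mark (e.g. e + acute accent)
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Drop the combining marks, leaving the unaccented base letters
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");
        // Anything still outside ASCII (e.g. sharp s, non-Latin scripts) becomes an underscore
        return withoutMarks.replaceAll("[^\\x00-\\x7F]", "_");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself pure ASCII
        System.out.println(toAsciiName("caf\u00e9.txt"));
    }
}
```

Note that this can make two distinct names collide, so a real tool would also need to check for duplicates before adding entries to the archive.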

Time zones

A similar issue occurs with time zones. Timestamps in ZIP files are traditionally stored as a local date and time in MS-DOS format (Java converts them to and from milliseconds since the epoch in ZipEntry.getTime() and setTime()), but there's no way to say "relative to which time zone". Most ZIP tools will simply use the time zone of the local machine, both when writing the ZIP file and when reading it.

Again, a possible solution is to use Arcmexer, which allows you to specify the locale in which the zip file is assumed to have been created, and will convert times to the current locale accordingly.
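If you would rather stay within the standard library and you happen to know the time zone in which the archive was created, the correction can be sketched by hand with java.time (Java 8 and later). The idea is to recover the wall-clock date and time that was stored, then reinterpret it in the creator's zone. The class and method names below are invented, and the sketch assumes the value comes from ZipEntry.getTime() on an entry carrying only the traditional local-time timestamp, read on a machine whose default zone is readerZone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical helper class: corrects a ZIP entry time for a known creator time zone.
class ZipTimeZoneFix {

    static Instant reinterpret(long entryMillis, ZoneId readerZone, ZoneId creatorZone) {
        // Undo the reader's interpretation to get back the stored wall-clock fields
        LocalDateTime wallClock =
            LocalDateTime.ofInstant(Instant.ofEpochMilli(entryMillis), readerZone);
        // Reinterpret those same fields in the zone where the archive was written
        return wallClock.atZone(creatorZone).toInstant();
    }

    public static void main(String[] args) {
        ZoneId reader = ZoneId.systemDefault();
        ZoneId creator = ZoneId.of("America/New_York"); // assumption: known out of band
        long entryMillis = System.currentTimeMillis();  // stand-in for zipEntry.getTime()
        System.out.println(reinterpret(entryMillis, reader, creator));
    }
}
```

As with filename encodings, nothing in the file tells you the creator's zone, so this only helps when you know it from some other source.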


1. By the way, call me a fuddy-duddy, but I would also say don't use spaces in filenames. I guess I just use command-line tools too much.

Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved.