Problems with ZIP files (in Java)
The ZIP file format is a widely accepted and useful format. However, it does have
a couple of quirks that you should be aware of. One concerns the encoding of filenames, the other time zones.
Encoding of filenames
Many operating systems now allow non-ASCII characters in filenames.
If you're not too familiar with character sets and character encoding, ASCII characters
are essentially unaccented letters, numbers and a few symbols (plus a few "unprintable"
control codes that we'll ignore for now). As bytes, they're usually all encoded
in the range 0-127. Or put less technically, they're the characters "available
on the average 1980s micro sold in the US".
In the early days of operating systems such as DOS, and on various home computing
platforms, using anything other than these characters (accented characters, other
alphabets etc.) was not standardised and not well supported. Putting accented
characters, or even spaces, in file names was just "something that you didn't do",
and often wasn't actually possible. So formats such as ZIP tended not to worry too
much about the issue.
Nowadays, it generally is possible to put almost any character in
file names and in strings in general, and there are various standards for encoding
them (common ones include Unicode encoded as UTF-8, and ISO-8859-1, which is a
one-byte-per-character encoding for various European languages). Unfortunately, for
ZIP files there has traditionally been no standard encoding, and no reliable way to
indicate which encoding you've used. So different tools will generally pick some
encoding arbitrarily. If the tool that creates the ZIP file writes file names in the
same encoding that the reading tool expects, then all is generally merry. If not,
beware the dragons.
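To make the mismatch concrete, here is a minimal sketch that imitates it without involving a ZIP file at all: a name is turned into bytes with one encoding (as the writing tool would do) and turned back into a string with another (as the reading tool would do). The class and method names are made up for this illustration, and the choice of UTF-8 and ISO-8859-1 is only an example of two tools disagreeing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FilenameMojibakeDemo {

    /** Encodes a name as the writing tool would, then decodes it as the reading tool would. */
    static String roundTrip(String name, Charset writerCharset, Charset readerCharset) {
        byte[] stored = name.getBytes(writerCharset);   // the bytes that end up in the archive
        return new String(stored, readerCharset);       // what the reader believes the name is
    }

    public static void main(String[] args) {
        String original = "caf\u00e9.txt";              // "cafe" with an acute accent on the e

        // Same encoding on both sides: the name survives.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Written as UTF-8 but read as ISO-8859-1: the single accented letter
        // comes back as two unrelated characters (U+00C3 followed by U+00A9).
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Names made up purely of ASCII characters come out the same in both cases, which is why the problem often goes unnoticed until an accented name turns up.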
Java currently expects UTF-8 encoding. This generally means that
if you create a ZIP file in Windows, where tools have typically written file names in a local code page rather than UTF-8, filenames with non-ASCII characters will be mangled.
The issue has been raised as Java bug ID 4244499, with a proposal to add a ZipFile constructor to take the character encoding
(assuming the caller knows it). At the moment, a possible solution is to
use the Arcmexer library, which allows you to set the
file name encoding. This also has the advantage of being able to read encrypted ZIP files.
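If you are on Java 7 or later, the standard java.util.zip classes (ZipFile, ZipInputStream and ZipOutputStream) also have constructors that take a Charset for entry names, so when the caller does know the encoding it can be passed in directly. The sketch below writes a small archive in memory with names encoded in one charset and reads the names back with the same charset. The class and method names are hypothetical, and ISO-8859-1 is used only because it is guaranteed to be available; a real Windows-created archive may well use a different code page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipNameCharsetDemo {

    /** Writes a one-entry ZIP archive in memory, encoding the entry name with the given charset. */
    static byte[] writeZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists the entry names of an archive, decoding them with the charset the caller assumes. */
    static List<String> listNames(byte[] zipData, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipData), nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

For example, listNames(writeZip("caf\u00e9.txt", StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1) gives back the original name. This only helps if you actually know (or can guess) which encoding the creating tool used.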
My advice is generally:
Don't use accents or other non-ASCII characters in filenames! [1]
You really don't need them that much.
One historical problem is that filenames were never really intended to be a human-readable "title",
but increasingly, that's how many "average users" are treating them. For the time being, there's
no elegant solution to this problem, and the best we can really do is live with filenames with
missing diacritics or occasional spelling changes to fit ASCII.
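If you want to apply that advice mechanically before adding files to an archive, one possible approach is to strip diacritics with java.text.Normalizer and replace whatever non-ASCII characters remain. This is only a sketch: the class and method names are invented here, and the underscore used as a replacement character is an arbitrary choice.

```java
import java.text.Normalizer;

final class AsciiFilenames {

    /** Returns true if every character of the name is in the ASCII range 0-127. */
    static boolean isAscii(String name) {
        return name.chars().allMatch(c -> c < 128);
    }

    /**
     * Produces an ASCII-only version of a name: accented letters lose their
     * diacritics, and any other non-ASCII character becomes an underscore.
     */
    static String toAscii(String name) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Anything still outside ASCII has no simple base letter; replace it.
        return withoutMarks.replaceAll("[^\\x00-\\x7F]", "_");
    }
}
```

With this sketch, an accented "e" simply becomes "e", while a character such as the German sharp s, which has no decomposition, becomes an underscore. That is the "occasional spelling change" in action, so a real tool might prefer a small table of language-specific substitutions.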
A similar issue occurs with time zones. Timestamps in ZIP files are traditionally stored
as a local date and time in MS-DOS format (year, month, day, hour, minute and second, to a
resolution of two seconds), and there's no way to say "relative to which time zone".
Most ZIP tools will simply use the time zone of the local machine, both when writing
the ZIP file and when reading it, so a file created in one time zone and read in
another can show shifted timestamps.
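The effect is easy to reproduce with java.time. The sketch below takes one stored local date and time and interprets it in two different zones, which is in effect what happens when the writing and reading machines disagree. The class and method names and the two zones are assumptions chosen for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class ZipTimeZoneDemo {

    /** Interprets a zone-less timestamp as if it had been recorded in the given zone. */
    static Instant interpret(LocalDateTime stored, ZoneId assumedZone) {
        return stored.atZone(assumedZone).toInstant();
    }

    /** How far apart two readers end up when they assume different zones for the same stored value. */
    static Duration disagreement(LocalDateTime stored, ZoneId zoneA, ZoneId zoneB) {
        return Duration.between(interpret(stored, zoneA), interpret(stored, zoneB));
    }

    public static void main(String[] args) {
        LocalDateTime stored = LocalDateTime.of(2020, 1, 15, 12, 0);
        // The same stored fields denote different moments depending on the assumed zone.
        System.out.println(interpret(stored, ZoneId.of("Europe/Paris")));
        System.out.println(interpret(stored, ZoneId.of("America/New_York")));
        System.out.println(disagreement(stored,
                ZoneId.of("Europe/Paris"), ZoneId.of("America/New_York")));
    }
}
```

In Java itself, ZipEntry.getTime() and setTime() work in milliseconds since the epoch and convert to and from the stored local fields using the JVM's default time zone, so the same archive can report different times on differently configured machines.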
Again, a possible solution is to use Arcmexer, which allows
you to specify the locale in which the zip file is assumed to have been created, and
will convert times to the current locale accordingly.
1. By the way, call me a fuddy-duddy, but
I would also say don't use spaces in filenames. I guess
I just use command-line tools too much.