Saving memory used by Java strings: a one-byte-per-character `CharSequence` implementation

As mentioned in our discussion of how to save memory occupied by Java Strings, many methods and operations will actually work on any old CharSequence: it doesn't specifically have to be a String. In case you're not familiar with it, CharSequence is an interface that requires us to provide methods to:

retrieve the length of the sequence;
retrieve the character at a given position;
create a CharSequence instance representing a given subsequence;
create a String from the CharSequence.

Library classes that store sequences of characters (String, StringBuffer, StringBuilder, CharBuffer) already implement CharSequence. But significantly, if we define some object holding our string that has just these methods, then we are free to store the characters more compactly if we can. And the resulting object can be used whenever a method requires only a CharSequence: i.e. as a drop-in replacement for String when storing, printing, writing to file— even running regular expressions on the string.

To implement our character sequence, we can fairly easily do the following:

store the characters of our string in a byte array, with one byte per character (i.e. we only allow ASCII characters or some range of 256 characters, not full Unicode!);
provide the above wrapper methods to pull appropriate bytes out of the array and convert to characters.

We'll call our class the CompactCharSequence and it will start as follows:

public class CompactCharSequence implements CharSequence, Serializable {
  static final long serialVersionUID = 1L;

  private static final String ENCODING = "ISO-8859-1";
  private final int offset;
  private final int end;
  private final byte[] data;

  public CompactCharSequence(String str) {
    try {
      data = str.getBytes(ENCODING);
      offset = 0;
      end = data.length;
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException("Unexpected: " + ENCODING + " not supported!");
    }
  }

  public char charAt(int index) {
    int ix = index+offset;
    if (ix >= end) {
      throw new StringIndexOutOfBoundsException("Invalid index " +
        index + " length " + length());
    }
    return (char) (data[ix] & 0xff);
  }

  public int length() {
    return end - offset;
  }
}

This part should hopefully be fairly straightforward. We provide a constructor with which we can create an instance of CompactCharSequence, passing in the String containing the characters we want to store. We could of course allow something else such as a char array to be passed in. When passing in a String, we use the getBytes() method to retrieve the contents of the string as a byte array, asking for a 1-byte-per-character encoding. In this case, ISO-8859-1 is the standard 1-byte-per-character encoding used for many European languages (notably except Russian). Notice that we store also an offset and end, because in our implementation we choose to share the array when subsequences are created. That's just a design decision: we could instead copy a section of the array each time and dispense with the two extra fields.

The charAt() and length() methods are then fairly trivial: we just need to be careful to note that the first character of the string isn't necessarily the first character of the array in our implementation, and similarly that the last character of the sequence may actually not be the last character of the array. To convert a byte to a char, we AND with 255 (0xff hex), which effectively removes the sign.

Creating subsequences

We now need to provide a method to create a subsequence of a given sequence. We said that we're going to share the array when we do so. So firstly, we provide an extra private constructor, which takes not only the array of chars but also a start offset and end point in the array:

private CompactCharSequence(byte[] data, int offset, int end) {
  this.data = data;
  this.offset = offset;
  this.end = end;
}

Now, our public subSequence() method is trivial:

public CharSequence subSequence(int start, int end) {
  if (start < 0 || end >= (this.end-offset)) {
    throw new IllegalArgumentException("Illegal range " +
      start + "-" + end + " for sequence of length " + length());
  }
  return new CompactCharSequence(data, start + offset, end + offset);
}

Implementing `toString()`

Finally, we need to override toString() to provide an actual String representation of the CharSequence. For this, we can just pass the char array into a String constructor, remembering to specify the encoding in which the bytes are stored, and the start offset in the array plus the length of the string:

public String toString() {
  try {
    return new String(data, offset, end-offset, ENCODING);
  } catch (UnsupportedEncodingException e) {
    throw new RuntimeException("Unexpected: " + ENCODING + " not supported");
  }
}

And voilà! We have a class that will store a string in roughly half the memory of a String, and is almost as convenient to use. We can pass instances of CompactCharSequence to any method that will take a CharSequence, notably PrintWriter.println() and the regular expression API. When implementing our own methods that work on strings, we should generally consider making them accept any CharSequence rather than specifically a String. And in the worst case, we can convert a CompactCharSequence to a String "on the fly" if we need to pass it to some method that requires a String.

Other methods to implement

As it stands, our CompactCharSequence wouldn't be suitable for use as a key to a hash map (a typical use of String). I actually recommend CompactCharSequence primarily for cases where we need to cache in memory the content of something such as a web page or file in string form and treat it as content, rather than where we are using the character sequence "funcionally" for something such as a hash map key. But if we did require the latter functionality, then we would need to override hashCode() and equals().

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey

Saving memory used by Java strings: a one-byte-per-character CharSequence implementation

Creating subsequences

Implementing toString()

Other methods to implement

Saving memory used by Java strings: a one-byte-per-character `CharSequence` implementation

Implementing `toString()`