Home  Collections intro  Lists  Maps  Sets  Sorting  Hashing  Advanced

Using collections in Java: sets

A List such as an ArrayList is useful in cases where we want to store some arbitrary large group of objects in a fixed order, and possibly refer to those objects by their position in the list. But a list is not so good for another type of problem where:

  • we don't necessarily care about ordering;
  • we don't care about the number of times that an object is in the collection;
  • we just want to know if a given object is in the collection or not.

A collection in which an object may be "present or not" (but not present multiple times) is called a set.

Why use a set?

It may have occurred to you that you could use a plain old list for the above purpose. After all, List has a contains() method to see if a given object is in the list, and to preserve the present-or-not condition, we could just check before adding if the given object was already in the list, and if so not add it again. So what's the point of a "set"?

The point is simply efficiency: if your only necessary criterion is "present or not" and you don't need the extra functionality of a list, then it is possible to use a more efficient data structure. That's what Java's set implementations do.

Creating and using a set in Java

The syntax for creating a set is very similar to that of a list: there's an interface called Set, which defines what methods any set implementation must have. Then there are concrete implementations of which the most commonly used is called HashSet:

Set<String> urlsProcessed = new HashSet<String>(500);

Just as with the list, we specify the type of object that the set will hold by putting the class name in angle brackets. Then, the compiler can check that we're adding the correct type of object, and can automatically cast objects that we take from the set into the appropriate type.

We also pass in an estimate of the number of elements that we want to add to the set. We didn't bother with the list (but could and arguably should have done). Giving a number of elements is not essential for collections to work, but can often make them more efficient. It's important to note that this number is just an estimate for the sake of efficiency. It's not an absolute limit, as with an array. We can add more items than our initial estimate.

Common uses for a set

Here, we call our set urlsProcessed. A common use for a set is to remember which items in a list have 'already been seen' so that they are not processed more than once. Imagine a program that spiders web pages by pulling the list of links out of each page it comes across, then going through the linked pages in turn: we need to only process a URL if it hasn't already been processed, otherwise we'd risk going round in an infinitive loop. With our set, we can write something like this to ensure each URL gets processed a maximum of once:

public void processURLs(List<String> urls) {
  for (String url : urls) {
    if (!urlsProcessed.contains(url)) {
      process(url);
      urlsProcessed.add(url);
    }
  }
}

If we don't mind adding the URL to the set just as we're going to process it, then we can actually write the following:

public void processURLs(List<String> urls) {
  for (String url : urls) {
    if (urlsProcessed.add(url)) {
      process(url);
    }
  }
}

This is because the add() method returns a boolean to indicate whether the item was actually added to the set (returning true if it is added because it wasn't already present, or false if it was not added this time because it was already present). For reasons we'll see shortly, it turns out that for a HashSet it takes practically as long to check if an item is in a set as to actually add it, so writing the code this way makes handling the set almost twice as fast in the case where the URL hasn't already been processed. As in many cases, you need to weigh up whether a slight gain in efficiency is worth the loss in readability of the code...

Enumerating the contents of a set

As well as checking for individual items, we can iterate over all the items in a set with syntax very similar to that of iterating over a list:

System.out.println("URLS processed:");
for (String url : urlsProcessed) {
  System.out.println(url);
}

Recall that a set (and particularly a HashSet) is generally unordered. In other words, when you iterate over the contents of a set as in this example, you won't necessarily get the items out in the same order that you put them in (or in any other well-defined order).

Next...

On the next page, we look at a third type of Java collection, namely the map. Maps are used in all sorts of programming situations where we want to create associations between objects.


Written by Neil Coffey. Copyright © Javamex UK 2008. All rights reserved.