Home  Collections intro  Lists  Maps  Sets  Which collection class?  Sorting  Hashing  Advanced
 Video lecture: hash tables  Bloom filters

How to use a Java set: the HashSet

On the previous page, we said that a set was a collection in which an object can be present or not.

Constructing a Set

As with Java collections in general, Set is an interface (see the introduction to using Lists for an overview of the notion of interface). When we actually construct a set, we need to pick a specific type.

The usual type of set to choose is the HashSet. The name HashSet comes from the fact that the class uses a technique called hashing to quickly add, find and remove items from the set. As with lists, we can use Java 5's generics feature to specific that we want a set of a particular type of object. So here, we'll create a set of strings:

Set<String> urlsProcessed = new HashSet<String>(500);

We also pass in an estimate of the number of elements that we want to add to the set. We didn't bother with the list (but could and arguably should have done). Giving a number of elements is not essential for collections to work, but can often make them more efficient. It's important to note that this number is just an estimate for the sake of efficiency. It's not an absolute limit, as with an array. We can add more items than our initial estimate.

Adding items to the set...

Now that we have our set, whenever we want to add an item to the set (a string in this case), we call its add() method:

String url = "http://www.javamex.com/";
urlsProcessed.add(url);

...and finding them again

Whenever we want to find out if a particular item is in the set, we use the contains() method:

if (urlsProcessed.contains(url)) {
  // ...
}

And as you might have guessed, whenever we want to remove an item, we can use the remove() method:

urlsProcessed.remove(url);

Common uses for a set

Here, we call our set urlsProcessed. A common use for a set is to remember which items in a list have 'already been seen' so that they are not processed more than once. Imagine a program that spiders web pages by pulling the list of links out of each page it comes across, then going through the linked pages in turn: we need to only process a URL if it hasn't already been processed, otherwise we'd risk going round in an infinitive loop. With our set, we can write something like this to ensure each URL gets processed a maximum of once:

public void processURLs(List<String> urls) {
  for (String url : urls) {
    if (!urlsProcessed.contains(url)) {
      process(url);
      urlsProcessed.add(url);
    }
  }
}

What if I add the same item more than once?

Recall that the point of a set is that it only allows a given item to be added once. If you try and add the same item agan, the add() method will effectively ignore the request. But its return value will tell you if the item was actually added (because it wasn't there before). In the next section, we'll see that we can use this return value to our advantage in certain cases.

A slight performance improvement

The previous example will work fine in many cases. However, it is slightly suboptimal in terms of performance. To check if an item is in the set, the contains() method must make a calculation (specifically, it must calculate something called the hash code). Then, the add() method must also calculate the same hash code to decide where to place the item. Ideally, we'd just like to calculate the hash code once.

Well, we said that we can't add an item to the set more than once, and that the add() method tells us whether or not the item was actually added because it wasn't already there. So if we don't mind adding the URL to the set just as we're going to process it, then we can do the following:

  • add the item before we decide whether to process it;
  • look at the return value from add():
    • if it returns true, that means that the item was added (because it wasn't already in the set), and so we do need to process the URL;
    • if it returns false, then the item wasn't added this time because it was already there— that obviously means we don't need to process it.

So the code looks something as follows:

public void processURLs(List<String> urls) {
  for (String url : urls) {
    if (urlsProcessed.add(url)) {
      process(url);
    }
  }
}

Enumerating the contents of a set

As well as checking for individual items, we can iterate over all the items in a set with syntax very similar to that of iterating over a list:

System.out.println("URLS processed:");
for (String url : urlsProcessed) {
  System.out.println(url);
}

Recall that a set (and particularly a HashSet) is generally unordered. In other words, when you iterate over the contents of a set as in this example, you won't necessarily get the items out in the same order that you put them in (or in any other well-defined order).

Next...

On the next page, we look at a third type of Java collection, namely the map. Maps are used in all sorts of programming situations where we want to create associations between objects.