Regular expression example: IP location (ctd)

This page continues our example of extracting the country code from a Google referrer string.

google.com with a language code

In this case, we want to match only if the domain is exactly www.google.com. For flexibility, we make the http:// protocol prefix optional. After the domain name, any number of characters may precede the hl=XX parameter, and any number may follow it. So we put .* before and after the parameter, with the two-letter code itself inside a capturing group, and our expression looks as follows:

Pattern p = Pattern.compile("(?:http://)?www\\.google\\.com/.*hl=([a-z]{2}).*");
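As a minimal sketch of how this pattern behaves, the class below applies it to a few referrer strings. The class name HlParamExample, the method name languageCode and the sample URLs are made up for illustration; only the pattern itself comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HlParamExample {
    // The pattern from the text: optional http://, exact host www.google.com,
    // then an hl=XX parameter somewhere after the slash.
    private static final Pattern HL_PATTERN = Pattern.compile(
        "(?:http://)?www\\.google\\.com/.*hl=([a-z]{2}).*");

    /** Returns the two-letter hl value, or null if the referrer does not match. */
    static String languageCode(String referrer) {
        Matcher m = HL_PATTERN.matcher(referrer);
        // matches() requires the whole string to match, which is why the
        // pattern needs .* on either side of the hl parameter.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical referrer strings, for illustration only.
        System.out.println(languageCode("http://www.google.com/search?q=java&hl=fr&start=10"));
        System.out.println(languageCode("www.google.com/search?hl=de"));
        System.out.println(languageCode("http://www.google.co.uk/search?hl=en"));
    }
}
```

The first two calls return the captured code (fr and de); the third returns null because the domain is not exactly www.google.com.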

Country suffix on the domain

In the second case, we match any Google domain ending in a two-letter code, possibly preceded by .com (cf .com.au) or .co (cf .co.in and .co.uk). To handle the optional .com or .co, we create a non-capturing group containing the two alternatives separated by a pipe, and put a ? after the group to make it optional. This part of the expression is thus:

(?:\.com|\.co)?

Note the backslash before each dot: we don't want the dot to mean "any character" in this case. In an actual Java string, each backslash must be doubled, as below.
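To see why the escaping matters, the following sketch compares the escaped group with an unescaped variant. The class and method names are hypothetical, and the unescaped version is included only as a counter-example.

```java
import java.util.regex.Pattern;

public class DotEscapeExample {
    /** The group from the text: "\\." in Java source is the regex \. (a literal dot). */
    static boolean matchesEscaped(String s) {
        return Pattern.matches("(?:\\.com|\\.co)?", s);
    }

    /** Counter-example: without the backslashes, each dot matches any character. */
    static boolean matchesUnescaped(String s) {
        return Pattern.matches("(?:.com|.co)?", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesEscaped(".com"));   // literal ".com"
        System.out.println(matchesEscaped(""));       // the group is optional
        System.out.println(matchesEscaped("xcom"));   // rejected: no literal dot
        System.out.println(matchesUnescaped("xcom")); // accepted: . matches 'x'
    }
}
```

Both versions accept .com, .co and the empty string, but only the unescaped one also accepts a string such as xcom.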

Putting the whole thing together, we get the following expression:

Pattern p = Pattern.compile("(?:http://)?" +
    "www\\.google(?:\\.com|\\.co)?\\.([a-z]{2})/.*");

Notice that after the two-letter code (inside the capturing group), we specify a slash to make sure this really is the final suffix of the domain name. After that, we put .*: in this case, we don't care about any parameters.
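The sketch below tries this second pattern against a few sample referrers. As before, the class name CountrySuffixExample, the method name countryCode and the URLs are assumptions made for illustration; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountrySuffixExample {
    // The pattern from the text: optional http://, www.google, an optional
    // .com or .co, then a two-letter suffix that must be followed by a slash.
    private static final Pattern SUFFIX_PATTERN = Pattern.compile(
        "(?:http://)?" +
        "www\\.google(?:\\.com|\\.co)?\\.([a-z]{2})/.*");

    /** Returns the two-letter domain suffix, or null if the referrer does not match. */
    static String countryCode(String referrer) {
        Matcher m = SUFFIX_PATTERN.matcher(referrer);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical referrer strings, for illustration only.
        System.out.println(countryCode("http://www.google.co.uk/search?q=java"));
        System.out.println(countryCode("http://www.google.com.au/search?q=java"));
        System.out.println(countryCode("www.google.fr/"));
        System.out.println(countryCode("http://www.google.com/search?hl=fr"));
    }
}
```

The first three calls return uk, au and fr respectively. The last returns null: plain www.google.com has no two-letter suffix before the slash, so it is left to the hl-based pattern. Note that a referrer with no slash after the domain would also fail to match, since the pattern requires one.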

On the next page, we put these multiple expressions together into a function to get the country code from a referrer string, comparing against all three of the patterns as necessary.



Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.