Regular expression example: IP location

This example shows how we can use a regular expression to guess the country code of the host requesting a web page by looking at the referrer string. The referrer string is essentially the URL of the page containing the link that the user clicked on to reach a given web page. Rightly or wrongly, most browsers pass this information on to the web server with every page request. Here are some examples of referrer strings:

http://es.search.yahoo.com/search?p=country+music
http://uk.search.yahoo.com/search?p=jacques+chirac
http://www.google.co.in/search?hl=en&q=java+programming
http://www.google.com.au/search?hl=en&q=sidney+shopping
http://www.google.bg/search?hl=bg&q=red+wine

As you can see, where a site has been reached via a search engine, we can look at which search engine the user was using as a clue to their location. Of course, this isn't perfect: there's nothing to stop a user from Spain from using a Bulgarian search engine or vice versa. But it turns out that many users are surprisingly patriotic about which search engine they use. If you are running a site that is primarily reached via search engines and you don't want to go to the hassle of installing a database of IP addresses to country codes, looking at the referrer string is a reasonable compromise. So let's see how we'd construct some regular expressions to pull out the country code from strings such as the above.

The Yahoo format referrer string

In these typical examples, Yahoo's format is a little simpler than Google's. For our purposes, we're really only interested in the two letters before search.yahoo.com; the rest of the host name appears to be fixed. So here is a possible expression:

Pattern p = Pattern.compile("http://" +
  "([a-z]{2})" +
  "\\.search\\.yahoo\\.com/.*");
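As a quick check of the expression above, here is a minimal sketch that tests whole referrer strings against it. The class name YahooReferrerCheck and the helper isYahooReferrer are hypothetical names chosen for illustration; only the pattern itself comes from the text.

```java
import java.util.regex.Pattern;

class YahooReferrerCheck {
    // The pattern from the text: "http://", two lowercase letters,
    // then the fixed host suffix and anything after the slash.
    private static final Pattern YAHOO = Pattern.compile("http://" +
        "([a-z]{2})" +
        "\\.search\\.yahoo\\.com/.*");

    // Hypothetical helper: true only if the ENTIRE referrer string
    // fits the pattern (matches() does not look for a partial match).
    static boolean isYahooReferrer(String referrer) {
        return referrer != null && YAHOO.matcher(referrer).matches();
    }
}
```

With this sketch, the two Yahoo examples listed earlier are accepted and the Google ones are rejected. A string without the leading http:// is also rejected, because this version of the pattern requires that prefix.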

In all the examples here, the URL is prefixed with http://. But we could be more flexible by making this part optional. Remember that to do so, we need to wrap it in a non-capturing group, written (?: ... ), and then suffix that group with a ? character:

Pattern p = Pattern.compile("(?:http://)?" +
  "([a-z]{2})" +
  "\\.search\\.yahoo\\.com/.*");

Either way, the two-character country code will be captured as group 1. Note that in this case we aren't interested in the search parameters or indeed anything that occurs after the domain name. We match as far as the end of the domain name and the slash (.com/) to be sure that this really is a referral from a Yahoo search engine. But then we simply end the expression in .* to match any subpath and/or parameters of the referrer URL.
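To illustrate reading group 1, here is a minimal sketch built on the version of the pattern with the optional prefix. The class YahooCountryCode, the method countryCode and the choice of Optional as the return type are assumptions made for this example rather than part of the original code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YahooCountryCode {
    // Pattern from the text, with the "http://" prefix made optional
    // via a non-capturing group, so the country code is still group 1.
    private static final Pattern YAHOO = Pattern.compile("(?:http://)?" +
        "([a-z]{2})" +
        "\\.search\\.yahoo\\.com/.*");

    // Hypothetical helper: returns the two-letter prefix if the referrer
    // looks like a Yahoo search URL, or an empty Optional otherwise.
    static Optional<String> countryCode(String referrer) {
        if (referrer == null) {
            return Optional.empty();
        }
        Matcher m = YAHOO.matcher(referrer);
        // matches() must succeed before group(1) can be read.
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Calling countryCode("http://uk.search.yahoo.com/search?p=jacques+chirac") yields an Optional containing "uk", and the same call without the http:// prefix gives the same result. A Google referrer yields an empty Optional, which the caller can treat as "country unknown".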

We'll see on the next page that in the case of parsing the Google referrer string, we may also want to pull out one of the parameters.



Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.