Repetition operators (ctd): greedy and reluctant operators
A problem that you'll come across sooner or later with repetition operators in regular expressions arises when an expression contains several operators that can each match a variable number of characters. In such cases there is potential ambiguity as to which operator matches what.
We've actually already met such an example. Recall our expression to match a string containing ten digits, which we composed as follows:
.*[0-9]{10}.*
Now at this point, it may occur to you that .* can match "any sequence of any
character". So if we have a string, say, aab0123456789, why doesn't the initial
.* "swallow up" the entire string in one go, preventing a match?
The answer comes in the form of the following matching rules:
- operators match from left to right;
- repetition operators are greedy: they match as many characters as they can;
- but, operators are not allowed to prevent a match if one is possible.
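The third rule is easy to check for ourselves. The following is a minimal sketch, assuming the ten-digit expression is .*[0-9]{10}.* and that we test it with String.matches(); the class and method names are made up for illustration.

```java
class GreedyRulesDemo {
    // Returns true if the whole string matches the ten-digit expression.
    // The expression is assumed to be the one discussed in the text.
    static boolean containsTenDigits(String s) {
        return s.matches(".*[0-9]{10}.*");
    }

    public static void main(String[] args) {
        // The greedy .* gives characters back so that [0-9]{10} can match.
        System.out.println(containsTenDigits("aab0123456789")); // true
        // Only nine digits here, so no amount of giving back helps.
        System.out.println(containsTenDigits("aab012345678"));  // false
    }
}
```

If the initial .* really did swallow the whole string and refuse to give anything back, the first call would return false.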
So, let's look at what this means. Suppose we match a string that begins as follows against
the above expression:
aax11012345678922
The string contains a total of 14 digits: the string 0123456789, with
11 and 22 either side. So what happens when we come to match?
Well, going from left to right, the first 'item' in the expression is .*.
How many characters does it match? As many as it can without preventing the
other parts from matching. So the first .* matches everything up to the
end of the digits, minus the ten that [0-9]{10} requires in
order to match (here, aax1101). Then [0-9]{10} takes its ten digits, 2345678922
in the above string. Finally, the second .* matches whatever remains of the string.
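To see this for ourselves, we can peek ahead slightly and wrap the middle item in brackets so that Java tells us which digits it matched. This is only a sketch: it assumes the test string is exactly aax11012345678922 with nothing after the digits, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the ten digits matched by the middle item, or null if no match.
    static String tenDigits(String s) {
        // The brackets mark the part we want to read back as group 1.
        Matcher m = Pattern.compile(".*([0-9]{10}).*").matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed test string: aax, then 11, 0123456789 and 22.
        System.out.println(tenDigits("aax11012345678922")); // 2345678922
    }
}
```

The greedy first .* has taken aax1101, leaving only the last ten digits for the middle item.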
So what if we wanted [0-9]{10} to match the
first ten digits instead? One way is to transform the "greedy" .*
into a so-called reluctant operator.
Turning greed into reluctance
A reluctant operator matches against as few
characters as it can, while still letting the rest of the expression match if
it can. To make an operator reluctant, we add a question mark
after the operator. So the following expression:
.*?[0-9]{10}.*
means that the first .* matches against as few characters as it can,
whilst still allowing [0-9]{10} to match against ten digits, and still
allowing the final .* to match against "any sequence". The smallest number of
characters that .*? can match whilst leaving ten subsequent
digits is the sequence aax; the digits matched by the middle element
are then 1101234567.
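Here is a small sketch of the reluctant version .*?[0-9]{10}.*, again using brackets to read back what each item matched. It assumes the same test string, aax11012345678922, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns what each of the three items matched, or null if no match.
    static String[] parts(String s) {
        Matcher m = Pattern.compile("(.*?)([0-9]{10})(.*)").matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("aax11012345678922"); // assumed test string
        System.out.println(p[0]); // aax
        System.out.println(p[1]); // 1101234567
        System.out.println(p[2]); // 8922
    }
}
```

The only change from the greedy expression is the added question mark, yet the middle item now receives the first ten digits rather than the last ten.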
Alternative to reluctant operators
In some cases, a clearer alternative to reluctant operators
is to replace the dot with a more exact character class. For example, we could
write the following:
[^0-9]*[0-9]{10}.*
Recall from our section on character
classes that [^0-9] means "any character that isn't a digit".
So we match any sequence of non-digits, followed by ten digits
(in effect, the first ten in the string), followed by the rest of the string.
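As a rough illustration, the sketch below applies the character class version [^0-9]*[0-9]{10}.* to the same assumed test string, aax11012345678922. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonDigitDemo {
    // Returns the ten digits matched by the middle item, or null if no match.
    static String firstTen(String s) {
        Matcher m = Pattern.compile("[^0-9]*([0-9]{10}).*").matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTen("aax11012345678922")); // 1101234567
        // A stray digit before the run of ten stops [^0-9]* from reaching it.
        System.out.println(firstTen("a1b0123456789"));     // null
    }
}
```

Note that the two approaches are not interchangeable: the character class version only matches when the very first run of digits in the string contains at least ten, whereas the reluctant .*? can skip over an earlier, shorter run of digits.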
Why do I need to know which part matches what?
Controlling which part of the expression matches which part of the string
is important when you use a feature called capturing.
To understand capturing, we need to start by looking at how to
use two explicit classes to control regular expression matching: the
Pattern and Matcher classes.
Written by Neil Coffey. Copyright © Javamex UK 2008. All rights reserved.