Greedy and Lazy Quantifiers

As we learned from the previous lesson on quantifiers, they're characters to help us select multiple characters in a regular expression. While useful, there are common pitfalls developers can fall into when using them.

The Problem

Let's say we had the following text: "Hello, world".

In English-speaking countries, quotation marks are commonly used, but other countries use something called Guillemet marks. We may want to convert the quotation marks with guillemet marks to produce something like: «Hello, world».

This problem seems like something regular expressions could tackle. We can write a regular expression to find any text wrapped in quotes "..." and change them into «...».

We could use this /".+"/g regular expression to help us solve this issue. Unfortunately, it won't work as expected.

$pattern = '/".+"/';
 
$text = 'A "witch" and her "broom" is one.';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // "witch" and her "broom"

Instead of finding "witch" and "broom", it found "witch" and her "broom". This expression is considered to be greedy because it finds too many results.

Greedy Matching

To understand why this problem occurs, we have to understand how regular expressions work. The engine that powers regular expression goes through a relatively straightforward process to find a match.

First, we must understand the idea of a position. In the context of regular expressions, a "position" refers to a specific spot or location within the text that is being searched. It's like a pointer or marker that indicates where the regular expression engine is currently looking in the text.

For example, consider the text "Hello, world!" Each character, space, or punctuation mark has a unique position:

Position 0: H
Position 1: e
Position 2: l
Position 3: l
Position 4: o

So on and so forth.

When you use a regular expression to search for a pattern in the text, the engine starts at the first position (usually position 0) and then moves through each position, one by one, trying to find a match for the pattern you've defined.

The concept of position is vital for understanding how regular expressions work, as it forms the basis of how the pattern matching progresses through the text.

Now let's try applying the regular expression we've written ".+" against our example.

Step 1

The pattern we've written searches for a quote (").

It starts at the beginning of the text, which is position 0 in our string A "witch" and her "broom" is one.. But at that spot, it finds the letter A, not a quote, so there's no match.

Moving on to the next spot, and then the next, looking for that elusive quote. After two more tries, at the 3rd position in the text, it finally finds what it's looking for: the quote (")!

Step 2

Now that the quote is detected, the engine applies the rest of the pattern against our string. The . character in our regular expressions allows for any character to be selected except for a new line. As a result, the w character matches this criteria.

Step 3

Here's where things become interesting. Since the + quantifier is added after the ., we're telling the engine to repeat the . again for the next character. This process repeats until it reaches the end of the string.

Step 4

Once we reach the end of the string, the .+ portion of the pattern can no longer be applied. However, this creates a dilemma for the engine. It's able to recognize that our pattern hasn't been completely satisfied. It still needs the match to end with a quote (") character.

As a result, the engine performs what's known as backtracking. Backtracking in regular expressions refers to the process where the matching engine reverses its progress to reevaluate previously made decisions. This usually occurs when a pattern seems to match initially but fails later on.

Think of it as a puzzle-solving approach. If you reach a point where you realize that the current path won't lead to a solution, you step back to a previous decision point and try a different path.

In our example, it'll attempt to match the " against a . in our string. These characters don't match. So, it'll go over to the next character.

Step 5

Backtrack until finding a quote character

Backtracking is repeated until the pattern can find a " character. Eventually, it will. As a result, the regular expression finds "witch" and her "broom".

Hopefully, this now starts to make sense why we're receiving this result. By default, regular expressions are in greedy mode. Greedy mode in regular expressions refers to a behavior where the engine tries to match as much of the input as possible. When a quantifier (like *, +, or ?) is used in a pattern, the engine will attempt to repeat the quantified character or group as many times as it can.

Greedy mode is the default behavior in most regex engines, and it can sometimes lead to unexpected results if you're not careful, especially when working with patterns that have optional or repeatable parts.

Lazy Mode

Fortunately, we can change the mode to lazy mode to avoid these types of problems. Lazy mode, also known as non-greedy or reluctant mode in regular expressions, is the opposite of greedy mode. Instead of trying to match as much as possible, it tries to match as little as possible.

In simple terms, greedy mode is like trying to catch as many fish as possible in a net, lazy mode is like trying to catch just one fish at a time.

Lazy mode can be applied to the quantifiers by adding a question mark ? after the quantifier. Here's how it works with common quantifiers:

*?: Matches 0 or more occurrences, as few as possible.
+?: Matches 1 or more occurrences, as few as possible.
??: Matches 0 or 1 occurrence, as few as possible.
{n,}?: Matches at least 'n' occurrences, but as few as possible.
{n,m}?: Matches between 'n' and 'm' occurrences, but as few as possible.

Lazy mode can be incredibly useful when you want precise control over what is matched, especially when working with complex patterns that may have optional or repeatable elements.

Let's try applying lazy mode to our regular expression: ".+?".

In our PHP example, this would become the following:

$pattern = '/".+?"/';
 
$text = 'A "witch" and her "broom" is one.';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // "witch", "broom"

This time, we're able to grab characters wrapped in quotes. Our regular expression is no longer grabbing characters outside of quotes.

Alternative Solution

With regular expressions, it's like having a toolbox with many tools. Often, there are several ways to solve the same problem.

For the problem we faced earlier, we don't have to use lazy mode for this. We can use a pattern like "[^"]+":

$pattern = '/"[^"]+"/';
 
$text = 'A "witch" and her "broom" is one.';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // "witch", "broom"

This pattern, "[^"]+", is a bit like a recipe. It looks for a quote " first, then anything that's NOT a quote [^"], one or more times +, followed by the closing quote.

When the pattern finds the closing quote, it stops and says, "I've got it!" And we get the right answer. But remember, this isn't a replacement for lazy quantifiers. It's just a different tool in the toolbox. Sometimes you'll need one tool, sometimes the other. Just like cooking, the right tool depends on what you're trying to make!

Exercises

Exercise #1

Take the following:

$pattern = '/\d+? \d+?/';
 
$text = '123 456';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]);

What do you think will be found in our pattern?

Exercise #2

Write a regular expression for finding HTML comments.

$pattern = '';
 
$text = '... <!-- My -- comment
 test --> ..  <!----> ..';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]);

Exercise #3

Write a regular expression to find opening and closing HTML tags with attributes. You can use the following to help you test your expression:

$pattern = '';
 
$text = '<> <a href="/"> <input type="radio" checked> <b>';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]);

You can safely assume attributes don't contain the characters < and >.

Key Takeaways

In most regular expression engines, quantifiers are greedy by default.
Common greedy quantifiers include *, +, ?, and {n,m}.
If the rest of the pattern doesn't match, the engine will backtrack and try shorter matches.
Lazy quantifiers try to match as little as possible.
Lazy behavior is invoked by adding a question mark after a quantifier, like *?, +?, {n,m}?.