Lookahead Assertions in Regular Expressions

Lookahead Assertions

There are two types of lookaround assertions for regular expressions: lookahead and lookbehind. In both cases, the assertion is a condition that must be satisfied for the expression to return a match.

A lookahead assertion has the form (?=test) and can appear anywhere in a regular expression. MATLAB® looks ahead of the current location in the text for the test condition. If MATLAB matches the test condition, it continues processing the rest of the expression to find a match.

For example, look ahead in a character vector specifying a path to find the name of the folder that contains a program file (in this case, fileread.m).

chr = which('fileread')
chr =

    'matlabroot\toolbox\matlab\iofun\fileread.m'

regexp(chr,'\w+(?=\\\w+\.[mp])','match')
ans =

  1×1 cell array

    {'iofun'}

The match expression, \w+, searches for one or more alphanumeric or underscore characters. Each time regexp finds a term that matches this condition, it looks ahead for a backslash (specified with two backslashes, \\), followed by a file name (\w+) with an .m or .p extension (\.[mp]). The regexp function returns the match that satisfies the lookahead condition, which is the folder name iofun.
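To see what the lookahead contributes, it can help to run the same pattern with and without it. The following is a minimal sketch that assumes a hypothetical, hard-coded Windows-style path (so the result does not depend on where MATLAB is installed); the variable names p, folder, and whole are illustrative only.

```matlab
% Hypothetical fixed path, used instead of which('fileread') so the
% result does not depend on the local installation.
p = 'C:\toolbox\matlab\iofun\fileread.m';

% With lookahead: '\fileread.m' is only tested, so the match is the
% folder name alone.
folder = regexp(p, '\w+(?=\\\w+\.[mp])', 'match');

% Without lookahead: the same text is consumed, so the file name
% becomes part of the match.
whole = regexp(p, '\w+\\\w+\.[mp]', 'match');

disp(folder)
disp(whole)
```

Under these assumptions, folder contains only 'iofun', while whole contains 'iofun\fileread.m'. The lookahead checks the text that follows without including it in the result.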

Overlapping Matches

Lookahead assertions do not consume any characters in the text. As a result, you can use them to find overlapping character sequences.

For example, use lookahead to find every sequence of six nonwhitespace characters in a character vector by matching initial characters that precede five additional characters:

chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S(?=\S{5})')
startIndex =

     1     8     9    16    17    24    25

The starting indices correspond to these phrases:

Locate   severa   everal   6-char   -char.   phrase   hrases

Without the lookahead operator, MATLAB parses a character vector from left to right, consuming the vector as it goes. When it finds matching characters, regexp records the location and resumes parsing immediately after the end of that match. Because matched characters are never examined again, the matches cannot overlap.

chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S{6}')
startIndex =

     1     8    16    24

The starting indices correspond to these phrases:

Locate   severa   6-char   phrase
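The phrases listed above can be recovered from the starting indices by slicing six characters at each one. This sketch reuses the character vector from the example; the arrayfun slicing step and the variable names phrases and nonOverlap are additions for illustration, not part of the original example.

```matlab
chr = 'Locate several 6-char. phrases';

% Start of every run of six nonwhitespace characters. The lookahead
% consumes nothing, so consecutive starts can overlap.
startIndex = regexp(chr, '\S(?=\S{5})');

% Slice six characters from each start index to show the phrases.
phrases = arrayfun(@(k) chr(k:k+5), startIndex, 'UniformOutput', false);

% For comparison, consuming six characters per match yields only
% non-overlapping phrases.
nonOverlap = regexp(chr, '\S{6}', 'match');

disp(phrases)
disp(nonOverlap)
```

Each lookahead match consumes a single character, so the search resumes one position later rather than six.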

Logical AND Conditions

Another way to use a lookahead operation is to perform a logical AND between two conditions. This example initially attempts to locate all lowercase consonants in a character vector consisting of the first 50 characters of the help for the normest function:

helptext = help('normest');
chr = helptext(1:50)
chr =

    ' NORMEST Estimate the matrix 2-norm.
         NORMEST(S'

Merely searching for non-vowels ([^aeiou]) does not return the expected answer, as the output includes capital letters, space characters, and punctuation:

c = regexp(chr,'[^aeiou]','match')
c =

  1×43 cell array

  Columns 1 through 14

    {' '}    {'N'}    {'O'}    {'R'}    {'M'}    {'E'}    {'S'}    {'T'}    {' '}    {'E'}    {'s'}    {'t'}    {'m'}    {'t'}

  Columns 15 through 28

    {' '}    {'t'}    {'h'}    {' '}    {'m'}    {'t'}    {'r'}    {'x'}    {' '}    {'2'}    {'-'}    {'n'}    {'r'}    {'m'}

  Columns 29 through 42

    {'.'}    {'↵'}    {' '}    {' '}    {' '}    {' '}    {'N'}    {'O'}    {'R'}    {'M'}    {'E'}    {'S'}    {'T'}    {'('}

  Column 43

    {'S'}

Try this again, using a lookahead operator to create the following AND condition:

(lowercase letter) AND (not a vowel)

This time, the result is correct:

c = regexp(chr,'(?=[a-z])[^aeiou]','match')
c =

  1×13 cell array

    {'s'}    {'t'}    {'m'}    {'t'}    {'t'}    {'h'}    {'m'}    {'t'}    {'r'}    {'x'}    {'n'}    {'r'}    {'m'}

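Note that when you use a lookahead operator to perform an AND, you must place the match expression expr after the test expression test, so that both apply to the same character. The negative form, (?!test), succeeds only when test does not match at the current location: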

(?=test)expr or (?!test)expr
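The effect of the ordering can be checked on a short input. This is a minimal sketch that assumes a hypothetical sample string rather than the normest help text; the variable names pos, neg, and misplaced are illustrative.

```matlab
% Hypothetical short sample, chosen so the results are easy to check by hand.
chr = 'Estimate the 2-norm.';

% Positive lookahead: (lowercase letter) AND (not a vowel).
pos = regexp(chr, '(?=[a-z])[^aeiou]', 'match');

% Negative lookahead: (not a vowel) AND (lowercase letter).
neg = regexp(chr, '(?![aeiou])[a-z]', 'match');

% Test placed AFTER the match expression: it now constrains the next
% character, so this finds non-vowels followed by a lowercase letter.
misplaced = regexp(chr, '[^aeiou](?=[a-z])', 'match');

disp(pos)
disp(neg)
disp(misplaced)
```

With this input, pos and neg both return the lowercase consonants, whereas misplaced also returns characters such as 'E', the space before "the", and the hyphen, because the test no longer applies to the matched character itself.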
