Dynamic Regular Expressions

Introduction

In a dynamic expression, you can make the pattern that you want regexp to match dependent on the content of the input text. In this way, you can more closely match varying input patterns in the text being parsed. You can also use dynamic expressions in replacement terms for use with the regexprep function. This gives you the ability to adapt the replacement text to the parsed input.

You can include any number of dynamic expressions in the match_expr or replace_expr arguments of these commands:

regexp(text, match_expr)
regexpi(text, match_expr)
regexprep(text, match_expr, replace_expr)

As an example of a dynamic expression, the following regexprep command correctly replaces the term internationalization with its abbreviated form, i18n. However, to use it on a different term such as globalization, you have to use a different replacement expression:

match_expr = '(^\w)(\w*)(\w$)';

replace_expr1 = '$118$3';
regexprep('internationalization', match_expr, replace_expr1)

ans =

    'i18n'

replace_expr2 = '$111$3';
regexprep('globalization', match_expr, replace_expr2)

ans =

    'g11n'

Using a dynamic expression ${num2str(length($2))} enables you to base the replacement expression on the input text so that you do not have to change the expression each time. This example uses the dynamic replacement syntax ${cmd}.

match_expr = '(^\w)(\w*)(\w$)';
replace_expr = '$1${num2str(length($2))}$3';

regexprep('internationalization', match_expr, replace_expr)

ans =

    'i18n'

regexprep('globalization', match_expr, replace_expr)

ans =

    'g11n'

When parsed, a dynamic expression must correspond to a complete, valid regular expression. In addition, dynamic match expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of the expression, and one for the complete match. The parentheses that enclose dynamic expressions do not create a capturing group.

There are three forms of dynamic expressions that you can use in match expressions, and one form for replacement expressions, as described in the following sections

Dynamic Match Expressions — (??expr)

The (??expr) operator parses expression expr, and inserts the results back into the match expression. MATLAB^® then evaluates the modified match expression.

Here is an example of the type of expression that you can use with this operator:

chr = {'5XXXXX', '8XXXXXXXX', '1X'};
regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once');

The purpose of this particular command is to locate a series of X characters in each of the character vectors stored in the input cell array. Note however that the number of Xs varies in each character vector. If the count did not vary, you could use the expression X{n} to indicate that you want to match n of these characters. But, a constant value of n does not work in this case.

The solution used here is to capture the leading count number (e.g., the 5 in the first character vector of the cell array) in a token, and then to use that count in a dynamic expression. The dynamic expression in this example is (??X{$1}), where $1 is the value captured by the token \d+. The operator {$1} makes a quantifier of that token value. Because the expression is dynamic, the same pattern works on all three of the input vectors in the cell array. With the first input character vector, regexp looks for five X characters; with the second, it looks for eight, and with the third, it looks for just one:

regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once')

ans =

  1×3 cell array

    {'5XXXXX'}    {'8XXXXXXXX'}    {'1X'}

Commands That Modify the Match Expression — (??@cmd)

MATLAB uses the (??@cmd) operator to include the results of a MATLAB command in the match expression. This command must return a term that can be used within the match expression.

For example, use the dynamic expression (??@fliplr($1)) to locate a palindrome, “Never Odd or Even”, that has been embedded into a larger character vector.

First, create the input string. Make sure that all letters are lowercase, and remove all nonword characters.

chr = lower(...
  'Find the palindrome Never Odd or Even in this string');

chr = regexprep(chr, '\W*', '')

chr =

    'findthepalindromeneveroddoreveninthisstring'

Locate the palindrome within the character vector using the dynamic expression:

palindrome = regexp(chr, '(.{3,}).?(??@fliplr($1))', 'match')

palindrome =

  1×1 cell array

    {'neveroddoreven'}

The dynamic expression reverses the order of the letters that make up the character vector, and then attempts to match as much of the reversed-order vector as possible. This requires a dynamic expression because the value for $1 relies on the value of the token (.{3,}).

Dynamic expressions in MATLAB have access to the currently active workspace. This means that you can change any of the functions or variables used in a dynamic expression just by changing variables in the workspace. Repeat the last command of the example above, but this time define the function to be called within the expression using a function handle stored in the base workspace:

fun = @fliplr;

palindrome = regexp(chr, '(.{3,}).?(??@fun($1))', 'match')

palindrome =

  1×1 cell array

    {'neveroddoreven'}

Commands That Serve a Functional Purpose — (?@cmd)

The (?@cmd) operator specifies a MATLAB command that regexp or regexprep is to run while parsing the overall match expression. Unlike the other dynamic expressions in MATLAB, this operator does not alter the contents of the expression it is used in. Instead, you can use this functionality to get MATLAB to report just what steps it is taking as it parses the contents of one of your regular expressions. This functionality can be useful in diagnosing your regular expressions.

The following example parses a word for zero or more characters followed by two identical characters followed again by zero or more characters:

regexp('mississippi', '\w*(\w)\1\w*', 'match')

ans =

  1×1 cell array

    {'mississippi'}

To track the exact steps that MATLAB takes in determining the match, the example inserts a short script (?@disp($1)) in the expression to display the characters that finally constitute the match. Because the example uses greedy quantifiers, MATLAB attempts to match as much of the character vector as possible. So, even though MATLAB finds a match toward the beginning of the string, it continues to look for more matches until it arrives at the very end of the string. From there, it backs up through the letters i then p and the next p, stopping at that point because the match is finally satisfied:

regexp('mississippi', '\w*(\w)(?@disp($1))\1\w*', 'match')

i
p
p

ans =

  1×1 cell array

    {'mississippi'}

Now try the same example again, this time making the first quantifier lazy (*?). Again, MATLAB makes the same match:

regexp('mississippi', '\w*?(\w)\1\w*', 'match')

ans =

  1×1 cell array

    {'mississippi'}

But by inserting a dynamic script, you can see that this time, MATLAB has matched the text quite differently. In this case, MATLAB uses the very first match it can find, and does not even consider the rest of the text:

regexp('mississippi', '\w*?(\w)(?@disp($1))\1\w*', 'match')

m
i
s

ans =

  1×1 cell array

    {'mississippi'}

To demonstrate how versatile this type of dynamic expression can be, consider the next example that progressively assembles a cell array as MATLAB iteratively parses the input text. The (?!) operator found at the end of the expression is actually an empty lookahead operator, and forces a failure at each iteration. This forced failure is necessary if you want to trace the steps that MATLAB is taking to resolve the expression.

MATLAB makes a number of passes through the input text, each time trying another combination of letters to see if a fit better than last match can be found. On any passes in which no matches are found, the test results in an empty character vector. The dynamic script (?@if(~isempty($&))) serves to omit the empty character vectors from the matches cell array:

matches = {};
expr = ['(Euler\s)?(Cauchy\s)?(Boole)?(?@if(~isempty($&)),' ...
   'matches{end+1}=$&;end)(?!)'];

regexp('Euler Cauchy Boole', expr);

matches

matches =

  1×6 cell array

    {'Euler Cauchy Bo…'}    {'Euler Cauchy '}    {'Euler '}    {'Cauchy Boole'}    {'Cauchy '}    {'Boole'}

The operators $& (or the equivalent $0), $`, and $' refer to that part of the input text that is currently a match, all characters that precede the current match, and all characters to follow the current match, respectively. These operators are sometimes useful when working with dynamic expressions, particularly those that employ the (?@cmd) operator.

This example parses the input text looking for the letter g. At each iteration through the text, regexp compares the current character with g, and not finding it, advances to the next character. The example tracks the progress of scan through the text by marking the current location being parsed with a ^ character.

(The $` and $´ operators capture that part of the text that precedes and follows the current parsing location. You need two single-quotation marks ($'') to express the sequence $´ when it appears within text.)

chr = 'abcdefghij';
expr = '(?@disp(sprintf(''starting match: [%s^%s]'',$`,$'')))g';

regexp(chr, expr, 'once');

starting match: [^abcdefghij]
starting match: [a^bcdefghij]
starting match: [ab^cdefghij]
starting match: [abc^defghij]
starting match: [abcd^efghij]
starting match: [abcde^fghij]
starting match: [abcdef^ghij]

Commands in Replacement Expressions — ${cmd}

The ${cmd} operator modifies the contents of a regular expression replacement pattern, making this pattern adaptable to parameters in the input text that might vary from one use to the next. As with the other dynamic expressions used in MATLAB, you can include any number of these expressions within the overall replacement expression.

Commands in replacement expressions only check the local workspace for variables. Caller and global workspaces are not available to commands in replacement expressions

In the regexprep call shown here, the replacement pattern is '${convertMe($1,$2)}'. In this case, the entire replacement pattern is a dynamic expression:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}');

The dynamic expression tells MATLAB to execute a function named convertMe using the two tokens (\d+\.?\d*) and (\w+), derived from the text being matched, as input arguments in the call to convertMe. The replacement pattern requires a dynamic expression because the values of $1 and $2 are generated at runtime.

The following example defines the file named convertMe that converts measurements from imperial units to metric.

function valout  = convertMe(valin, units)
switch(units)
    case 'inches'
        fun = @(in)in .* 2.54;
        uout = 'centimeters';
    case 'miles'
        fun = @(mi)mi .* 1.6093;
        uout = 'kilometers';
    case 'pounds'
        fun = @(lb)lb .* 0.4536;
        uout = 'kilograms';
    case 'pints'
        fun = @(pt)pt .* 0.4731;
        uout = 'litres';
    case 'ounces'
        fun = @(oz)oz .* 28.35;
        uout = 'grams';
end
val = fun(str2num(valin));
valout = [num2str(val) ' ' uout];
end

At the command line, call the convertMe function from regexprep, passing in values for the quantity to be converted and name of the imperial unit:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')

ans =

    'This highway is 201.1625 kilometers long'

regexprep('This pitcher holds 2.5 pints of water', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')

ans =

    'This pitcher holds 1.1828 litres of water'

regexprep('This stone weighs about 10 pounds', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')

ans =

    'This stone weighs about 4.536 kilograms'

As with the (??@ ) operator discussed in an earlier section, the ${ } operator has access to variables in the currently active workspace. The following regexprep command uses the array A defined in the base workspace:

A = magic(3)

A =

     8     1     6
     3     5     7
     4     9     2

regexprep('The columns of matrix _nam are _val', ...
          {'_nam', '_val'}, ...
          {'A', '${sprintf(''%d%d%d '', A)}'})

ans =

    'The columns of matrix A are 834 159 672'