How to effectively use look ahead with regexp?

Hi all,
I'm doing some coding with regular expressions, but there are a couple of things I can't understand. Look at the following:
1. searching the letter 'r' followed by a number:
regexp('19f/4r power shift','(?<=\d*) ?r')
ans =
6 12
regexp('19f/4r power shift','(?<=\d)\s?r')
ans =
6
Why does the '*' change the result so much? The 'r' at the 12th position is not followed by any number.
2. Searching for the word 'Reverser' that is not preceded by the words 'power' or 'powr'.
regexp('power Reverser','(?<!powe?r) *-? *Reverser','match')
ans =
' Reverser'
Reverser is preceded by the string 'power', so it shouldn't be selected.
Why do these occur?
Thanks
Best regards,
Pietro

Accepted Answer

Stephen23 on 26 Jun 2017
Edited: Stephen23 on 26 Jun 2017
1. "searching the letter 'r' followed by a number." Actually, you seem to want to search for the letter 'r' preceded by a number, not "followed by" one. Only the second of your regexps does this. By adding the * to the first regexp you make the digits optional (the asterisk matches zero or more times!), so the second r in that short string clearly matches your first regular expression: it is an 'r' preceded by zero spaces (permitted by the ?) and by zero digits (permitted by the *).
You could use + (match one or more) rather than * (match zero or more):
regexp('19f/4r power shift','(?<=\d+)\s?r')
but this is not really necessary: matching one digit is enough because if there are multiple digits then there is also one digit.
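As a minimal sketch of the three quantifier choices discussed above, the following runs all three patterns on the string from the question. The variable names are only illustrative; the first two results are the ones reported in the question.

```matlab
str = '19f/4r power shift';

% Zero or more digits: the lookbehind is satisfied even by zero digits,
% so every 'r' qualifies (the question reports 6 and 12).
idxStar = regexp(str, '(?<=\d*) ?r');

% Exactly one digit must come before the optional whitespace and the 'r'
% (the question reports 6).
idxOne = regexp(str, '(?<=\d)\s?r');

% One or more digits: the same requirement in practice, because a run of
% digits always ends with one digit.
idxPlus = regexp(str, '(?<=\d+)\s?r');

disp(idxStar)
disp(idxOne)
disp(isequal(idxOne, idxPlus))   % do \d and \d+ agree on this string?
```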
2. This is a much more subtle problem. The basic problem here is the optimism of regular expressions, combined with that * on the space character. The regular expression parser keeps on trying new combinations until it finds some way to match, which clearly differs from how you perceive its operation (you want it to quit as soon as that lookaround has rejected one attempt).
The lookaround correctly rejects a match that starts immediately after 'power', but the parser does not stop there: it moves on and tries again, and you have allowed optional spaces with ' *'. When it starts the match at a space character that is itself preceded by another space, the lookaround is satisfied, because what precedes that one space? Another space character! Therefore the lookaround is happy (one space is not equal to 'power'), the rest of the pattern matches, and the parser is happy because it has found a way to match. Therefore it picks this option.
Basically what you seem to want is a pessimistic parser (one that returns no match if any one combination is rejected by the lookaround, even if other combinations are not), but in reality regexp parsers are optimistic: they return a match if any one combination succeeds. They move past the one case that you are interested in because other cases better fulfill their basic operational principle: match as much as possible, however they can.
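To make this concrete, here is a small sketch that also asks regexp for the start index of the match. It assumes the input contains two spaces between the words, which is what the leading space in the reported match suggests; that input string is an assumption, not something confirmed in the question.

```matlab
% Assumption: two spaces between 'power' and 'Reverser'.
str  = 'power  Reverser';
expr = '(?<!powe?r) *-? *Reverser';

% Outputs are returned in the order the options are listed.
[m, s] = regexp(str, expr, 'match', 'start');

% Position 6 (first space) is directly preceded by 'power', so the
% lookbehind rejects a match starting there.
% Position 7 (second space) is preceded by a space, so the lookbehind is
% satisfied and the rest of the pattern matches from there.
disp(m)   % the matched text, including one leading space
disp(s)   % start index of the match
```

Under that assumption the match begins at the second space, which is consistent with the ' Reverser' output shown in the question.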
To see what parts of the strings are matched you should look at using a dynamic regular expression, e.g. adding:
(?@disp($1))
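into your regexp and seeing how the string is parsed. Note that $1 refers to the first token, so the part that you want to display must be inside a capturing group.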
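A possible way to try this on the second regexp is sketched below. The capturing parentheses are an addition for illustration (so that $1 has something to refer to), and the input with two spaces is an assumption.

```matlab
% Assumption: two spaces between the words (hypothetical input).
str = 'power  Reverser';

% The part after the lookbehind is wrapped in a capturing group so that
% $1 holds it; (?@disp($1)) then displays that token each time the
% parser gets past the group.
expr = '(?<!powe?r)( *-? *Reverser)(?@disp($1))';

m = regexp(str, expr, 'match');
```

Moving the (?@disp($1)) to different places in the expression, or capturing a smaller part, shows which pieces have been consumed at that point.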
Do you really need to match an unknown number of space characters?
  2 Comments
Stephen23 on 27 Jun 2017
Edited: Stephen23 on 27 Jun 2017
You could move the space inside the lookaround:
>> regexp('power Reverser','(?<!powe?r *)Reverser','match')
ans =
{}
>> regexp('power X Reverser','(?<!powe?r *)Reverser','match')
ans =
'Reverser'
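A short sketch for checking this pattern against several inputs at once. The first and third strings are from the examples above; the other two are hypothetical test inputs added for illustration.

```matlab
expr   = '(?<!powe?r *)Reverser';
inputs = {'power Reverser', 'powr   Reverser', 'power X Reverser', 'Reverser'};

found = false(size(inputs));
for k = 1:numel(inputs)
    % With 'once', regexp returns the start index of the first match,
    % or an empty array if there is no match.
    found(k) = ~isempty(regexp(inputs{k}, expr, 'once'));
    fprintf('%-20s -> %d\n', inputs{k}, found(k));
end
```

Because the spaces are now part of the lookbehind, any number of spaces after 'power' or 'powr' is covered by the same check.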
