Why is my regular expression always greedy?
I have the following string, read into MATLAB:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*abcde99999
$eeeeeeeeeeeeeeeeeeeeee
I would like to perform a search that only extracts the text between *aaa and *ddd, using the following regexp pattern:
pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';
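For reference, a minimal sketch of how this pattern might be applied. It assumes the text is held in a char row vector; the names parts, str, and out are hypothetical, and the input is a shortened stand-in for the string above:

```matlab
% Shortened stand-in for the input above; all variable names are assumptions.
parts = {'*aaa', '$bbb', '111', '*ddd', '$111', '222abcdf', '*abcde99999', '$eee'};
str = strjoin(parts, char(10));   % join the pieces with newline characters

pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';

% 'match' returns the matched text; 'once' returns only the first match as char.
out = regexp(str, pattern, 'match', 'once');
disp(out)
```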
I expected the middle (.*|\n)*? to match the minimum number of "either any character other than linebreak, or a linebreak" that sits between *aaa and the closest * symbol, at *ddd. Instead, MATLAB returns the following:
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
Instead of stopping just before *ddd, regexp continued until just before *abcde99999, despite the "?" at the end of the middle section of the pattern.
Just to make sure this isn't a lookaround issue, I also tried running:
pattern = '\*(.*|\n)*?\*';
And sure enough, I get the following, with the *ddd in the middle being skipped entirely:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*
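One likely explanation, offered as a sketch rather than a definitive diagnosis: in MATLAB's regexp, "." matches newline characters by default (the 'dotall' option), so the inner .* is itself greedy and can run to the end of the string on the first pass through the group. The trailing "?" only makes the outer repetition lazy; it does not change the .* inside the parentheses. The example below uses a shortened stand-in for the input (all variable names are assumptions) and compares the original pattern with two alternatives, one that places the lazy quantifier directly on the dot and one that uses a negated character class:

```matlab
% Shortened stand-in for the input; all variable names are assumptions.
parts = {'*aaa', '$bbb', '111', '*ddd', '$111', '222abcdf', '*abcde99999', '$eee'};
str = strjoin(parts, char(10));   % join the pieces with newline characters

greedyInner = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';  % original pattern: inner .* is greedy
lazyDot     = '(?<=\*aaa\s).*?(?=\*)';        % lazy quantifier applied to the dot itself
negClass    = '(?<=\*aaa\s)[^*]*';            % alternative: consume anything but an asterisk

a = regexp(str, greedyInner, 'match', 'once');
b = regexp(str, lazyDot,     'match', 'once');
c = regexp(str, negClass,    'match', 'once');

% Report whether each result runs past the *ddd marker.
fprintf('original pattern includes *ddd: %d\n', ~isempty(strfind(a, '*ddd')));
fprintf('lazy dot includes *ddd:         %d\n', ~isempty(strfind(b, '*ddd')));
fprintf('negated class includes *ddd:    %d\n', ~isempty(strfind(c, '*ddd')));
```

Note that the negated-class version has no lookahead, so it would run to the end of the string if no later asterisk were present; whether that matters depends on the real input.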