MATLAB Answers

regexp: what am I missing from the documentation?

I have tried to carefully read the regexp documentation, and I am able to successfully implement regexp in the simplest cases. For example, given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
I can use the following code to retrieve each of the separate names, without the trailing numeral and/or whitespace:
exp = '\w*[^1-9\s]';
MyMatch = regexp(test, exp, 'match')
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Dongo'} {'Chloe'} {'Billgo'}
Columns 7 through 8
{'Marie'} {'Aaron'}
However, despite much effort, I cannot achieve a more complex result (example provided below). I try to limit the number of questions I post to the community, but here is a situation where I ask if the experts can point to where I am erring in my use of regexp to give a (slightly more complex) result. Note that this is not a specific problem I am trying to solve. I merely invented a 'random' problem in an effort to become more adept in my use of regexp.
For the following example, assume that all name instances in a character vector test have one of two possible problems.
  1. A single digit immediately follows the name (e.g., James7)
  2. The name has 'go' appended to its end.
NB: We know in advance there are no name instances in test that would require us to consider the possibility that 'go' is just the natural ending of a name instance (e.g., Hugogo).
Thus, given the character vector:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
The desired output is:
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Don'} {'Chloe'} {'Bill'}
Columns 7 through 8
{'Marie'} {'Aaron'}
Examples of attempted (and failed) solutions:
% Given the documentation's statement, "If you specify a lookahead assertion before an expression,
% the operation is equivalent to a logical AND."
MyMatch = regexp(test, '(?<=\w*[^*go\s)\w*[^1-9\s]', 'match')
% Attempts to implement 'OR' logic: (exp|exp)
% (1)
[tok, mat] = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'tokens', 'match');
vertcat(tok{:}) % then extract col1
% (2)
[tok, mat] = regexp(test, '((\w+)([^*go\s]))|((\w+)([^1-9\s]))', 'tokens', 'match')
vertcat(tok{:}) % then extract col1
% ...
And so on and so forth...
  1. What is your approach/solution (using regexp) to the above? Is it better to take a multipronged approach (e.g., convert to a cell array first, use two regexp calls, etc.)?
  2. What is your approach/solution (using regexp) given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo' % note Hugogo
% we want the 'MyMatch' or 'MyTokens' cell array to contain 'Hugo'
Thanks for your time, and Happy New Year!
Sincerely,
Ray

  4 Comments

Raymond MacNeil on 28 Dec 2019
@dpb
I hear you. I really think MATLAB's documentation of regexp could benefit mightily from additional examples that involve more nuanced problems. And for some of the examples that are provided, we lack complete context, and therefore we have to make a number of assumptions which may or may not hold for the particular problem we are trying to tackle.
Cheers,
Ray
Stephen Cobeldick on 28 Dec 2019
The regexp documentation's focus is on the function rather than the regular expression syntax. For more detailed explanations of the syntax see:
You might also like to download my FEX submission iregexp, which creates an interactive figure for trying different regular expressions and parse strings, and seeing regexp's outputs:
Raymond MacNeil on 28 Dec 2019
Thanks, Stephen. I have previously examined these additional pages, but I should probably dig into them more.


Accepted Answer

Stephen Cobeldick on 28 Dec 2019
Edited: Stephen Cobeldick on 28 Dec 2019
A direct interpretation of your description "assume that all name instances in a character vector test have one of two possible problems. 1. A single digit immediately follows the name (e.g., James7) 2. The name has 'go' appended to its end." is to use one lookahead assertion:
>> test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';
>> regexp(test,'\w+(?=(\d|go)\>)','match')
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
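A minimal sketch of why this pattern behaves differently from the bracket-based attempts in the question. Two points are illustrated: a bracket expression such as [^go] is a set of single characters rather than the sequence 'go', and a lookahead only checks the suffix without consuming it. The variable names (singleChars, consumed, names) are illustrative only.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% A bracket expression is a set of single characters, not a sequence:
% [^go] matches any ONE character that is neither 'g' nor 'o'.
singleChars = regexp('Dongo', '[^go]', 'match');    % {'D','n'}

% Without the lookahead, the suffix (digit or 'go') is consumed,
% so it remains part of each returned match.
consumed = regexp(test, '\w+(\d|go)\>', 'match');

% With the lookahead, the suffix is only tested for, so the match
% stops just before it.
names = regexp(test, '\w+(?=(\d|go)\>)', 'match');
```

Because \w+ is greedy, it first grabs the whole word and then backtracks until the lookahead succeeds, which is why 'Hugogo' gives 'Hugo' and not 'Hu'.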
Or similarly using a non-captured token:
>> tkn = regexpi(test,'(\w+)(?:\d|go)\>','tokens');
>> [tkn{:}]
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
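For comparison, the multi-step route raised in the question can be sketched as below. This is only one possible arrangement, assuming the names are separated by whitespace and each carries exactly one unwanted suffix; the variable names (words, names) are illustrative.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Step 1: split the character vector into a cell array of words
% (strsplit splits on whitespace by default).
words = strsplit(test);

% Step 2: strip a single trailing digit or a trailing 'go' from each word.
% The $ anchor ties the alternation to the end of each word, so only the
% final 'go' of 'Hugogo' is removed.
names = regexprep(words, '(\d|go)$', '');
```

This trades one slightly denser expression for two simpler steps; the single regexp call with a lookahead avoids the intermediate cell array.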
