regexp: Extract optional named tokens

Question

Hau Kit Yong on 2 Jul 2019


Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/469930-regexp-extract-optional-named-tokens

Edited: Akira Agata on 3 Jul 2019

trip-data.txt


I would like to extract some information from the following text:

There are 3 groups in the text. I want to extract the genders (enclosed in brackets), the group names (the text following 'Name:') and the student IDs for each group (the numbers following 'ID XX =').

My desired output is as follows:

The issue is that not all groups have a header line (the lines starting with '#'), e.g. for group 3.

My code is as follows:

str = fileread('trip-data.txt');
expr = 'Student group.+?\((?<Gender>\w+?)\).*?Name:(?<Name>.+?)\nGROUP.+?=(?<IDs>.+?(,\s*\n.+?)*)(?=(\n|$))';
groups = regexp(str, expr, 'names');

The returned struct array ignores group 3:

I have also tried wrapping the header pattern in an optional group, i.e. '(...)?', like so:

expr = '(Student group.+?\((?<Gender>\w+?)\).*?Name:(?<Name>.+?))?\nGROUP.+?=(?<IDs>.+?(,\s*\n.+?)*)(?=(\n|$))';

The returned struct array captures the 'IDs' field for all 3 groups, but the 'Gender' and 'Name' fields are empty for every group:

2 Comments

Rik on 2 Jul 2019

Do you absolutely need to use a regexp? Because it might be easier with other tools (if maybe slightly less efficient).

Hau Kit Yong on 2 Jul 2019

I would like to, yes, because the text is a small snippet of a much larger file with varying formats that I am already parsing with other expressions.


Answer 1

Akira Agata on 3 Jul 2019


Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/469930-regexp-extract-optional-named-tokens#answer_381868

Edited: Akira Agata on 3 Jul 2019


How about extracting 'Name', 'Gender' and 'ID' one-by-one?

The following is an example.

% Read the file
% (the patterns below assume Windows-style \r\n line endings)
str = fileread('trip-data.txt');
% Remove newline in ID
str = regexprep(str,'\r\n\s+','');
% Remove newline after 'Name: XX'
str = regexprep(str,'(Name:\s+\w+)\r\n','$1, ');
% Store each line as a cell array
c = strsplit(str,'\r\n')';
% Extract one-by-one
Name = erase(regexp(c,'Name:\s(\w+)','match','once'),'Name: ');
Gender = regexp(c,'(male|female)','match','once');
ID = strtrim(extractAfter(c,'='));
% Summarize as a table
tbl = table(Name,Gender,ID);
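
As a possible alternative that keeps a single expression (a sketch, not a tested solution for the real file): MATLAB's regexp returns only the outermost tokens, so named tokens placed inside a capturing '(...)?' group are not reported. Making the optional header group non-capturing with '(?:...)?' should let 'Gender' and 'Name' come back, and stay empty where the header is missing. The sample string and the simplified pattern below are assumptions for illustration only, since the contents of trip-data.txt are not reproduced here.

```matlab
% Hypothetical sample text; the real trip-data.txt layout may differ.
str = sprintf('# Student group (male) Name: Alpha\nGROUP 1 = 11, 12\nGROUP 2 = 21, 22');
% (?:...)? makes the header optional without turning it into an outer token,
% so the named tokens inside it remain visible to the 'names' output.
header = '(?:\((?<Gender>\w+)\)\s*Name:\s*(?<Name>\w+)\n)?';
body   = 'GROUP[^=\n]*=\s*(?<IDs>[^\n]+)';
groups = regexp(str, [header body], 'names');
% groups is a struct array with one element per GROUP line; Gender and Name
% are expected to be empty for a group that has no header line.
```

The same change could be tried on the original expression by replacing its leading '(' with '(?:', but that has not been checked against the actual data.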