Can REGEXP map values from different parts of a text file?

I have a text file with the following contents:
MSNout_BER (0:31) Observation #100 Rx'd at: (58568.000) Msg. Time: (58568.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rel Mode: Active
MSNout_SSS (0:32) Observation #101 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IRU: true Rcv Date: 2010121 Synch: a0a0 Bel Mode: High
Type: 12 Malck ID: 12345 Time Tag: 58548.12345678
Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0
MSNout_BER (0:33) Observation #102 Rx'd at: (58570.000) Msg. Time: (58570.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rel Mode: Active
MSNout_SSS (0:34) Observation #103 Rx'd at: (58571.000) Msg. Time: (58571.000)
Forward to IRU: true Rcv Date: 2010121 Synch: a0a0 Bel Mode: High
Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58550.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58551.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58552.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58553.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58554.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58555.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58556.12345678
Hand ID: 1 SV ID: 3 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
I’m using the following commands to retrieve the values for the Time Tag: and SV ID: (values 1 and 2 only, all others are ignored):
[fn,pn] = uigetfile('*.txt','Select Text File');
OAMfilename = fullfile(pn, fn);
buffer = fileread(OAMfilename);
pattern = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
tokens = regexp(buffer, pattern, 'tokens');
data = reshape(str2double([tokens{:}]), 2, []).';
Results:
58548.1234567800 2
58550.1234567800 2
58551.1234567800 2
58552.1234567800 2
58553.1234567800 1
58554.1234567800 1
58555.1234567800 1
Initially, I thought the results were as expected. Then I noticed the time tag for the first occurrence of SV ID equal to 2 was wrong - 58549.12345678 is the proper time tag.
Is it possible to force MATLAB to recognize each Time Tag value that occurs just prior to each SV ID value? Could a Lookaround operator be used in this case?

 Accepted Answer

This seems to work.
buf = fileread( 'cssm.txt' );
rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
cac = regexp( buf, rex, 'tokens' );
cac{:}
returns
ans =
'58548.12345678' '51'
ans =
'58549.12345678' '2'
ans =
'58550.12345678' '2'
ans =
'58551.12345678' '2'
ans =
'58552.12345678' '2'
ans =
'58553.12345678' '1'
ans =
'58554.12345678' '1'
ans =
'58555.12345678' '1'
ans =
'58556.12345678' '3'
where cssm.txt contains your data.
Comments on the regular expression:
  • The parentheses capture tokens.
  • Each token is the group of digits (and dots) that follows an identifier and a space.
  • The "identifier and space" parts are used as expressions in lookbehind operators.
  • Thus there are two groups of the form (?<=name)(value).
  • Between these two groups sits .+?, which is a lazy quantifier. It matches one or more characters, but only as many as necessary.
  • The regular expression must match one substring, so something is needed to match the characters between the two groups and join them into one substring. In this case that is done by .+?.
Most of the italic words are copy&paste from the on-line help.
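A minimal sketch of the lazy-versus-greedy difference, run on a made-up one-line string rather than the real file (the sample text and the variable names are assumptions for illustration only):

```matlab
% Two hypothetical records on one line (not taken from the real file).
txt = 'Time Tag: 1.5 Hand ID: 1 SV ID: 7 Time Tag: 2.5 Hand ID: 1 SV ID: 9';

% Lazy .+? stops at the first SV ID that follows each Time Tag.
lazy = regexp(txt, '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)', 'tokens');

% Greedy .+ runs to the end and backtracks, so it pairs the first
% Time Tag with the last SV ID and yields a single match.
greedy = regexp(txt, '(?<=Time Tag: )([\d\.]+).+(?<=SV ID:[ ]+)(\d+)', 'tokens');

lazy{:}
greedy{:}
```

With the lazy version each time tag should be paired with its own SV ID; with the greedy version the two records collapse into one match.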
BTW: Your pattern works - after a little fixing:
rex = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([125]{1,2})\W';
but what is the purpose of the leading *? and the trailing \W ?
A bit more robust:
rex = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)';
  • Replacing \s+ between name and value by [ ]+ excludes new-line, tab, etc.
  • Replacing .+? between the two name-value-pairs by [^\n]+? keeps the match from running across a new-line.
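The effect of swapping \s+ for [ ]+ can be checked on a small made-up string in which the value sits on the line after its name (the string and variable names are hypothetical, chosen only to illustrate the point):

```matlab
% Hypothetical text: the name and its value are separated by a new-line.
txt = sprintf('Time Tag:\n58548.5');

% \s+ accepts any white space, including the new-line, so this matches.
a = regexp(txt, 'Time Tag:\s+([\d\.]+)', 'tokens');

% [ ]+ accepts only blanks, so the new-line blocks the match.
b = regexp(txt, 'Time Tag:[ ]+([\d\.]+)', 'tokens');

a{:}
isempty(b)
```

This is why the stricter pattern only works when name and value really are on the same line of the file.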

9 Comments

Per: Agreed, this approach does work. But how do I interpret the expression? Is it simply 2 lookbehind operators? What is the purpose of the 2nd ? in this expression?
Per, I'm just now getting back to this. I noticed that while the answer you provide does work, I'm having a tough time getting the last piece of the puzzle in place - retrieving the values for the Time Tag: and SV ID: (values 1 and 2 only, all others are ignored). I've tried the following expressions, but manage to keep getting the same result:
%exp = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)'; %Maps Time Tags with all SV IDs
%exp = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W'; %Doesn't map the 1st Time Tag and SV ID 2
%exp = '(?<=Time Tag: )([\d\.]+).*?SV ID:\s+([12])\W'; % Same as above even without the\W
%exp = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12]{1,2})\W'; % Same as above
%exp = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)'; %Doesn't work
Any ideas on how to do this?
  • There is seldom one unique regular expression that works. On the contrary, many more or less comprehensible ones work.
  • "retrieving the values for the Time Tag: and SV ID: (values 1 and 2 only, all others are ignored)." Isn't that what my solution does? I don't understand what you are trying to achieve.
  • I'm not sure where exactly the line breaks are in your file.
  • Is this what you are looking for? I have only replaced the last (\d+) by ([12]). ([12])\W would be better.
rex = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+([12])\W';
cac{:} now returns
ans =
'58549.12345678' '2'
ans =
'58550.12345678' '2'
ans =
'58551.12345678' '2'
ans =
'58552.12345678' '2'
ans =
'58553.12345678' '1'
ans =
'58554.12345678' '1'
ans =
'58555.12345678' '1'
Why not simply
'([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)'
as a pattern? It would have the advantage of not including any lookbehind/lookahead.
Per, I may have an issue with my PC or MATLAB. I tried using your code and still can't get the same result you show above. What you list above is exactly what I'm trying to do - display only the time tag and SV IDs for values of 1 and 2. But each time I execute the code, cac is being shown as a < 0 x 0 cell> in the workspace.
Cedric, your suggested expression does work. It properly maps the time tags to all of the SV IDs. However, when I attempt to map the time tags to ONLY the SV IDs 1 and 2, I get the following result:
58548.1234567800 2 - one time tag off
58550.1234567800 2
58551.1234567800 2
58552.1234567800 2
58553.1234567800 1
58554.1234567800 1
58555.1234567800 1
Quick comments:
  • I use R2012a
  • When copy&pasting the text to a text file, I have removed some line breaks
  • I don't understand why '([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)' does not match the line with SV ID: followed by 51.
  • I prefer to use '(?<=Name)[ ]+(Value)' to read name-value-pairs. I think it makes "better" code; it communicates intent better.
  • I have not read "Mastering Regular Expressions".
  • I think the expressions should be as selective as possible. Regular expressions often cause problems in my code; old code in combination with new text files produces unexpected results.
Does it help to replace the \W by \D?
Actually
'([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)'
does match SV ID 51.
What was wrong with your initial pattern is that the first match is the whole:
Tag: 58548.12345678
Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0
MSNout_BER (0:33) Observation #102 Rx'd at: (58570.000) Msg. Time: (58570.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rel Mode: Active
MSNout_SSS (0:34) Observation #103 Rx'd at: (58571.000) Msg. Time: (58571.000)
Forward to IRU: true Rcv Date: 2010121 Synch: a0a0 Bel Mode: High
Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
Hand ID: 1 SV ID: 2
(which gives time=58548.12345678 and SVID=2)
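The overrun can be reproduced on a shortened, made-up sample by asking regexp for the matched text as well as the tokens. The sample string and variable names below are assumptions, and the leading *? of the original pattern is left out:

```matlab
% Hypothetical two-record sample; the first record has SV ID 51.
txt = sprintf(['Time Tag: 10.5\nHand ID: 0 SV ID: 51 Spam ID: 0\n' ...
               'Time Tag: 11.5\nHand ID: 1 SV ID: 2 Spam ID: 0\n']);

% Outputs are returned in the order the options are listed.
[tok, mat] = regexp(txt, 'Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W', ...
                    'tokens', 'match');

tok{1}   % first time tag paired with the SV ID of the second record
mat{1}   % the matched text runs across both records
```

Because SV ID 51 does not satisfy [12], the lazy .*? keeps extending until it reaches the next SV ID that does, so the first time tag ends up paired with a later record.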
If you want to select only those with SV IDs 1 and 2, you can use
'([\d\.]+)\s+Hand[^B]+?SV ID:\s+([12])'
which works because there is no 'B' between the time tag and the SV ID (it appears only after the SV ID, in 'BOZ'). You could also use an expression that prevents another 'Time Tag' from appearing between the initial time tag and the SV ID, or limit the number of characters between the time tag and the SV ID (i.e. replace .+? with .{1,45}), but I think that [^B] is simpler. Of course, you could just stick to the expression that matches all entries and then filter out those with SV IDs not in {1,2} after conversion to numeric.
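Two of these alternatives, filtering after conversion and blocking a second 'Time Tag' inside the match, might look roughly as follows. This is a sketch on a shortened copy of the data; the negative-lookahead pattern and all variable names are assumptions, not something posted in the thread:

```matlab
% Shortened sample in the two-lines-per-record layout of the file.
txt = sprintf(['Type: 12 Malck ID: 12345 Time Tag: 58548.12345678\n' ...
               'Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0\n' ...
               'Type: 1 Malck ID: 12345 Time Tag: 58549.12345678\n' ...
               'Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0\n' ...
               'Type: 1 Malck ID: 12345 Time Tag: 58556.12345678\n' ...
               'Hand ID: 1 SV ID: 3 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0\n']);

% Option 1: match every record, convert to numeric, then keep SV IDs 1 and 2.
tokAll   = regexp(txt, 'Time Tag:\s+([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)', 'tokens');
pairs    = reshape(str2double([tokAll{:}]), 2, []).';
filtered = pairs(ismember(pairs(:,2), [1 2]), :);

% Option 2: consume characters only while no new 'Time Tag' starts,
% so a match cannot run into the next record.
rex    = 'Time Tag:\s+([\d\.]+)(?:(?!Time Tag).)*?SV ID:\s+([12])\W';
tokSel = regexp(txt, rex, 'tokens');
direct = reshape(str2double([tokSel{:}]), 2, []).';

isequal(filtered, direct)
```

Filtering afterwards keeps the regular expression simple and makes it easy to change the set of accepted SV IDs later.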
Per, Cedric, after re-installing MATLAB I'm getting the proper results. I tried both approaches provided by the 2 of you and they run like a champ. Thanks for the help on this.
