How to parse an Nx1 string array without looping through N

I have an Nx1 string array, and I can't figure out how to extract 6 chunks of text out of it and into an Nx6 cell array. The text elements are numbers, but it's simplest to not treat them as numbers at this juncture.
Here is a toy version of the string array, together with code that correctly parses out the necessary elements of CCYYMMDD and hhmm from the first element of the string array:
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
"nsasondewnpnC1.b1.20020428.220500.cdf"; ...
"nsasondewnpnC1.b1.20020428.235900.cdf"; ...
"nsasondewnpnC1.b1.20020429.013100.cdf"; ...
"nsasondewnpnC1.b1.20020429.182500.cdf"];
charLaunch = textscan(stringFile(1),'%*18c %2c %2c %2c %2c %*c %2c %2c');
charLaunch =
1×6 cell array
{'20'} {'02'} {'04'} {'28'} {'18'} {'48'}
However, both
charLaunchAll = textscan(stringFile,'%*18c %2c %2c %2c %2c %*c %2c %2c');
and
charLaunchAll = cell(5,6);
charLaunchAll = textscan(stringFile(:),'%*18c %2c %2c %2c %2c %*c %2c %2c');
generate the same error message:
Error using textscan
First input must be a valid file-id or non-empty character vector.
Is there a way to extract these pieces of text from every element of the array without writing a loop?

Accepted Answer

Stephen23 on 23 Apr 2020
Edited: Stephen23 on 23 Apr 2020
Using one simple regular expression:
C = {...
'nsasondewnpnC1.b1.20020428.184800.cdf'; ...
'nsasondewnpnC1.b1.20020428.220500.cdf'; ...
'nsasondewnpnC1.b1.20020428.235900.cdf'; ...
'nsasondewnpnC1.b1.20020429.013100.cdf'; ...
'nsasondewnpnC1.b1.20020429.182500.cdf'};
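out = regexp(C,'\d{2}','match'); % every non-overlapping pair of adjacent digits in each name
out = vertcat(out{:}) % 5x7 cell array: CC YY MM DD hh mm ss (use out(:,1:6) to drop the seconds)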
I used a cell array of character vectors, but it will also work for a string array.
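Note that '\d{2}' returns seven pairs per name, because the seconds are matched too. If exactly the six chunks from the question are wanted, one option is an anchored expression with six capture groups. This is only a sketch: it assumes every name contains the date and time as ".CCYYMMDD.hhmmss." and the names tok and out are just illustrative.

```matlab
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
    "nsasondewnpnC1.b1.20020428.220500.cdf"; ...
    "nsasondewnpnC1.b1.20020428.235900.cdf"; ...
    "nsasondewnpnC1.b1.20020429.013100.cdf"; ...
    "nsasondewnpnC1.b1.20020429.182500.cdf"];
% Six capture groups (CC YY MM DD hh mm); the seconds are matched but not captured.
expr = '\.(\d{2})(\d{2})(\d{2})(\d{2})\.(\d{2})(\d{2})\d{2}\.';
% cellstr gives a cell array of character vectors, so each result is a 1x6 cell.
tok = regexp(cellstr(stringFile), expr, 'tokens', 'once');
out = vertcat(tok{:}); % 5x6 cell array of character vectors
```

Anchoring on the surrounding periods means the single digits in "C1" and "b1" can never be picked up, even if the prefix changes.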
  5 Comments
Stephen23 on 23 Apr 2020
Edited: Stephen23 on 23 Apr 2020
"... why textscan will work with a single element of a string array, but not with an entire array of strings?"
Because low-level string parsing functions parse one string element or one character vector, and textscan is ultimately just a fancy wrapper for low-level operations.
You might think of a string array as one thing, but really it is a container array of multiple character vectors, i.e. it contains lots of individual, separate character vectors, which are stored separately. Not so different from a cell array, really (search this forum for more accurate and detailed discussions on how string arrays are actually implemented).
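Since the elements are stored separately, one way to handle them all at once is to convert the string array into a single char matrix and slice it by column. The following is a minimal sketch that assumes every name has exactly the same fixed-width layout (18-character prefix, 8-digit date, period, 6-digit time); the column numbers are derived from the toy data in the question, not from any general rule.

```matlab
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
    "nsasondewnpnC1.b1.20020428.220500.cdf"; ...
    "nsasondewnpnC1.b1.20020428.235900.cdf"; ...
    "nsasondewnpnC1.b1.20020429.013100.cdf"; ...
    "nsasondewnpnC1.b1.20020429.182500.cdf"];
c = char(stringFile); % 5x37 char matrix, one name per row
% Columns 19:26 hold CCYYMMDD, column 27 is a period, columns 28:31 hold hhmm.
out = [cellstr(c(:,19:20)), cellstr(c(:,21:22)), cellstr(c(:,23:24)), ...
       cellstr(c(:,25:26)), cellstr(c(:,28:29)), cellstr(c(:,30:31))];
```

This avoids any pattern matching, but it breaks as soon as one name has a different length or prefix, so it is only suitable when the layout is guaranteed.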
Parsing a string array introduces ambiguities: e.g. what is the end-of-line character? textscan relies on identifying that character... but parsing a string array would (possibly, see below) require having no EOL character at all, and instead treating each string element as being de-facto delimited by some character (in which case you can trivially do this yourself, as I did in my last comment). You might think it is obvious that each string element should be treated as one line, but computers do not understand "obvious", they understand instructions in the form of code. Consider how this 2x1 string array should be parsed:
str = ["1"; "2"+newline+"3"] % 2x1 string array; the second element contains a newline character
Which of these should textscan(str,'%f') return?
  • [1;2;3]: all values, treating both the newline AND the boundary between string elements as a de-facto EOL.
  • [1;2]: the newline causes parsing to finish.
  • [1]: the second element does not parse.
  • {[1];[2;3]}: the output is not of the class requested, and the cell contents can have an arbitrary size.
  • An error: the second element throws an error.
If you say the first is the correct behavior, what about the next user who expects one of the other behaviors?
Note that text files also consist of one long character vector (people think of them as having "lines", but really they are one long character vector interspersed with newline characters), and low-level file parsing functions parse just that one character vector.
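The same idea can be applied to a string array by hand: join the elements into one long character vector with an explicit newline between them, parse that single vector, and reshape the result. This is a sketch using regexp rather than textscan, and it assumes every name contributes exactly seven two-digit pairs (CC YY MM DD hh mm ss), which holds for the toy data in the question.

```matlab
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
    "nsasondewnpnC1.b1.20020428.220500.cdf"; ...
    "nsasondewnpnC1.b1.20020428.235900.cdf"; ...
    "nsasondewnpnC1.b1.20020429.013100.cdf"; ...
    "nsasondewnpnC1.b1.20020429.182500.cdf"];
txt = char(join(stringFile, newline)); % one long character vector, like a file's contents
tok = regexp(txt, '\d{2}', 'match');   % 1x35 cell: 7 pairs from each of the 5 names
out = reshape(tok, 7, []).';           % 5x7 cell, one row per name
out = out(:,1:6);                      % drop the seconds column
```

Here the element boundary is made explicit by the joined newline, so there is no ambiguity about where one record ends and the next begins.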
Leslie on 23 Apr 2020
Edited: Leslie on 23 Apr 2020
OK, thanks. I'd noticed that what I was trying to do "all at once" would have worked if I'd been reading a file and could have searched for the newline character, but didn't (or couldn't) carry that all the way forward to understanding how the string array was being stored. It just never occurred to me to do something like "ignore through the 'cdf' at the end of the string", which is an analog to the documentation's example of "ignore the rest of the line".


More Answers (1)

Mohammad Sami on 23 Apr 2020
Since the pattern in your string seems to be the same, you can use the format specification to convert the string directly to datetime as follows.
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
"nsasondewnpnC1.b1.20020428.220500.cdf"; ...
"nsasondewnpnC1.b1.20020428.235900.cdf"; ...
"nsasondewnpnC1.b1.20020429.013100.cdf"; ...
"nsasondewnpnC1.b1.20020429.182500.cdf"];
fmt = "'nsasondewnpnC1.b1.'yyyyMMdd'.'HHmmss'.cdf'";
% the constant portion of your string is enclosed in 'single quotes';
d = datetime(stringFile,'InputFormat',fmt);
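If numeric components are acceptable, the datetime array can be split into its parts without a loop. This is a sketch only: it uses ymd and hms on the result of the conversion above, and the variable names (dt, parts and so on) are illustrative. It returns numbers, not the two-digit text chunks, and the year comes back as a whole (2002) rather than as separate century and year.

```matlab
stringFile = ["nsasondewnpnC1.b1.20020428.184800.cdf"; ...
    "nsasondewnpnC1.b1.20020428.220500.cdf"; ...
    "nsasondewnpnC1.b1.20020428.235900.cdf"; ...
    "nsasondewnpnC1.b1.20020429.013100.cdf"; ...
    "nsasondewnpnC1.b1.20020429.182500.cdf"];
fmt = "'nsasondewnpnC1.b1.'yyyyMMdd'.'HHmmss'.cdf'";
dt = datetime(stringFile, 'InputFormat', fmt);
[yr, mo, dy] = ymd(dt); % year, month, day as numeric column vectors
[hr, mn, sc] = hms(dt); % hour, minute, second as numeric column vectors
parts = [yr mo dy hr mn sc]; % 5x6 numeric matrix
```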
  1 Comment
Leslie on 23 Apr 2020
Thanks, interesting usage that I didn't know about.
But I don't really want it in datetime format; I'd like the 2-digit text chunks. If I've got to clutter up my code with sending it to datetime & back, I might as well write the stupid loop. (I'm not meaning to be cranky at you; I'm just cranky that I spent a few hours today poring over documentation and Answers to do something that it seems I ought to be able to do!)

