Extract rectangular data from a non-rectangular file with header and convert to a structure of column vectors where field names are the second row of the rectangular data
I am trying to read a text file that has a header of varying length due to some options that can be turned on. Below the header is rectangular data.
The first row of the rectangular data is unimportant to me and can be removed. The second row contains information that corresponds with the columns below it. I would like each of the strings in the second row of rectangular data to become field names for my structure.
Then I would like the corresponding numbers in the columns of data from the third line of rectangular data until the end to be vectors that are added into each field.
I have attached a shortened sample file that I am trying to perform this on to no avail. The actual data file has 172 columns (this can vary depending on the parameters selected) and is ~50k rows long (can also vary). I have tried writing a loop using fgetl and strsplit, which seems to be a usable option, but it is incredibly slow. Textscan seems to be a much faster option, but I am really struggling to figure out how to use its options to make this work.
So far, I don't have much working with textscan.
fid = fopen('sample_text.txt');
C = textscan(fid,'%*s','Delimiter', '\n','CollectOutput', true);
fclose(fid);
Right now, this returns an empty array, and I'm not quite sure what it is actually doing. I just pulled it from the example on using textscan for non-rectangular data. Any help or direction would be much appreciated.
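For readers wondering why the call above comes back empty: the asterisk in `%*s` tells textscan to match the field and then discard it, so nothing is stored. A minimal sketch of the same call that keeps each line instead is shown below. The temporary file and its contents are hypothetical stand-ins for the real data file, not the actual sample attached to the question.

```matlab
% Hypothetical stand-in for the real file: two header lines, one row to
% discard, one row of column names, and two rows of numbers.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'option A on\noption B off\nignored row\nt x+ y\n0 1 2\n1 3 4\n');
fclose(fid);

% '%s' keeps each field; '%*s' (as in the question) matches and discards it.
% An empty 'Whitespace' setting preserves leading blanks on each line.
fid = fopen(fname, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n', 'Whitespace', '');
fclose(fid);
lines = C{1};      % cell array with one entry per line of the file
delete(fname);     % remove the temporary demo file
```

Once the lines are in a cell array, the row of column names can be located and split without looping over fgetl, although how to detect the end of the header depends on the real file layout.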
6 Comments
Shawn
on 17 Oct 2017
And how fast was the following solution? I updated it slightly to account for our latest comments.
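tic ;
% - Parse file.
fId = fopen('sample_text.txt','r') ;
% Skip the four lines that precede the row of column names.
for k = 1 : 4
    fgetl(fId) ;
end
% Split the names row on whitespace and lower-case the names.
vars = lower(regexp(fgetl(fId), '\S+', 'match')) ;
% Read all remaining numbers in one call, then reshape to rows x columns.
data = reshape(fscanf( fId, '%f', Inf), numel(vars), []).' ;
fclose(fId) ;
% - Build output struct.
% '+' is not allowed in a field name, so replace it.
vars = strrep(vars, '+', '_plus_') ;
s = struct() ;
for k = 1 : numel(vars)
    s.(vars{k}) = data(:,k) ;
end
toc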
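The strrep call above handles only the '+' character. If other column names contain characters that are illegal in field names, one possible alternative is a sketch based on matlab.lang.makeValidName and matlab.lang.makeUniqueStrings (available in R2014a and later). The column names and the numeric block below are hypothetical and chosen only to illustrate the idea.

```matlab
vars = {'time', 'x+', 'x-', '2nd col'};        % hypothetical column names
data = magic(4);                               % stand-in for the numeric block

names = matlab.lang.makeValidName(vars);       % replace illegal characters
names = matlab.lang.makeUniqueStrings(names);  % resolve clashes such as x+ / x-

% Build the struct of column vectors with dynamic field names.
s = struct();
for k = 1:numel(names)
    s.(names{k}) = data(:, k);
end
```

The trade-off is that the generated names are less descriptive than a hand-written replacement such as '_plus_', so it may be worth keeping the original names alongside the struct.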
Shawn
on 18 Oct 2017
Cedric
on 18 Oct 2017
Try the profiler if you have never used it. This would be a good opportunity to learn how it works. In the command window, type:
profile viewer
In the field "Run this code", type the name of your script (M-file) and click [Start Profiling]. You will get a report, and if you click on the name of the script in the table, you will see what takes the most time (ranked). You will also see the code at the bottom, highlighted with colors that indicate where the time is spent.
I'm surprised that a solution based on SSCANF and a single RESHAPE is slower than your approach, though.
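The profiler can also be driven from code rather than through the viewer's "Run this code" field. The sketch below assumes a small stand-in workload; in practice the line between `profile on` and `profile off` would be a call to your own script.

```matlab
profile on                      % start collecting timing data
x = cumsum(rand(1e5, 1));       % stand-in workload; replace with your script
profile off                     % stop collecting
p = profile('info');            % struct with the collected statistics
% profile viewer                % uncomment to open the graphical report
```

The `p.FunctionTable` field holds the per-function entries that the viewer displays, which is convenient when comparing two approaches in a batch run.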
Shawn
on 19 Oct 2017
Cedric
on 20 Oct 2017
If your source data file is not too confidential, you can send it to me (you received my email address each time I sent you a message indicating that I had posted a comment), and I can quickly see whether I can speed up the processing.