Extract rectangular data from a non-rectangular file with header and convert to a structure of column vectors where field names are the second row of the rectangular data

I am trying to read a text file that has a header of varying length due to some options that can be turned on. Below the header is rectangular data.
The first row of the rectangular data is unimportant to me and can be removed. The second row contains information that corresponds with the columns below it. I would like each of the strings in the second row of rectangular data to become field names for my structure.
Then I would like the corresponding numbers in the columns of data from the third line of rectangular data until the end to be vectors that are added into each field.
I have attached a shortened sample file that I am trying to perform this on to no avail. The actual data file has 172 columns (this can vary depending on the parameters selected) and is ~50k rows long (can also vary). I have tried writing a loop using fgetl and strsplit, which seems to be a usable option, but it is incredibly slow. Textscan seems to be a much faster option, but I am really struggling to figure out how to use its options to make this work.
So far, I don't have much working with textscan.
fid = fopen('sample_text.txt');
C = textscan(fid,'%*s','Delimiter', '\n','CollectOutput', true);
fclose(fid);
Right now, this returns an empty array, and I'm not quite sure what it is actually doing. I just pulled it from the example on using textscan for non-rectangular data. Any help or direction would be much appreciated.

6 Comments

Since str2num and str2double are incredibly slow, I spent a while trying to speed up the code. It was taking 3 minutes to apply either of those to my cell array one time. Using the following, I was able to speed up reading a 48002 line text file with 172 columns (except for the header) from 6.5 minutes to 28 seconds.
clearvars
close all
% clc
tstart=tic;
%%Open the file and read it using textscan
%file name as string
fname = 'sample_text.txt';
%open the file as read only
fid = fopen(fname,'r');
%scan file to cell array as strings using a row delimiter of newline
%character
celltext=textscan(fid,'%s','delimiter','\n');
%close the file
fclose(fid);
%%Find the Number of Header Lines in the File
%Create a cell array of #rows x 1 with each cell containing a column vector
%of the numeric values from its row in the text file - if a row has strings, it
%will appear as empty
notnumcell=cellfun(@(s)sscanf(s,'%f'), celltext{1,1},'UniformOutput',false);
%find the logical vector corresponding to the empty rows (1 is empty, 0 is
%not)
strloc=cellfun('isempty',notnumcell);
%in each rows x 1 cell, split the string and then find the length of the
%split string (this represents the number of columns in each row of the
%data file) - convert the cell containing lengths to a column vector where
%each row contains the number of columns in that row
cellsize=cell2mat(cellfun(@length,cellfun(@strsplit,celltext{1,1},'UniformOutput',false),'UniformOutput',false));
%find max value of the columns
maxcol=max(cellsize);
%find a logical vector corresponding to rows that have the number of
%columns equal to maxcol (1 has maxcol, 0 has different # of columns)
maxcolind=cellsize==maxcol;
%the row number of the variable names and also the number of header lines
%in the file corresponds to the final line that has maxcols and is also a
%string - this returns the index number of that row
hlines=find(strloc&maxcolind,1,'last');
%%Find Parameter Names
%Use hlines to find the nested cell row from celltext that contains the
%parameter names and split the string, so that there is one cell for each
%parameter - these will be used as the field names for the structure
varnames=strsplit(celltext{1,1}{hlines,1});
% %replace '+' and '-' in varnames with 'plus' and 'minus' since those
% %characters cannot be in field names
varnames=strrep(strrep(varnames,'+','plus'),'-','minus');
%%Import Data after Headerlines as Numeric
%open the file as read only
fid = fopen(fname,'r');
%read the rest of the file after the headerlines into a cell array as
%floats and then convert to a matrix
valmat=cell2mat(textscan(fid, repmat('%f',1,maxcol),'Headerlines',hlines));
%close the file
fclose(fid);
%%Loop through the length of the variable name to construct the structure
for i1=1:length(varnames)
%use dynamic references for field names and the corresponsing column of
%values to construct the structure
s.(varnames{i1})=valmat(:,i1);
end
%%Calculate total time to run the program
tend=toc(tstart);
disp(['Total run time is ' num2str(tend) ' seconds.'])
And how fast was the following solution? I slightly updated it for accounting for our latest comments.
tic ;
% - Parse file.
fId = fopen('sample_text.txt','r') ;
for k = 1 : 4
fgetl(fId) ;
end
vars = lower(regexp(fgetl(fId), '\S+', 'match')) ;
data = reshape(fscanf( fId, '%f', Inf), numel(vars), []).' ;
fclose(fId) ;
% - Build output struct.
vars = strrep(vars, '+', '_plus_') ;
s = struct() ;
for k = 1 : numel(vars)
s.(vars{k}) = data(:,k) ;
end
toc
Adding a structure preallocation as you did saved another few seconds. I improved the preallocation slightly by using
%pre-allocate structure field header
empty_cells = repmat(cell(1),1,numel(varnames));
entries = {varnames{:};empty_cells{:}};
s=struct(entries{:});
In your code, varnames would be replaced by vars. Your method takes 28.9 seconds and my method is improved to 24.9 seconds. I was surprised to see that a little more than 3 seconds were saved by pre-allocating the field names header.
Try using the profiler if you never used it. It would be a good context for learning using it. In the command window, type:
profile viewer
In the field "Run this code", type the name of your script (M-File), and click on [Start Profiling]. You will get a report and if you click on the name of the script in the table, you will see what takes times (ranked). You will also see the highlighted code at the bottom with colors that indicate where the time is spent.
I'm surprised that a solution based on SSCANF and a single RESHAPE is slower that your approach though.
I have used the profiler. It pops up if you click run and time, as well. I don't like it because I find it incredibly hard to head. I have attached the outputs from running version 2 of my code using your method and version 3 using my method. I typically use tic and toc to compare segments of code as I find that method much easier to parse. The results of the profiler do show that much of the time is spent making calls to strsplit. I also noticed a good 6-7 seconds of variability in timing between runs, which I find surprising for something that is on the order of 25 seconds to run.
The comparison is somewhat skewed since you have not presented a method to find the number of header lines. It is just hard-coded, so I had to use my method for that. The comparison simply amounts to replacing
%open the file as read only
fid = fopen(fname,'r');
%read the rest of the file after the headerlines into a cell array as
%floats and then convert to a matrix
valmat=cell2mat(textscan(fid, repmat('%f',1,maxcol),'Headerlines',hlines));
%close the file
fclose(fid);
with
% - Parse file.
fId = fopen(fname,'r') ;
for k = 1 : (hlines-1)
fgetl(fId) ;
end
vars = lower(regexp(fgetl(fId), '\S+', 'match')) ;
data = reshape(fscanf( fId, '%f', Inf), numel(vars), []).' ;
fclose(fId) ;
I have also attached the profiler output for hardcoding the number of headerlines, which only takes about 6 seconds.
The slowest line of code in finding the headerlines is
cellsize=cell2mat(cellfun(@length,cellfun(@strsplit,celltext{1,1},'UniformOutput',false),'UniformOutput',false));
If your source text data file is not too confidential, you can send it to me (you got my email each time I sent you a message indicating that I posted a comment), and I can see quickly if I can speedup the processing.

Sign in to comment.

 Accepted Answer

The problem is that your file has discrepancies. If you look at the first row of data, it is missing an e before the +04 in the 5th column:
If it isn't a mistake due to a copy "by hand" for building an example, you could pre-process the content to correct for discrepancies before calling e.g. TEXTSCAN:
content = fileread('sample_text.txt') ;
content = regexprep( content, '(?<=\d)+', 'e+' ) ;
C = textscan( content, '%f%f%f%f%f%f', 'CollectOutput', true, 'HeaderLines', 5 ) ;
Now C should have a correct content.

7 Comments

Sorry, I did not see that error. It was a typo. I have fixed it and replaced the file. I have pieced together something that works though. Can you let me know if you have any thoughts on it or if there are any better methods than what I have used?
%%Open the file and read it using textscan
% read the file
fid = fopen('sample_text.txt','r');
%scan file to cell array as strings using a row delimiter of newline
%character
celltext=textscan(fid,'%s','delimiter','\n');
%convert 1x1 cell with nested 10x1 cell to 10x1 cell
celltext = vertcat(celltext{:});
%close the file
fclose(fid);
%converts rows x 1 cell of 1 x 1 cells to 10 x 1 with nested 1 x #of columns
cellsplit=cellfun(@strsplit,celltext,'UniformOutput',false);
%%Find the maximum number of columns and remove all lines that do not contain it
%in each rows x 1 cell, find vector with number of rows and number of columns
cellsize=cellfun(@size,cellsplit,'UniformOutput',false);
%convert cell of row and col numbers to a matrix of row and col numbers
matsize=cell2mat(cellsize);
%find max value of col
maxcol=max(matsize(:,2));
%find a logical vector that is 1 if column is not equal to maxcol
notmaxcol=matsize(:,2)~=maxcol;
%remove all rows that do not have number of cols = maxcols
celltext(notmaxcol)=[];
%%Find my variable names
%convert all strings that are numeric to numbers (strings will become
%empty)
convnums=cellfun(@str2num,celltext,'UniformOutput',false);
%find the location of the empty string
strloc=cellfun(@isempty,convnums);
%find the cell array of names for use in the structure
varnames=cellfun(@strsplit,celltext(strloc),'UniformOutput',false);
%unnest and horizontally concatenate cell names (' to transpose)
varnames=[varnames{:}];
%%Remove all rows through my names row
%find the location of the names row
nameind=find(strloc);
%remove the rows through the names row (handles the case in which there are
%multiple string rows and just assumes that the last one is the variable
%name row)
celltext(1:max(nameind))=[];
%%Convert remaining cells into a matrix of parameter values
%convert rows x 1 cell to rows x maxcols
celltext=cellfun(@strsplit,celltext,'UniformOutput',false);
%unnest and vertically concatenate numbers
celltext=vertcat(celltext{:});
%find matrix of values
valmat=cellfun(@str2num,celltext);
%%Loop through the length of the variable name to construct the structure
for i1=1:length(varnames)
%use dynamic references for field names and the corresponsing column of
%values to construct the structure
s.(varnames{i1})=valmat(:,i1);
end
This code doesn't work on the sample text because you are removing all rows that don't have max col elements, and one row of the header has more elements than there are data columns.
But I have to say that this is an excellent coding style with a lot of comments and information! It seems that you have a good mastery of data manipulation, so no doubt that you would make it work with a little extra work for managing special situations. For example, you could detect the number of data columns based on the last cells of cellsplit, as the table is regular:
while isempty( cellsplit{end}{1} ) % Remove all last empty rows.
cellsplit(end) = [] ;
end
nDataCols = numel( cellsplit{end} ) ;
Then if the header is always 5 rows long, you could directly extract rows 6 to end for the data block and process row 5 for variable names.
There are many other ways to import your data (e.g. using READTABLE and its options or IMPORTDATA). I'll show you one fairly concise alternative:
fId = fopen('sample_text.txt','r') ;
for k = 1 : 4
fgetl(fId) ;
end
vars = lower(regexp(fgetl(fId), '\w+', 'match')) ;
data = reshape(fscanf( fId, '%f', Inf), numel(vars), []).' ;
fclose(fId) ;
where you can see that we first read/discard 4 rows, extract/clean variable names in one shot, and extract/reshape data in one shot.
If you display vars and data, you will see the structure of the output. Then you can easily convert it to a struct or to a struct array if necessary.
Finally, I used REGEXP to extract variable names but you could use STRPLIT and clean the output from potential last empty cells as well. I just generally prefer REGEXP because it allows to better filter what we want to extract.
That row that is longer than the others should be the same length as everything below the header. So I uploaded a new file that corrects that problem and should run well now. The header number of columns will be far fewer than my actual number of columns.
I am now having trouble with one error handling case, so I also changed one of the variable names in the sample file to incorporate a '+'. The plus sign cannot be used as a field name, so I am trying to replace it by the word 'plus' in the varnames cell array.
I tried:
cellfun(@(x) strrep(x, '+', 'plus'), varnames, 'UniformOutput', false);
However, that is not working for me. Neither is simply
strrep(varnames, '+', 'plus')
I think that cellfun needs to be incorporated, but I cannot figure out the syntax.
Also, thanks for those other code snippets. I do not understand what is happening in the lower and reshape commands very well right now, but I will be trying to speed up the code once I finish error handling as it is quite slow now when processing 48000 rows x 127 columns. So I hope that they could be useful.
What is not working?
>> varnames
varnames =
1×7 cell array
{'Num'} {'Counter'} {'Time'} {'pos+l'} {'vel'} {'acc'} {'hi'}
>> strrep(varnames, '+', 'plus')
ans =
1×7 cell array
{'Num'} {'Counter'} {'Time'} {'posplusl'} {'vel'} {'acc'} {'hi'}
you just have to store the updated cell array back in varnames (where I added a few underscores for readability):
%%Loop through the length of the variable name to construct the structure
varnames = strrep(varnames, '+', '_plus_') ;
for i1=1:length(varnames)
%use dynamic references for field names and the corresponsing column of
%values to construct the structure
s.(varnames{i1})=valmat(:,i1);
end
which outputs:
>> s
s =
struct with fields:
Num: [4×1 double]
Counter: [4×1 double]
Time: [4×1 double]
pos_plus_l: [4×1 double]
vel: [4×1 double]
acc: [4×1 double]
hi: [4×1 double]
Now about my code, the call to LOWER just changes all variable names to lower case, just not to have a mix that could confuse people (e.g. .Num and .vel). For the RESHAPE, what we do with FSCANF is to read/parse the whole rest of the file in one shot/block and we get a vector of all values. This vector must be reshaped into an array, and we can do this based on the knowledge of the number of columns (which is the number of variables). RESHAPE reads its input and fills an output array linearly (which means column first) according to dimensions/sizes that must be specified. It is flexible as it doesn't require that we specify the size of all dimensions and computes the missing one if we pass all but one, based on the number of elements in the input (to illustrate, if I give you 20 elements and ask you to create an array of 4 rows with it, you can easily compute the number of columns (5)). So the call to reshape can be summarized as
reshape( input_array, nRows, [] )
----------- ----- ----
vector of (*) missing
all values size
As the filling is column first and values were read by row, we pass the number of columns as a number of rows, and we transpose afterwards. The best way to understand is to train with a few examples: if you are reading
10 11 12
20 21 22
from the file, the output of FSCANF will be:
data = [10,11,12,20,21,22] ;
now you can see checkout how RESHAPE works with this and you will realize that passing 3 as the number of rows and then transposing, is the way to go.
Thanks for all of the help. This is great. You are correct that it was a silly assignment error and the strrep command was working as you showed. Everything is working now. I will be trying to speed it up and may have additional questions whenever I get around to it, but I think that the asked question has been answered.

Sign in to comment.

More Answers (0)

Categories

Asked:

on 10 Oct 2017

Commented:

on 20 Oct 2017

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!