importing csv array of unknown size, characters

5 views (last 30 days)
kc
kc on 13 Jun 2012
I have an import script which I use to import data from a pressure/velocity flow sensor device. The data are of a general format, but the amount of header text varies, as does the rows & columns of the actual data. Each CSV file contains several "runs" or recordings of data. I need to import each run into a different array (so a cell works well). The one consistency is that each "run" is separated by the text "Run #x" where 'x' is the run number integer.
Originally, the software generating the CSV files was consistent in that all the CSV data always had 266 columns. With the help of another user here, I exploited that consistency in this code:
function data = import_KC(filename)
fid = fopen(filename);
run_num = 1;
%all runs contain n x m numbers and n is different for each run.
%So we cannot use a multidimensional array.
%We use a cell instead.
data = {};
while ~feof(fid)
%Skip the header lines. We don't know in advance how many there are
%so cant use the header functionality in textscan.
current_line = fgetl(fid);
while ischar(current_line)
match = strfind(current_line, 'Run #');
if(match)
break;
end
current_line = fgetl(fid);
end
%Load data. We don't know dimensions in advance
%res will initially contain all data from this run in a single column
res=textscan(fid,'%f','Delimiter',',');
res=res{1};
%find number of rows. Columns are fixed at 265
rows = length(res)/265;
%reshape into a matrix
res = reshape(res,265,rows)';
%Add this run to the set of runs
data{run_num} = res;
run_num=run_num+1;
end
fclose(fid);
end
Now I'm trying to figure out a way to use the same algorythm, but with a variable number of columns, as the new firmware seems to vary the columns sometimes.
My thought was to first read the data (or a portion of it, say the first 100 rows) into an array of some sort using something like xlsread, then read the size of that array to get the number of columns, which would replace the "266" in the script above.
The problem with this is that xlsread is very slow for very large data files, and these CSV files are huge. I thought reading only the first 100 rows would solve this, but I can't figure out how to reference only some number of rows, without knowing the number of columns.
Is there another way to read in these data, such that I can extract the number of columns there should be, Or if you have other suggestions I'd be open to them. Thanks!
KC
ps - thanks to whomever it was who helped me with this script last time. I couldn't find the posts in the newsgroup, which is where I think I posted about this before.

Answers (1)

Geoff
Geoff on 13 Jun 2012
So the general thing is this... Keep track of the current state outside the loop:
line_num = 0; % Current CSV line number
run_num = 0; % Current run number
ncols = 0; % Number of columns in current run
nrows = 0; % Number of rows in current run
runs = []; % Tracks actual run numbers (if they are not 1,2,3,4...)
rundata = {}; % The data for each run
I would use regexp to get the actual run number
while ~feof(fid)
current_line = fgetl(fid);
line_num = line_num + 1;
% Detect a new run
runtoks = regexp(current_line, '^Run #(\d+)', 'tokens');
if iscell(runtoks) && numel(runtoks) == 1
run_num = str2double(runtoks{1});
runs(end+1) = run_num;
rundata{end+1} = [];
ncols = 0;
nrows = 0;
continue;
end
if run_num == 0
fprintf('Warning: skipping line %d - no run header\n', line_num);
continue;
end
% First time round for each run, detect number of columns.
if ncols == 0
ncols = sum(current_line == ',') + 1;
rundata{end} = zeros(0,ncols); % yes, a 0-by-ncols matrix
end
% Parse row of data from line, and check for sanity
row = textscan( current_line, '%d', 'delimiter', ',' );
if numel(row) ~= ncols
fprintf('Column count mismatch on line %d (run %d)\n', line_num, run_num);
continue;
end
% Append the row to the current run data
nrows = nrows + 1; % (*) Note: see discussion
rundata{end}(nrows, :) = row;
end
Note (*): This could be pretty slow, because we're expanding the matrix for each run one row at a time. You can improve this in a similar fashion to how C++ vectors work: preallocate by some size (1000 rows), and grow the matrix by a constant or exponential amount when necessary:
if size( rundata{end}, 1 ) < nrows
rundata{end}(end+1:end+1000, :) = 0;
end
In the case of doing the above, you would also need to store the size of each run in an array (like I do with the runs array), and trim them to size afterwards.
Beware, I just typed this code straight into my browser. It might not compile, and may have other problems. I have not handled headers at all cos you didn't describe your data file completely, so you need to sort that out. That aside, this is the general approach I would use to read arbitrary data sets unless I needed something really special.
  6 Comments
kc
kc on 14 Jun 2012
It *used* to have the same # of columns each time (266), but now it doesn't. Detecting the # of columns in the data was the crux of my question. Looking at the raw CSV data in a text editor, there doesn't seem to be anything to indicate a new row, just commas for new columns. But when I open it in Excel, Excel knows when new rows begin. Strange.
Geoff
Geoff on 18 Jun 2012
Oh, in that case you probably have UNIX -style line endings. Open the file without translation (ie omit the "t" part of the format for fopen()). What text editor did you use? Notepad I suppose. That is NOT a text editor in my opinion. Try Notepad++. Using fgetl() should work correctly if your newlines are being detected properly.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!