Make this script faster
Dear all,
I have a text file (an eye tracker log) with 12 columns and 2398068 rows, and I import it with the code below.
The first line is a header with the variable names. Only column 9 contains strings; the rest are doubles.
Is there a way to make this script run faster?
Thanks for any insight.
filename = 'file.txt' ;
% - Get structure from first line.
fid = fopen( filename, 'r' ) ;
line = fgetl( fid ) ;
fclose( fid ) ;
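% - Build formatSpec for TEXTSCAN: a char vector that reads every column as text,
%   so the header row parses and the str2double conversion below applies.
fmt = repmat( '%s', 1, 12 ) ;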
% - Read full file.
fid = fopen( filename, 'r' ) ;
data = textscan( fid, fmt, Inf, 'Delimiter', ';' ) ;
fclose( fid ) ;
data = ([data{:}]) ;
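data(2:end,9) = num2cell( strcmp( data(2:end,9), 'Event 1 > Stimulation' ) ) ;
% - Convert the numeric columns in place, so that column 9 is not discarded.
data(2:end,[1:8 10:end]) = cellfun( @str2double, data(2:end,[1:8 10:end]), 'UniformOutput', false ) ;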
5 Comments
Colin Edgar
on 17 Dec 2015
I think I have the same issue. My code is very similar, and it is the last line, specifically the @str2double call, that slows it down. I am trying to use fscanf but have formatting issues (my first data column is a "yyyy-dd-mm etc" timestamp). There may be a way to use cell arrays more efficiently, but that may not work for me if I have to change a lot of code.
Renato Agurto
on 17 Dec 2015
Hi, can you give us a small example of how the data is formatted in the text file?
Colin Edgar
on 17 Dec 2015
Edited: Colin Edgar
on 17 Dec 2015
My issue with fscanf is that I don't see how to incorporate delimiters; it reads the whole file into one cell. As long as I capture the numbers, I could ignore the first column. Data example:
"2014-11-11 00:00:00.1",9830807,255,0.0930586,18.1384,151.47,-4.5321,100.461,-0.569257,1.00181,0.076258,330.491,99.6897
"2014-11-11 00:00:00.2",9830808,255,0.0930802,18.1438,151.384,-4.53333,100.458,-0.569257,1.00181,0.076258,330.489,99.688
"2014-11-11 00:00:00.3",9830809,255,0.0930782,18.1433,151.433,-4.53333,100.458,-0.569257,1.00181,0.076258,330.49,99.6912
Edit---Also there are occasional NaN values in the numeric data.
jgg
on 17 Dec 2015
I had a similar issue. I ended up doing the initial data cleaning in Stata or R since it was easier to reformat the columns.
Colin Edgar
on 17 Dec 2015
I can't make fscanf ignore the first "" string, for example:
frmt = '%*s%s%s%s%s%s%s%s%s%s%s%s%s%[^\n\r]';
A = fscanf(fid, frmt, [12, inf]);
A = "
Unless I do this:
A = fscanf(fid, '%s', [12, inf]);
A = 12 x 16833 (Char)
What I want is:
A = 12 x 16833 double
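One possible way to get a numeric matrix while ignoring the quoted timestamp is textscan rather than fscanf, since textscan accepts a delimiter and can skip a double-quoted field with %*q. The following is only a sketch under assumptions: it parses two sample lines held in a char vector instead of a file, the first line is copied from the example above, and the second line has one value replaced by NaN purely to illustrate the NaN case mentioned in the edit note.

```matlab
% Two sample rows in the posted layout: quoted timestamp + 12 numeric fields.
% The NaN in row 2 is inserted here for illustration only.
sample = sprintf('%s\n', ...
    '"2014-11-11 00:00:00.1",9830807,255,0.0930586,18.1384,151.47,-4.5321,100.461,-0.569257,1.00181,0.076258,330.491,99.6897', ...
    '"2014-11-11 00:00:00.2",9830808,255,NaN,18.1438,151.384,-4.53333,100.458,-0.569257,1.00181,0.076258,330.489,99.688');

% %*q skips the double-quoted first field; the 12 %f fields are kept.
fmt = ['%*q' repmat('%f', 1, 12)];

% CollectOutput merges the 12 numeric columns into a single matrix.
C = textscan(sample, fmt, 'Delimiter', ',', 'CollectOutput', true);
A = C{1};   % one row per line, 12 columns, class double
```

For a real file, the first argument would be the file identifier from fopen, and a 'HeaderLines' option could skip the heading. Note that the result is N x 12 rather than 12 x N, so no transpose is needed.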
Answers (1)
Colin Edgar
on 17 Dec 2015
Edited: Colin Edgar
on 17 Dec 2015
Here is my solution. It takes only ~1 sec to run per file (~2 MB, 12 x 18000). This is for the example data I posted above, but with the initial "timestamp" removed. I believe this answers the OP's issue as well, since the data was very similar.
formatSpec = '%f,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f\n';
fid = fopen(flnm,'r');
t1 = fgetl(fid); %reads past heading, I know it's a hack but...
t1 = fgetl(fid);
t1 = fgetl(fid);
t1 = fgetl(fid);
mat = fscanf(fid, formatSpec, [12,inf]);
mat = mat'; %transpose to correct layout
fclose(fid);
Versus my old version, which took ~15 sec (similar to the OP's approach):
formatSpec = '%s%s%s%s%s%s%s%s%s%s%s%s';
fid = fopen(flnm,'r');
C = textscan(fid,formatSpec,'HeaderLines',4,'Delimiter',',');
mat = cell2mat(cellfun(@str2double,C,'UniformOutput',false));
fclose(fid);
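For the original question's layout (semicolon-delimited, one header line, text only in column 9), the same idea might be applied by letting textscan parse the numeric columns directly as %f, so that no str2double pass is needed. This is a sketch under assumptions, not a measured result: it writes a tiny stand-in file with a hypothetical header and made-up values so that it can run on its own, and the variable names isStim and num are illustrative.

```matlab
% Write a small stand-in file (hypothetical header names and values).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, 'c1;c2;c3;c4;c5;c6;c7;c8;event;c10;c11;c12\n');
fprintf(fid, '1;2;3;4;5;6;7;8;Event 1 > Stimulation;10;11;12\n');
fprintf(fid, '1;2;3;4;5;6;7;8;Other;10;11;12\n');
fclose(fid);

% Columns 1-8 and 10-12 are read as doubles, column 9 as text.
fmt = '%f%f%f%f%f%f%f%f%s%f%f%f';
fid = fopen(filename, 'r');
C = textscan(fid, fmt, 'Delimiter', ';', 'HeaderLines', 1);
fclose(fid);

% Logical flag per row for the event of interest.
isStim = strcmp(C{9}, 'Event 1 > Stimulation');

% Numeric columns gathered into one N-by-11 double matrix.
num = [C{[1:8 10:12]}];

delete(filename);   % remove the stand-in file
```

Keeping the event flag and the numeric matrix as separate variables avoids building one large cell array with an element per value, which is typically where much of the time and memory goes in the cellfun-based versions.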