Why is fwrite ~320x slower in the second iteration and onwards when writing interleaved data?
Kouichi C. Nakamura
on 22 Jul 2019
Commented: Kouichi C. Nakamura
on 23 Jul 2019
I'd like to write a binary file that contains four integer vectors of the same length in an interleaved fashion.
For example, let s(i,j) denote the sample from the i-th channel at time step j. The file should then contain s(1,1), s(2,1), s(3,1), s(4,1), s(1,2), s(2,2), ... in that order.
I can prepare one long vector laid out like that and then pass it to fwrite, which is the easier option.
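For reference, that in-memory approach might look like this (a minimal sketch; ch1, ch2, ch3, ch4 are hypothetical int16 column vectors of equal length and newfile is the output path):
% interleave four column vectors in memory, then write everything in one call
interleaved = reshape([ch1 ch2 ch3 ch4].', [], 1); % s(1,1), s(2,1), s(3,1), s(4,1), s(1,2), ...
fid = fopen(newfile, 'w');
fwrite(fid, interleaved, 'int16');
fclose(fid);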
But that design is not great: when the individual vectors are really long, and when I have to accommodate many vectors rather than just four, it will run into memory problems.
So a cleverer way to do this is to write one vector at a time without keeping all of them in memory.
The code below works, but I found that the performance of fwrite is much worse from the second iteration onwards compared to the first.
While t3 in the first loop was less than a second (0.281 sec), t3 took roughly 1 min 30 sec for the second to fourth iterations (~320 times slower). I wonder why this is so slow, and whether there is a workaround.
n = 4; % number of interleaved channels
fid = fopen(newfile,'w');
for i = 1:4
    filepath = fullfile(datadir,filenames{i});
    data = load_data(filepath); % int16 vector
    status = fseek(fid,(i-1)*2,'bof');
    if status == -1
        error('fseek failed')
    end
    t1 = datetime;
    fwrite(fid, data, 'int16', 2*(n-1)); % skip 2*(n-1) bytes between samples of this channel
    t2 = datetime;
    t3 = duration(t2-t1) % why is this slower in the second iteration and onwards?
end
fclose(fid);
t3 = duration
00:00:00.2810
t3 = duration
00:01:37.2600
t3 = duration
00:01:29.9700
t3 = duration
00:01:29.6490
90 (sec) / 0.281 (sec) == ~320 (times)
A simplified test script
The longer the vectors are, the slower the second and third iterations become, which suggests the slowdown has to do with the data already in the file before the end of file, rather than with appending extra bytes at the end.
len = 1000;
i1 = ones(len,1,'int16');
i2 = ones(len,1,'int16').*2;
i3 = ones(len,1,'int16').*3;
fid = fopen('temp.bin','w')
t1= datetime;
fwrite(fid,i1,'int16',2*2);
t2 = datetime;
t3(1,1) = duration(t2-t1);
t1= datetime;
fseek(fid,2, 'bof')
fwrite(fid,i2,'int16',2*2);
t2 = datetime;
t3(2,1) = duration(t2-t1);
t1= datetime;
fseek(fid,2*2, 'bof')
fwrite(fid,i3,'int16',2*2);
t2 = datetime;
t3(3,1) = duration(t2-t1);
fclose(fid);
t3.Format = 'dd:hh:mm:ss.SSSS'
len = 1000
t3 = 3×1 duration array
00:00:00.0070
00:00:00.0140
00:00:00.0190
2~3 times slower
len = 10000
t3 = 3×1 duration array
00:00:00.0060
00:00:00.0850
00:00:00.1050
14~18 times slower
len = 100000
t3 = 3×1 duration array
00:00:00.0280
00:00:00.9820
00:00:00.9080
35 times slower
Accepted Answer
Guillaume
on 22 Jul 2019
I would suspect the reason for the much slower writes in iterations 2 and onwards is that in the first iteration, since the file is new, you're just writing (your data, plus the zeros in between). From iteration 2, MATLAB must first read the relevant part of the file, insert the new content, and rewrite it.
If you can't fit the whole interleaved data in memory in one go, then you should read the input data in chunks, interleave each chunk in memory, and write it as one contiguous block. Repeat for the next chunk of data.
7 Comments
Guillaume
on 23 Jul 2019
Edited: Guillaume
on 23 Jul 2019
No, if you're reading and writing in chunks, the buffer size doesn't matter, and for best performance I would think you'd want to be above that buffer size. The chunk size should only depend on how much memory you want to use / have available.
So the code would be something like:
%filenames: cell array of input files
chunksize = 4000; %whatever you want; the actual size in bytes will be nfiles * chunksize * sizeof(datatype)
buffer = zeros(numel(filenames), chunksize, 'int16');
fidin = zeros(size(filenames));
numread = zeros(size(filenames));
%open the input files
for fileidx = 1:numel(filenames)
    fidin(fileidx) = fopen(fullfile(datadir, filenames{fileidx}), 'r');
    assert(fidin(fileidx) > 0, 'failure to open file %d', fileidx);
end
fidout = fopen(newfile, 'w');
while true
    for fileidx = 1:numel(filenames)
        [values, numread(fileidx)] = fread(fidin(fileidx), chunksize, 'int16');
        buffer(fileidx, 1:numread(fileidx)) = values;
    end
    assert(all(numread == numread(1)), 'files don''t have the same length');
    %since the data was read into rows and fwrite writes column by column, the data is written interleaved
    fwrite(fidout, buffer(:, 1:numread(1)), 'int16');
    if numread(1) < chunksize || feof(fidin(1))
        break;
    end
end
fclose(fidout);
for fileidx = 1:numel(filenames)
    fclose(fidin(fileidx));
end
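As a quick check of the interleaving, one channel can be read back from the output file with fread and a matching skip (a minimal sketch based on the code above; k is a hypothetical channel index):
% read back channel k from the interleaved int16 file to verify the layout
k = 2;                                          % channel to check (hypothetical choice)
nchan = numel(filenames);                       % number of interleaved channels
fidchk = fopen(newfile, 'r');
fseek(fidchk, (k-1)*2, 'bof');                  % start at the first sample of channel k (2 bytes per int16)
chk = fread(fidchk, Inf, 'int16', 2*(nchan-1)); % skip the other channels after each value
fclose(fidchk);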
More Answers (1)
Walter Roberson
on 22 Jul 2019
Edited: Walter Roberson
on 22 Jul 2019
fseek past the end of file followed by fwrite results in the data being written at the end of file. This is contrary to POSIX, which requires that the gap be filled with zeros (possibly implicitly with a demand-zero scheme). In MATLAB, if you want to write at some point after the end of file, you must fill the gap yourself.
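A minimal sketch to observe this behaviour on your own system (the file name and values are made up for illustration):
% write 2 int16 values, try to seek past the end of file, write 2 more,
% then check the resulting file size
fid = fopen('gaptest.bin', 'w');
fwrite(fid, int16([1 2]), 'int16');   % file is now 4 bytes
status = fseek(fid, 100, 'bof')       % attempt to seek past the current end of file
pos = ftell(fid)                      % where the next write will actually land
fwrite(fid, int16([3 4]), 'int16');
fclose(fid);
d = dir('gaptest.bin');
d.bytes                               % 8 if the data went to the old eof, 104 if the gap was zero-filled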