textscan unable to parse floats beyond 107M bytes

I am using textscan to parse string data read from a large file. I read data from the file into a string, and then use textscan on the string. I do this iteratively over chunks of the file of a configurable size. The file is 9 space-delimited float columns with linefeeds between rows.
Data example:
00 0001 0023 0079 1.1617e-01 -8.2676e+04 3.0000e+10 0000 3.5000e-08
00 0001 0023 0080 1.1619e-01 -6.2363e+03 3.0000e+10 0000 3.5000e-08
00 0001 0023 0081 1.1620e-01 2.5781e+05 3.0000e+10 0000 3.5000e-08
Pseudocode:
read a chunk of the file into a char array
ensure array is rectangular (full lines of 9 columns ending at a linefeed)
[dataCell, lastPosition] = textscan(cleanChunk, '%f', 'CollectOutput', 1, 'BufSize', chunksize+10);
repeat
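The pseudocode above could be sketched in MATLAB roughly as follows. This is a minimal sketch, not the asker's actual code: the function name `scanInChunks` and its arguments `nCols` and `chunkSize` are hypothetical, and `'BufSize'` is left out. It assumes the file holds only whitespace-separated numbers, with `nCols` values per line.

```matlab
function data = scanInChunks(filename, nCols, chunkSize)
% scanInChunks  Parse a whitespace-delimited numeric text file in chunks.
%   Hypothetical helper: reads up to chunkSize characters at a time, cuts
%   each chunk at its last linefeed so that textscan only sees complete
%   lines, and carries the unfinished tail over to the next chunk.
    fid = fopen(filename, 'r');
    assert(fid ~= -1, 'Cannot open %s', filename);
    closer = onCleanup(@() fclose(fid));   % close the file on exit or error

    parts = {};    % parsed values, one column vector per chunk
    carry = '';    % incomplete last line of the previous chunk
    while true
        raw = fread(fid, [1 chunkSize], '*char');
        if isempty(raw) && isempty(carry)
            break                          % nothing left to parse
        end
        chunk = [carry raw];
        if isempty(raw)
            cut = numel(chunk);            % end of file: parse the remainder
        else
            cut = find(chunk == char(10), 1, 'last');
            if isempty(cut)
                carry = chunk;             % no complete line yet, read more
                continue
            end
        end
        carry = chunk(cut+1:end);
        c = textscan(chunk(1:cut), '%f');  % one long column of numbers
        parts{end+1} = c{1};               %#ok<AGROW>
    end
    values = vertcat(parts{:});
    data = reshape(values, nCols, []).';   % one row per line of the file
end
```

A call such as `data = scanInChunks('myfile.txt', 9, 100e6)` (file name hypothetical) keeps every string passed to textscan below the 107374182-byte position described in this question. Whether that sidesteps the problem on a given release is untested here; it relies only on the observation that smaller chunks parse correctly.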
If I try to scan a chunk that is larger than 107M bytes, textscan stops reading part-way through a float and reports lastPosition as exactly 107374182 bytes. (For example, the data at that position is '3.0000e+10', and the 107374182nd byte splits the float into 3.000 and 0e+10.) Any chunk size above 107M bytes results in the float at that precise location being split in half by textscan.
I have checked this precise location in the file, and there is no delimiter, linefeed or other artifact there. I have also checked the string itself after reading it from the file, and it is correct: the string passed to textscan is longer than 107M bytes, has complete lines of data, and has no artifact at that location.
If the chunk size is less than 107M bytes, textscan correctly scans every float right past this location in the file. In other words, as long as I do not ask textscan to scan a string longer than that exact number of bytes, it parses the strings correctly.
Is this some undocumented limit with textscan, or have I missed something?
Edit: 64-bit Ubuntu Linux v12.04, Intel Core i7-3570K, 3.4GHz x 4, 8GB RAM, MATLAB R2012a
I just re-ran the same code on a different file with four columns of float and got the same result at the exact same location, 107374182 bytes.
Example:
00000001 4.0652e-03 2.0001e-05 0.0000e+00
00000002 6.0975e-03 4.0003e-05 0.0000e+00
00000003 8.1288e-03 6.0004e-05 0.0000e+00
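One arithmetic observation that may be useful when reporting this: the stopping position equals floor(2^30/10), that is, one tenth of 1 GiB rounded down. Whether this reflects an internal limit in textscan is only a guess; nothing quoted in this thread confirms it. The check itself is simple:

```matlab
% Position at which textscan stops, as reported in the question
stopPos = 107374182;

% One tenth of 2^30 bytes (1 GiB), rounded down
candidate = floor(2^30 / 10);

fprintf('%d\n', candidate);              % prints 107374182
fprintf('%d\n', stopPos == candidate);   % prints 1
```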

Answers (1)

Is 'BufSize' a documented name-value pair argument of textscan? It is documented for textread and strread, but I do not find it for textscan.
AFAIK: textscan has no such limitations.
I ran this test with R2013b, 64-bit, Windows 7, 32 GB RAM:
% Write 1e7 copies of the sample line to a test file
str = '00 0001 0023 0079 1.1617e-01 -8.2676e+04 3.0000e+10 0000 3.5000e-08';
fid = fopen( 'd:\tmp\huge.txt', 'W' );
for jj = 1 : 1e7
    fprintf( fid, '%s\n', str );
end
fclose( fid );
% Read the whole file into one string and parse it in a single call
tic
buf = fileread( 'd:\tmp\huge.txt' );
cac = textscan( buf, '%f%f%f%f%f%f%f%f%f', 'CollectOutput', true );
toc
num = cac{:};
whos num
outputs
Elapsed time is 34.122473 seconds.
  Name            Size            Bytes  Class     Attributes
  num      10000000x9         720000000  double
The size of d:\tmp\huge.txt is 664MB
Addendum
tic
[dataCell,pos] = textscan( buf, '%f', 'CollectOutput', true, 'BufSize', 720e6+10);
toc
vec = dataCell{:};
whos vec num
outputs
Elapsed time is 33.691454 seconds.
  Name            Size            Bytes  Class     Attributes
  num      10000000x9         720000000  double
  vec      90000000x1         720000000  double
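A scan that stops early, as described in the question, would show up as a shortfall in the number of parsed values. A minimal sanity check along these lines, assuming every line holds the same number of values and ends in a linefeed (the small `buf` and the name `nCols` are illustrative only):

```matlab
% Small stand-in for the large string read from the file
buf   = sprintf('1 2 3\n4 5 6\n');
nCols = 3;                               % values per line (assumed constant)

c   = textscan(buf, '%f');
vec = c{1};

% Count linefeeds to get the number of complete lines in the string
nLines   = nnz(buf == char(10));
expected = nCols * nLines;

if numel(vec) ~= expected
    warning('Parsed %d values, expected %d.', numel(vec), expected);
end
```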

2 Comments

It may be an architecture / version thing. I will update my original post with those details, but here they are:
64-bit Ubuntu Linux v12.04, Intel Core i7-3570K, 3.4GHz x 4, 8GB RAM, MATLAB R2012a
Re: BufSize - this is in the documentation for textscan in this version of MATLAB, although changing it does not appear to have any effect on performance or on this bug.
Also: I just re-ran the same code on a different file with four columns of float and got the same result at the exact same location, 107374182 bytes.
You might want to report it to tech support.
BufSize is documented in the file textscan.m of R2013b, but not in the "Help Browser Documentation". I have not checked the PDF file.
Parameter   Value                            Default
---------   -----                            -------
BufSize     Maximum string length in bytes   4095

Asked: 3 Mar 2015

Edited: 3 Mar 2015
