Error using datastore: "Cannot detect TextscanFormats from..." - Limit on number of characters in header row

I came across an error that I traced down through datastore.m and lower level code. I have a folder on my hard drive with a number of .csv files in it. The .csv files have about 3000 columns. The first row is a header row with variable names for each column. I use the datastore command as follows:
ds = datastore([csv_dir '\*.csv']);
The error in my code traces to this line, and the error reported from datastore.m is as follows (path removed):
Error using datastore (line 114)
Cannot detect TextscanFormats from C:\...\....csv.
So I started stepping into the code, and the error ultimately traces down to the function readVarFormat (in readVarFormat.m). This function opens and reads data from the first .csv file in the given location and detects variable names and number formats. A variable in this function, DEFAULT_PEEK_SIZE, is set to 16384, the number of bytes (or characters) that the function reads in.
% data to detect from
stringData = read(channel, DEFAULT_PEEK_SIZE);
% close the channel as its no longer needed we assume, 16KB is enough data
% to detect the format from large values of headerLines may not play nicely
% with this
close(channel);
The DEFAULT_PEEK_SIZE limit is the cause of the error: "stringData" holds only the first 16384 characters of the file, which is not enough to capture the entire first row of my data. Yes, the variable names are very long, with good reason. Starting at line 70, the variable names are extracted from the string. Obviously, the fact that "stringData" does not contain the entire header row is going to be a problem, but the code never even gets that far.
% extract variable names, just like readtable
if readVarNames
    % check if data is available to detect variable names from, error could
    % be improved
    isDataAvailable(stringData, file, 'VariableNames');
    % Read in the first line of var names as a single string, skipping any
    % leading blank lines and header lines. This call handles non-default
    % row delimiters like : for example ignoring delimiter and whitespace.
    % This call also accepts CommentStyle as we want to skip comment lines.
    [raw,strIdx] = textscan(stringData, '%s', 1, 'Delimiter', '', ...
                            'Whitespace', '', 'Headerlines', hdrLines, ...
                            'EndOfLine', ds.RowDelimiter, ...
                            txtScanArgsforIntroSpection{:});
    hdrLines = 0; % just skipped them
    if isempty(raw{1}) || isempty(raw{1}{1})
        error(message('MATLAB:datastoreio:tabulartextdatastore:noDataToRead', ...
            file, 'VariableNames'));
    else
        vnline = raw{1}{1};
    end
end
The textscan call copies everything in "stringData" up to the first end-of-line character. In this case there is no end-of-line character, so "stringData" is simply copied to "raw", and "strIdx" becomes 16384, the index of the last character read in.
Finally, we come to the actual error, when the code tries to detect the format of the data after the first row. In the first line inside the if statement below, "stringData(strIdx+1:end)" — that is, "stringData(16384+1:end)" — evaluates to an empty string. The isDataAvailable call on the next line invokes an internal function that errors if "stringData" is empty, and that is the error I get.
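The behavior is easy to reproduce outside of the datastore internals. A minimal standalone sketch (the peeked text here is synthetic, not my real header) showing that textscan with one '%s' field and an empty delimiter consumes the entire string when it contains no end-of-line character:

```matlab
% Simulate a 16 KB peek that ends partway through a long header row:
% there is no newline anywhere in the peeked text.
peek = repmat('a', 1, 16384);

% Same style of call as readVarFormat: one string field, no delimiter,
% no whitespace stripping, stop at the first end-of-line.
[raw, strIdx] = textscan(peek, '%s', 1, 'Delimiter', '', ...
                         'Whitespace', '', 'EndOfLine', '\r\n');

% With no end-of-line present, the whole peek is consumed:
% raw{1}{1} holds all 16384 characters and strIdx is 16384, so a
% subsequent peek(strIdx+1:end) is empty -- the failure described above.
```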
% extract formats
if detectFormat
    % check if data is available to read formats from
    stringData = stringData(strIdx+1:end);
    isDataAvailable(stringData, file, 'TextscanFormats');
    % Guess a format string for the dataline by reading it as a single
    % string, skipping any leading blank lines. This call handles
    % non-default row delimiters like (':') for example, ignoring delimiter
    % and whitespace. This call also accepts CommentStyle as we want to
    % skip comment lines.
    raw = textscan(stringData, '%s', 1, 'Delimiter', '', ...
                   'Whitespace', '', 'Headerlines', hdrLines, ...
                   'EndOfLine', ds.RowDelimiter, txtScanArgsforIntroSpection{:});
    if isempty(raw{1}) || isempty(raw{1}{1})
        error(message('MATLAB:datastoreio:tabulartextdatastore:noDataToRead', ...
            file, 'TextscanFormats'));
    else
        % determine the format string from the first line
        formatStr = matlab.internal.table.determineFormatString(raw{1}{1}, ...
            delim, whitespace, treatAsMissing, txtScanArgsforIntroSpection);
        % convert to a struct
        fStruct = matlab.iofun.internal.formatParser(formatStr);
        outFormatCell = fStruct.Format;
        skippedVec = zeros(1,numel(outFormatCell));
    end
If there were fewer than 16384 characters in the header row of my .csv file, the first textscan call (when "strIdx" is set) would have read those first row characters into "raw" and set "strIdx" to the number of characters in the first row. Then, "stringData = stringData(strIdx+1:end)" would have set "stringData" to all of the text in "stringData" from the start of the second row on (to 16384). Depending on the number of characters in the header row and the second row, the complete second row (first row of data) may not fit into the first 16384 characters of the file. You would pass the isDataAvailable check, but the subsequent textscan call would read in an incomplete row of data.
Why would "DEFAULT_PEEK_SIZE" be set to 16384? The comment shortly after (repeated below) indicates that 16KB is "enough" data. Clearly, that is only true if you can read in the header row and the first row of data in that 16KB. When would memory actually become an issue? That is the only reason I can think of that would drive that limit. The comment even states that this "may not play nicely" with large amounts of header data.
% close the channel as its no longer needed we assume, 16KB is enough data
% to detect the format from large values of headerLines may not play nicely
% with this
As a workaround, I suppose I can save my own version of readVarFormat.m and modify the DEFAULT_PEEK_SIZE or use a different IRI data read method, but all of the files between datastore.m and readVarFormat.m use @ folders. Even if I duplicated the entire folder structure, wouldn't MATLAB's versions take precedence? Of course, I could change MATLAB's version directly to get around that (after saving a copy, of course).
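One way to check which copy MATLAB would actually resolve, before bothering to shadow anything, is `which` with the `-all` flag, which lists every match on the path in precedence order (first match wins):

```matlab
% List every datastore on the path, in the order MATLAB resolves them.
% If a local copy does not appear first, it will not take precedence.
which datastore -all
```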
I don't know if 16384 was pulled out of thin air as "way more than anyone would need," or if there was a real reason.
Matt

Answers (1)

Aaditya Kalsi on 19 Aug 2016
First of all, that's a very thorough analysis, so good on you for making it through all that code.
I believe this default was chosen because datastore construction should be fast, and that speed has to be weighed against always getting detection right.
The bottleneck is more around the I/O than memory allocation.
You could change DEFAULT_PEEK_SIZE yourself, or file a request with Technical Support to allow a preference or some user-settable value.
  1 Comment
Matt on 20 Aug 2016
I changed DEFAULT_PEEK_SIZE to 256 KB. My header data was just over 100 KB. That eliminated the error, and I saw no change in performance and no other errors or warnings. The datastore was still created very quickly, and I did email Technical Support about the issue. I'll see what they have to say.
Something I didn't mention: if you provide both the variable names and the data formats when you create the datastore, you can bypass the if statements that cause problems with DEFAULT_PEEK_SIZE set to 16 KB. You have to provide both, and even so, there does not appear to be any reason to require that just to handle large header data.
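For anyone trying that route, a rough sketch of the kind of call I mean. The names and formats here are placeholders for illustration (a real call would supply your actual ~3000 column names and per-column formats, presumably built programmatically):

```matlab
% Hypothetical example: supply VariableNames and TextscanFormats up front
% so datastore never has to detect either from the 16 KB peek.
names   = arrayfun(@(k) sprintf('Var%d', k), 1:3000, ...
                   'UniformOutput', false);   % placeholder names
formats = repmat({'%f'}, 1, 3000);            % assume all-numeric columns

ds = datastore([csv_dir '\*.csv'], ...
    'VariableNames', names, ...
    'TextscanFormats', formats);
```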

