MATLAB is unable to parse a numeric field when I use the gather function on a tall array.

I have a CSV file with a large number of data points on which I want to run a particular algorithm. I created a tall array from the file and wanted to import a small chunk of the data at a time. However, when I tried to use gather to bring the small chunk into memory, I got the following error.
"Board_Ai0" is the header of the CSV file. It is not in present in row 15355 as can be seen below where I opened the csv file in MATLAB's import tool.
The same algorithm works perfectly fine when I don't use a tall array but instead import the whole file into memory. However, I have other, larger CSV files that I also want to analyze but that won't fit in memory.
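A minimal sketch of the workflow described above (the exact code was not shown in the question; the file name and chunk size are illustrative):

```matlab
% Sketch of the tall-array workflow described above (names are illustrative).
ds = tabularTextDatastore("1kcross.csv");  % datastore over the large CSV
data = tall(ds);                           % build a tall array on top of it
slice = data(1:200000, :);                 % lazily select a small chunk
slice = gather(slice);                     % the parse error occurs here
```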
UPDATE: Apparently the images were illegible, but someone else edited the question to enlarge them, so I guess it should be fine now. Also, I can't attach the data files to this question because the files that give me this problem are all larger than 5 GB.

12 Comments

The images are virtually illegible and I can't do anything with them; copy and paste the text from the command line instead.
It would be best to post a working example that demonstrates the problem with a small file...
Hi,
I find it remarkable that the error stems from reading "starting at offset 704643103". This may indicate that the problematic line is not 15355, but somewhere further down.
How did you import the whole file into memory? If you have accomplished this differently, please also try:
data = readall(ds);
and let us know the results.
Best wishes,
Harald
That's weird. The CSV has only 200,000,000 rows.
I imported the whole file into memory using:
data = readtimetable(datalogfile);
When I tried to use readall, I got the following error:
This is interesting as it now complains about a different row, still with the same offset.
We could probably track down what the problem with the file is by using read in a loop, but the question will be what you can then do about it.
Since readtimetable works, it may be more promising to try this instead:
ds = fileDatastore("yourfile.csv", "ReadFcn", @readtimetable, "UniformRead", true);
and then try to proceed with tall, gather, ...
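The diagnostic read loop mentioned above could be sketched as follows (untested; the chunk bookkeeping is illustrative):

```matlab
% Read the file chunk by chunk to locate the chunk that fails (sketch).
ds = tabularTextDatastore("yourfile.csv");
k = 0;
while hasdata(ds)
    k = k + 1;
    try
        T = read(ds);                                 % read the next chunk
    catch err
        fprintf("Chunk %d failed: %s\n", k, err.message);
        break
    end
end
```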
Forget about importing the file for a moment:
  1. Open the file using a reliable text editor, e.g. Notepad++
  2. Ensure that you have the text editor set so that it shows all characters
  3. Scroll down to that line
  4. Take a screenshot
  5. Upload the screenshot here in a new comment
"I can't attach the data files to this question because the data files that give me this problems are all larger than 5 GB."
I was the one who suggested an MWE (minimum working example); my thinking was that you should be able to duplicate the problem with tall arrays on smaller files as well, if it is an inherent bug. Or, finding out the size of the file that first triggers the issue could also be useful debugging info...
Thanks to whoever enlarged the image; my old eyes couldn't come close, and for some reason I couldn't seem to make it larger...
@Ninad, in case you have not noticed it, I have posted another comment and suggested a workaround to try:
Also, I think the challenge is that the offset is possibly given in bytes, making it hard to identify the problematic lines.
@Harald I used your workaround and was able to successfully analyze the complete data file that I mentioned above; this is the data file that fits completely in memory. However, when I tried to analyze a data file larger than what fits in memory, I got the following error.
The code that I ran to get the above error is,
datalogfile = '1kcross.csv';
ds = fileDatastore(datalogfile, "ReadFcn", @readtimetable, "UniformRead", true, "PreviewFcn", @givetimetable);
data = tall(ds);
slice2 = data{1:200000,:};
slice2 = gather(slice2);

% Local functions must come after the script code.
function out = givetimetable(~)
out = timetable(seconds(2), 0.139389038085938, -0.103683471679688);
end
@dpb As of now, the smallest file that gives me the error originally mentioned in the question is 2.26 GB.
This makes sense: as it stands, MATLAB tries to import the entire file at once. I should have mentioned that.
You need to set 'ReadMode' to 'partialfile' and specify a 'ReadFcn' that imports a certain number, say 100,000, of rows at a time. It could then look like this:
ds = fileDatastore("yourfile.csv", "ReadFcn", @readdata, "UniformRead", true, "ReadMode","partialfile");
data = readall(ds);
function [data,startrow,done] = readdata(filename,startrow)
nRows = 100000;                               % rows to read per chunk
if isempty(startrow)
    startrow = 2;                             % skip the header line on the first call
end
opts = detectImportOptions(filename);
opts.DataLines = [startrow, startrow+nRows-1];
data = readtimetable(filename, opts);
done = height(data) < nRows;                  % check before rmmissing removes rows
data = rmmissing(data);                       % drop rows with unparseable fields
startrow = startrow + nRows;
end
Best wishes,
Harald
So the code works well when I run it on a file that can fit in memory. But when I run it on a file that cannot, I get the following error:
The code is:
ds = fileDatastore("1kcross.csv", "ReadFcn", @readdata, "UniformRead", true, "PreviewFcn", @givetimetable, "ReadMode", "partialfile");
data = tall(ds);
slice = data(1:10000000,:);
slice = gather(slice);

% Local functions must come after the script code.
function [data,startrow,done] = readdata(filename,startrow)
nRows = 10000000;
if isempty(startrow)
    startrow = 2;
end
opts = detectImportOptions(filename);
opts.DataLines = [startrow, startrow+nRows-1];
data = readtimetable(filename, opts);
done = height(data) < nRows;                  % check before rmmissing removes rows
data = rmmissing(data);
startrow = startrow + nRows;
end

function [data,startrow,done] = givetimetable(~,~)
data = timetable(seconds(200.000005), 0.139389038085938, 'VariableNames', ["Board0_Ai0"]);
startrow = 2;
done = true;
end
What am I still doing wrong?
FYI, a TALL array is meant to allow you to operate on the entire table, even if it doesn't fit into memory. If you want to work on chunks of the file, don't use TALL.
Using datastore directly will let you read chunks of data.
ds = tabularTextDatastore(files,....)
while hasdata(ds)
T = read(ds)
% Do stuff.
end
However, that's not going to solve the problem because you have rows that don't contain numeric data. tabularTextDatastore doesn't allow for that.
I like @Harald's solution--but with some modification. I'd avoid calling detectImportOptions every iteration. For a datastore to work, the schema should be the same each time.
function [data,startrow,done] = readdata(filename,startrow)
persistent opts                               % detect options once, reuse on later calls
if isempty(opts)
    opts = detectImportOptions(filename);
end
nRows = 10000000;
if isempty(startrow)
    startrow = 2;
end
opts.DataLines = [startrow, startrow+nRows-1];
data = readtimetable(filename, opts);
done = height(data) < nRows;                  % check before rmmissing removes rows
data = rmmissing(data);
startrow = startrow + nRows;
end
There is still a problem with this: using opts.DataLines to manage the chunks still forces MATLAB to read all lines up to startrow in order to know where to start. That means each subsequent read will be slower.


 Accepted Answer

Hi @Ninad, thanks for sharing the file. I see that your .csv includes the variable names in some data rows.
To handle this, you can use the TreatAsMissing property of tabularTextDatastore to treat those rows as NaN:
ds = tabularTextDatastore("1kwogndrd1.csv", TreatAsMissing={'Time','Board0_Ai0'}, SelectedVariableNames={'Board0_Ai0'});
Then you can gather slices of the data without errors:
t = tall(ds);
slice = t.Board0_Ai0(1:200000);
slice = gather(slice);
If you want to calculate the mean of the entire column, I recommend computing it first and then gathering the result. Use "omitnan" to exclude the NaN rows introduced by TreatAsMissing:
m = mean(t.Board0_Ai0, "omitnan");
gather(m)
Hope this helps!

1 Comment

I"d just use the datastore directly instead of using TALL. Assuming you want the mean of each chunk.
ds = tabularTextDatastore("1kwogndrd1.csv",TreatAsMissing={'Time','Board0_Ai0'},SelectedVariableNames={'Board0_Ai0'});
M = {};
while hasdata(ds)
data = read(ds);
M{end+1} = mean(data);
end
If you want the mean of the entire variable, TALL doesn't need to be chunked.
ds = tabularTextDatastore("1kwogndrd1.csv",TreatAsMissing={'Time','Board0_Ai0'},SelectedVariableNames={'Board0_Ai0'});
data = tall(ds)
M = mean(data)
M = gather(M)


More Answers (2)

It appears it is detectImportOptions that is having the problem -- apparently it tries to read the whole file into memory first before it does its forensics.
I don't think you need an import options object anyway; use the 'Range' named parameter in the call to readtimetable.
Something like
function [data,startrow,done] = readdata(filename,startrow)
nRows = 10000000;
if isempty(startrow)
startrow = 2; % this looks unlikely to be right; from the earlier image there are 3(?) header rows
end
range = sprintf('%d:%d', startrow, startrow+nRows-1); % build row range expression for nRows rows
data = readtimetable(filename, 'Range', range);
done = height(data) < nRows;  % check before rmmissing removes rows
data = rmmissing(data);
startrow = startrow + nRows;
end
This may still have issues with the timetable, however, if it tries to read variable names from a header line that isn't present in the subsequent sections of the file. I don't know what trouble you'll run into with such large files if you try to read 100K lines into the file but tell it to also read the variable names from the second or third line of the file. Probably ignoring variable names and letting MATLAB use defaults, then setting Properties.VariableNames after reading (or just accepting the defaults), would be the best bet.

5 Comments

I ran into the same issue.
The code was as follows (readtimetable changed to readtable because MATLAB couldn't recognize the row times):
ds = fileDatastore("1kcross.csv", "ReadFcn", @readdata, "UniformRead", true, "ReadMode", "partialfile", "PreviewFcn", @givetable);
data = tall(ds);
slice = data(1:200000,:);
slice = gather(slice);
mean(slice{:,2})

% Local functions must come after the script code.
function [data,startrow,done] = readdata(filename,startrow)
nRows = 200000;
if isempty(startrow)
    startrow = 2;
end
range = sprintf('%d:%d', startrow, startrow+nRows-1);
data = readtable(filename, 'Range', range);
done = height(data) < nRows;                  % check before rmmissing removes rows
data = rmmissing(data);
startrow = startrow + nRows;
end

function [data,startrow,done] = givetable(~,~)
data = table(seconds(200.000005), 0.139389038085938, 0.139389038085938);
startrow = 2;
done = true;
end
I was afraid of that happening; it looks like the readXXX family all try to read up to the requested area, although the message says it's still on pass 1.
I didn't mention it before because I presumed @Harald knew what he was doing, but I'm not sure about the logic for the first versus later calls in the branching inside readdata: how does gather() know to call it without the last argument the first time but with it subsequently? I don't see such code anywhere. That, to me, seems likely to be a problem, but I'm not sure that's what's wrong here.
Have you tested that the readdata function works if called directly from the command line? Make sure it works for both a first call and a second call first; then worry about whether it can deal with gather.
It would be interesting to print out the Range value when it errors.
@Ninad, sorry that my suggestion did not work, and for the trouble around this. I would usually test my suggestions, but that is difficult without having the data.
@dpb, while I work at MathWorks, I am not a developer or in Technical Support. I try to support Answers as my core duties permit.
@Harald, no problem, just commenting on why I hadn't poked harder, earlier...
If @Ninad would attach a short section of a file, it would indeed make this simpler. It's not convenient at the moment to stop a debugging session and create a local copy of a similar file to play with/poke at.
The documentation isn't all that helpful; the only examples I can find using tables/timetables with tall arrays involve tiny data files and don't use fileDatastore, so they don't have a callback function returning a table. I don't believe there is an example of the combination...


Providing the RANGE argument does not prevent READTABLE from running its automatic format detection, which might involve loading all or a significant part of the file into memory. The documented solution is to provide an import options object yourself (e.g. you can generate one from a known-good file of a smaller size and then store it), or alternatively to use a low-level file-reading command, e.g. FSCANF, FREAD, etc.
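One way to follow this suggestion, sketched under the assumption that a small known-good file with the same layout exists (file names are illustrative):

```matlab
% Build import options once from a small, known-good file and reuse them (sketch).
opts = detectImportOptions("small_sample.csv");  % cheap: runs on the small file only
save("importopts.mat", "opts");                  % store for later sessions

% Later, reuse the saved options on the large file:
S = load("importopts.mat", "opts");
S.opts.DataLines = [2, 100001];                  % read one explicit chunk of rows
data = readtimetable("1kcross.csv", S.opts);
```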

4 Comments

Good point, @Stephen23. I had figured it surely was smart enough to only read a small amount, but maybe not. I had started to suggest the standalone import options object, but thought surely it would still function without one, and certainly not run out of memory on the scan forensics. I suppose even if that did succeed, it could still have the file in memory and have no more available.
I guess one can try cutting down the range to see if it works on smaller chunks, but I suspect MathWorks didn't really think about tables as tall data all that much, and it will probably be necessary to revert to low-level I/O. I was hoping against hope to hold off on that route...
So I tried going the import options route, and MATLAB crashed. Anyway, I have decided to stop working on this problem and just buy more RAM.
@dpb Since you mentioned on the other answer that having the data file would make things easier, I am sharing a data file which gave me problems here:
https://drive.google.com/file/d/162QUEpXudcHb5sE1IQxDLdUuu_RcCkGw/view?usp=sharing
I was suggesting attaching a piece of the file (perhaps zipped to include a little more). That would give folks enough to test with that duplicates the actual format.
What, precisely, does "MATLAB crashed" mean? Actually aborted MATLAB itself or another out-of-memory or ...?
MATLAB crashed means the MATLAB window closed mid-run. Then a MathWorks Crash Reporter window opened asking me to send a crash report to MathWorks.


Release

R2025a
