Please help me create a tall array from a large binary file and "fileDatastore" without running out of memory.
I have a large data file (the one I am working with now is ~60 GB, though a few hundred GB is typical) that I want to turn into a tall array. I am hoping this will let me quickly perform calculations on the data without loading it into memory. The data is in a custom format, so it seems I am stuck using fileDatastore with a custom read function.
Making the datastore is not a problem, but every time I try to load it I run out of pagefile memory (and I have already made my pagefile as big as possible on Windows 10). The issue seems to be that MATLAB temporarily loads the full datastore into memory before the tall array can be made. It would (supposedly) free up the memory after the file was completely read, but it never gets there. This is because I cannot find any way to tell fileDatastore to read only part of the data at a time. The other datastore types have a "ReadSize" property that seems to do this, but it is missing from fileDatastore's valid options. The ReadFcn I am using is set up to partially read the data correctly (I could easily tell it to read the next X values from the current position); I just don't know how to make fileDatastore pass along a second parameter with this information (the first parameter is the file name).
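For reference, my construction looks roughly like this (the file name and the read function name are placeholders for my actual custom-format reader, not working code):
% Rough sketch of the current setup:
ds = fileDatastore('C:\data\hugedata.bin', 'ReadFcn', @readCustomFormat);
t  = tall(ds);   % fileDatastore hands the *whole* file to ReadFcn, so even
                 % constructing/previewing the tall array needs all ~60 GB
                 % in memory - this is where it falls over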
I imagine I could manually break the data up into separate datastores and then somehow combine them into the same tall array, but this 1) would be rather tedious to do every time I want to make a fileDatastore, and 2) would, I imagine, hurt the deferred-execution feature, since (I'd guess) MATLAB would try to optimize reading from each small sub-datastore individually rather than optimizing for the whole data file. As such, I'd much rather find a way to do this from a single fileDatastore.
PS If any MathWorks staff see this - please suggest to the development team that they fix this. Granted, I am using my personal computer for this, not a cluster with a terabyte of RAM, but it is kind of ridiculous that a computer with an i7 and 16 GB of RAM running MATLAB's "latest and greatest big data solution" can't handle a ~60 GB file without crashing. I can't imagine it would take someone familiar with the source code more than a few hours to add an option along the lines of "pass this number to your read function to decide how much it should read at a given time" (or something similar).
1 Comment
Hatem Helal on 6 Dec 2018 (edited 6 Dec 2018)
How are your large binary files generated? It would also be worth evaluating whether you can modify that tool/process to instead create a folder of files that together represent your large dataset. For example, a folder with 60 files of ~1 GB each can be trivially partitioned for parallel analysis. This is a widely used best practice for storing large datasets, and it would let you comfortably analyze your data on your personal computer.
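Something along these lines could repartition an existing file into roughly 1 GB pieces (the file and folder names are illustrative, and in practice chunkBytes should be a multiple of your format's record size so that no record is split across two files):
% Illustrative repartitioning of one big binary file into ~1 GB pieces.
chunkBytes = 1e9;                                   % target size per piece
if ~exist('chunks', 'dir'), mkdir('chunks'); end
fin = fopen('hugedata.bin', 'r');
k = 0;
while true
    buf = fread(fin, chunkBytes, 'uint8=>uint8');   % read up to ~1 GB
    if isempty(buf), break; end                     % stop at end of file
    k = k + 1;
    fout = fopen(fullfile('chunks', sprintf('part_%04d.bin', k)), 'w');
    fwrite(fout, buf);
    fclose(fout);
end
fclose(fin);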
Accepted Answer
Edric Ellis on 10 Jul 2017
In R2017a, fileDatastore is currently restricted to reading entire files at a time. This is a known limitation of the current implementation, and this is definitely something we hope to be able to address in a future release of MATLAB. For now, unfortunately the only workaround is to split your data into multiple files so that each file can be loaded without running out of memory. You can use a single fileDatastore instance with multiple data files, as shown in the first example on the fileDatastore reference page.
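A minimal sketch of that workaround, assuming the data has already been split into a folder of smaller files and that your read function returns vertically concatenable data (the folder, file pattern, and read function names below are placeholders):
% One fileDatastore over many smaller files; each file is still read in
% full by ReadFcn, but each one now fits comfortably in memory.
ds = fileDatastore(fullfile('chunks', '*.bin'), 'ReadFcn', @readCustomFormat);
t  = tall(ds);          % deferred evaluation
m  = gather(mean(t));   % computation happens here, one file at a time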