Importing large datasheet while ignoring randomly placed non-numeric data
I am trying to import a large data set, usually around 10,000 rows by 200 columns. The majority of the entries are numeric, but there will be random non-numeric entries in the set. Is there a way to just exclude all non-numeric entries when importing the data with textscan?
I've tried using the pattern-matching exclude option for textscan, but it gives me an "Undefined function or variable 'formatSpec'" error.
filename = 'data.us3';
formatfpec = '%16f%16f%16f....%16f%f%[^\*]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'TextType', 'string', 'ReturnOnError', false);
fclose(fileID);
data = cell2mat(dataArray);
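(Aside: in the snippet above the format string is assigned to formatfpec but textscan is called with formatSpec, which by itself would produce the reported error. A minimal corrected sketch follows; the 199 repeated %16f fields and the 'data.us3' filename are assumptions standing in for the elided "...." in the question, not the actual file layout:)

```matlab
% Sketch only: the field count (199) is an assumed stand-in for the
% elided "...." in the original format string.
filename = 'data.us3';
formatSpec = [repmat('%16f', 1, 199) '%f%[^\*]'];  % same name as used in textscan below

fileID = fopen(filename, 'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', ...
    'WhiteSpace', '', 'TextType', 'string', 'ReturnOnError', false);
fclose(fileID);

% Drop the trailing text field before cell2mat; mixing the string column
% with the numeric columns would make cell2mat error out.
data = cell2mat(dataArray(1:end-1));
```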
I'm trying to make this robust as well, so that non-numeric text is removed regardless of its position in the datasheet.
Any help is appreciated.
Answers (1)
dataArray=cell2mat(textscan(fileID,'','TreatAsEmpty', repmat('*',1,15),'CollectOutput',1));
Make sure my old eyes counted the correct number of asterisks...
This will return a double array instead of cell array w/ NaN for the overflow fields.
While not the issue in reading the file, this is a terrible format; some data are not zero but have as few as one significant digit while large values are missing entirely owing to field overflow.
Do you have any control at all over the process that generates these files? Having them switch to a G16.9 format would eliminate all those problems, including the overflow fields you have to parse around.
5 Comments
Alex Szerszen on 14 Sep 2018 (Edited: Alex Szerszen on 14 Sep 2018)
I just typed "air code" in the edit window, so it's quite possible I got a misplaced punctuation mark. (+)
A small section of the actual file would be the only way to really debug what might be going on... the above should(*) work just as desired for a "plain vanilla" file. Can't do anything useful from an image w/o having to create a test file, and even then there's no knowing there wasn't something unseen in the file that was important.
(*) Albeit, I guess I'm not sure I've tested the empty '' format option together with 'TreatAsEmpty' to see if there's an interaction there, maybe. The magic incantation of an empty format string for an all-numeric file has the benefit of textscan automagically returning the data in the proper records-by-elements-per-record array w/o having to count fields and write a specific, counted format string, which is a real pit(he proverbial)a(ppendage).
ADDENDUM (+) Oh, I see what I did... I came back and edited the original, and when I did I added the 'CollectOutput' named pair at the end, thus putting the closing parenthesis in the wrong place. Not that it matters why, specifically, ... :)
dpb on 14 Sep 2018
Oh... is the repmat count correct? Couldn't that possibly be a cause of the failure, maybe?
Is there any other text in the file besides the overflow indicator shown in the sample image? That would definitely cause a problem.
Alex Szerszen on 14 Sep 2018
dpb on 14 Sep 2018
There must be something else in the file, then; I did a simple test case and it seems to work as expected--
>> type alex.dat
0.949370243 0.647537226 0.038262460 0.123355203 0.293973863 0.680998914 0.644282796 0.187649735 0.416803602 0.782348041
0.112291675 0.413806386 0.776160595 **** 0.388473403 0.049865826 0.088877801 0.087513561 0.175390865 0.615052207
0.657602296 0.139910171 0.710691529 0.975162104 0.152392633 0.092873134 0.227578005 0.088578619 0.707417721 0.971135525
>> fid=fopen('alex.dat','r');
>> z=cell2mat(textscan(fid,'','TreatAsEmpty',{'****'},'CollectOutput',1,'ReturnOnError',0));
>> whos z
Name Size Bytes Class Attributes
z 3x10 240 double
>> z
z =
0.9494 0.6475 0.0383 0.1234 0.2940 0.6810 0.6443 0.1876 0.4168 0.7823
0.1123 0.4138 0.7762 NaN 0.3885 0.0499 0.0889 0.0875 0.1754 0.6151
0.6576 0.1399 0.7107 0.9752 0.1524 0.0929 0.2276 0.0886 0.7074 0.9711
>> fid=fclose(fid);
>>