Reading in ascii files with white space as delimiter.

James Russell on 9 Nov 2015
Commented: Star Strider on 13 Nov 2015
I am trying to read in a very simple ascii file that looks like the following:
hPa m C C % g/kg deg knot K K K
994.0 270 7.0 6.0 93 5.93 40 10 280.6 297.1 281.6
989.0 312 6.2 5.2 93 5.64 42 12 280.2 295.9 281.2
972.0 455 4.8 4.0 95 5.27 48 18 280.2 294.9 281.1
There seem to be a dozen functions that I can read this in with but I'm struggling with all of them.
The simplest seems to be dlmread. I'm currently using the command:
M = dlmread('radiosonde.ascii',' ',3,1)
However this seems to register a single space as the delimiter instead of all the white space. If I use:
M = dlmread('radiosonde.ascii')
It registers the white space as the delimiter but I cannot tell it to ignore the headers. Is there some way to specify white space as the delimiter while also ignoring the headers?
Is there a better way to do this? Why hasn't Mathworks streamlined reading text files to be one universal function?

Answers (2)

Kevin Claytor on 9 Nov 2015
importdata seems to work pretty well (but doesn't directly get you the headers):
importdata('radiosonde.ascii', ' ', 3)
If you know the exact format, textscan is what the auto-generated code uses (right click > Import Data):
fileID = fopen('radiosonde.ascii', 'r');
startRow = 4;  % first data row, i.e. three header lines are skipped
formatSpec = '%7s%7s%7s%7s%7s%7s%7s%7s%7s%7s%s%[^\n\r]';  % fixed-width fields read as strings
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'HeaderLines', startRow-1, 'ReturnOnError', false);
fclose(fileID);

dpb on 9 Nov 2015
Edited: dpb on 13 Nov 2015
"The better way..."
I hadn't noted before the symptom of repeated delimiters with dlmread; agreed that's a pit[proverbial]a[ppendage].
IMO, it's unfortunate that TMW has chosen to deprecate textread in favor of textscan; textread has the advantages that it
  1. returns a "regular" double array instead of only a cell array,
  2. doesn't need the extra fopen/fclose step where a single file read suffices, and
  3. as shown below, "counts" the record length and returns the correct shape automagically, whereas textscan has to be told or one has to reshape the returned array.
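The textread call these points refer to is not shown here. A minimal sketch of what such a call could look like follows, assuming textread is still available in your release. The sample file it writes, the three-line header layout, and the column names in it are made up for illustration; only the three data rows come from the question.

```matlab
% Build a small sample file (hypothetical header layout, 3 header lines).
lines = { ...
    '-----------------------------------------------------------------------------', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', lines{:});
fclose(fid);

% Empty format string: textread returns a single numeric matrix.
% No fopen/fclose is needed for the read itself; textread takes the file name.
M = textread(fname, '', 'headerlines', 3);

delete(fname);
```

M comes back as a plain double matrix with one row per data record, so no cell2mat or reshape step is needed.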
The above equivalent in textscan would be
fid=fopen('radiosonde.ascii','r');
x=cell2mat(textscan(fid,repmat('%f',1,11), ...
'delimiter',' ', ...
'headerlines',3, ...
'multipledelimsasone',1));
fclose(fid);
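A self-contained variant of the textscan route, as a sketch only: it leaves out the 'delimiter' option and relies on textscan's default behavior of treating a run of white space as one delimiter. The sample file, its three header lines, and the column names are assumptions for illustration; the count of 11 columns matches the data in the question.

```matlab
% Build a small sample file (hypothetical header layout, 3 header lines).
lines = { ...
    '-----------------------------------------------------------------------------', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', lines{:});
fclose(fid);

% Read 11 numeric columns, skipping the header lines.
% 'CollectOutput' merges the 11 columns into one matrix inside a 1x1 cell.
fid = fopen(fname, 'r');
C = textscan(fid, repmat('%f', 1, 11), 'HeaderLines', 3, 'CollectOutput', true);
fclose(fid);
M = C{1};   % unwrap the cell to get the double matrix

delete(fname);
```

Here the number of columns has to be supplied through the format string, which is the "has to be told" drawback mentioned above.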
textscan is the one general function, but there are so many possibilities (as in infinite) to cover that making something both general and flexible is difficult; hence the specialized functions for specific cases. It does seem as though the multiple delimiters option would be a worthwhile enhancement for TMW to make; as noted, I hadn't actually seen that behavior previously because I tend to use the textread route for the above reasons.
There are things textread can't do that textscan can (being callable on the same file multiple times is a major one), but instead of deprecating it, TMW should bring it up to the level of textscan, imo. Alternatively, there is the option I've asked for since textscan was introduced: an optional ability in textscan to return the double array directly and to accept a file name as well as a file handle.
Actually, on reading the source for dlmread I noticed something I hadn't seen before (and I don't think it's documented, at least not well): if one passes an empty string as the format string, textscan works out the number of fields per input record internally and returns data of that shape, which converts to a regular numeric array. That is a super result that should be shouted from the rooftops by TMW but seems to be a closely held secret:
>> cell2mat(textscan(fid,'','collectoutput',1,'headerlines',3))
ans =
Columns 1 through 8
994.0000 270.0000 7.0000 6.0000 93.0000 5.9300 40.0000 10.0000
989.0000 312.0000 6.2000 5.2000 93.0000 5.6400 42.0000 12.0000
972.0000 455.0000 4.8000 4.0000 95.0000 5.2700 48.0000 18.0000
Columns 9 through 11
280.6000 297.1000 281.6000
280.2000 295.9000 281.2000
280.2000 294.9000 281.1000
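For anyone who wants to try the empty-format call end to end, here is a minimal sketch. The sample file, the three-line header layout, and the column names are assumptions for illustration; the data rows are the ones from the question.

```matlab
% Build a small sample file (hypothetical header layout, 3 header lines).
lines = { ...
    '-----------------------------------------------------------------------------', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', lines{:});
fclose(fid);

% Empty format string: the column count is not given anywhere in this call.
fid = fopen(fname, 'r');
M = cell2mat(textscan(fid, '', 'CollectOutput', 1, 'HeaderLines', 3));
fclose(fid);

nCols = size(M, 2);   % number of fields textscan found per record

delete(fname);
```

The point of the sketch is that nCols is discovered from the file and not hard-coded, unlike the repmat('%f',1,11) form.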
dpb on 12 Nov 2015
BTW, the above working for the example file is sorta happenstance; the documentation also includes the caveat
All data in the input file must be numeric. dlmread does not operate
on files containing nonnumeric data, even if the specified rows and
columns for the read contain numeric data only.
The example file is an anomaly that does, in fact, read correctly when the headers are skipped; not all files will. I've not pursued this part in depth, but undoubtedly it has to do with the delimiter search: when asked to determine the delimiter, dlmread reads an arbitrary 4096 characters, searches within them, and makes assumptions based on that which may turn out to be incorrect for a general line.
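One way to avoid depending on any delimiter detection is to consume the header lines explicitly and then read only numbers. The following is a sketch under stated assumptions, not a drop-in replacement for dlmread: it assumes exactly three header lines and 11 numeric columns, and the sample file and column names in it are made up for illustration.

```matlab
% Build a small sample file (hypothetical header layout, 3 header lines).
lines = { ...
    '-----------------------------------------------------------------------------', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', lines{:});
fclose(fid);

nHeader = 3;    % assumed number of header lines
nCols   = 11;   % assumed number of numeric columns

fid = fopen(fname, 'r');
hdr = cell(nHeader, 1);
for k = 1:nHeader
    hdr{k} = fgetl(fid);          % keep the header text; it is not parsed as numbers
end
% fscanf fills column-wise, so read nCols-by-N and transpose to N-by-nCols.
M = fscanf(fid, '%f', [nCols Inf]).';
fclose(fid);

delete(fname);
```

Since '%f' skips any amount of white space between numbers, the amount of spacing between columns does not matter, and the nonnumeric header text never reaches the numeric parser.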
