Re-synchronizing TEXTSCAN

The forum sees any number of questions about reading irregular or segmented text files. The usual response is "use textscan in a loop", but that approach has recurring problems of its own. A case of my own has just raised one in particular, which prompts the present query...
The file in question is tab-delimited: daily weather records with a header line for each day. It would be almost perfect fodder for readtable, except that the header line also carries the month and day over the time column; that is data, not header, and it breaks the builtin solution. A sample of the file is
Dec 15 Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum.
12:10 AM 12.5 °F 2 °F 64 % ESE 0 mph 3 mph 30.7 in 0 in 0 in
12:25 AM 12.1 °F 3 °F 66 % ESE 0 mph 1 mph 30.69 in 0 in 0 in
...
11:44 PM 27.5 °F 23 °F 84 % ESE 2 mph 9 mph 30 in 0 in 0 in
11:58 PM 27.5 °F 23 °F 84 % ESE 2 mph 7 mph 29.98 in 0 in 0 in
Dec 16 Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum.
12:13 AM 27.5 °F 24 °F 85 % SE 1 mph 5 mph 29.98 in 0 in 0 in
12:28 AM 27.8 °F 24 °F 86 % East 3 mph 5 mph 29.96 in 0 in 0 in
...
The following code snippet successfully reads the file and returns the numeric data in an array:
% fn, fmt, wdirs and dd (initially empty) are defined elsewhere and not shown here
nHdr=0; % fixup for header lines to skip/not...
fid=fopen(fn,'r');
l=fgetl(fid); % get first header line
C=textscan(l,'%s',1,'delimiter','\t'); % get month day string
while ~feof(fid)
  data=textscan(fid,fmt,'collectoutput',1,'headerlines',nHdr,'delimiter','\t'); % read body of data
  L2=size(data{2},1); % how many lines found in this group of numeric data?
  dn=datenum([repmat(['2016' char(C{:})],L2,1) char(data{1}(1:L2))],'yyyymmmdd HH:MM AM'); % convert times
  if nHdr==0, nHdr=1; end % kludge to re-synch the file marker after failure leaves in mid-record
  if size(data{1},1)>size(data{2},1) % another fixup to get rid of extra record of subsequent day
    C=([data{1}(end,:)]); % ok, is another group going to come, get the month/day
  end
  [~,ix]=ismember(data{3},wdirs); % this just converts the alpha wind dir to numeric for convenience
  wdir=360-(ix-1)*22.5;
  dd=[dd;[dn data{2} wdir data{4}]]; % and mush all numeric together in one long array
end
fid=fclose(fid);
As can be seen, a lot of fixup is needed. The above is only as "clean" as it is because the first column is read as a string and the time column is also a single string, so the first cell array holds an extra element whenever another block of data follows in the file. With any other data format, even this wouldn't work.
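For the wind-direction step in the loop, here is a minimal sketch of what the lookup might look like. The contents of wdirs are not shown in the post, so the ordering below is an assumption: 16 compass points listed counter-clockwise from north, which is what makes 360-(ix-1)*22.5 come out as a compass bearing. The obs list is made-up sample input.

```matlab
% Assumed lookup table (not from the original post): 16 compass points
% listed counter-clockwise from north, 22.5 degrees apart.
wdirs = {'N','NNW','NW','WNW','W','WSW','SW','SSW', ...
         'S','SSE','SE','ESE','E','ENE','NE','NNE'};

% Hypothetical observations; the sample file mixes abbreviations and 'East'.
obs = {'ESE'; 'SE'; 'East'; 'N'};

% Map spelled-out cardinal names to abbreviations before the lookup.
obs = regexprep(obs, {'^North$','^South$','^East$','^West$'}, {'N','S','E','W'});

[found, ix] = ismember(obs, wdirs);

% Leave unmatched entries as NaN instead of letting ix==0 produce 382.5.
wdir = nan(size(obs));
wdir(found) = 360 - (ix(found)-1)*22.5;
```

If wdirs holds only abbreviations, a spelled-out entry returns ix=0 from ismember, and the bare formula would silently turn that into 382.5; the found mask guards against that.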
So, with that as preamble, the question is:
Why is there no reliable way to "re-synch" textscan (and its friends that use file handles) to the beginning of a record? If there were, the above machinations, and the similar ones undertaken for so many of the other special cases we see at Answers, would become trivial: when the textscan operation fails, an instruction to reset the file position indicator to the beginning of the record would let the next loop iteration issue a "clean" repeat of the identical call. The obvious syntax would be something like that of fseek, but counting system-sensitive records instead of bytes.
In this case one can use fseek to go back some 6 or 8 bytes, but that is empirical because the number of characters in the field isn't consistent, whereas the I/O subsystem should be able to find the record terminator essentially trivially.
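Until something like that exists, a record-aware rewind can be approximated with ftell, fseek and fread. The helper below is a hypothetical sketch (the name rewindToRecordStart is made up, and it is not a built-in): it walks backward one byte at a time until the byte before the position is a line feed, or the start of the file is reached. It assumes records end in LF or CRLF.

```matlab
function pos = rewindToRecordStart(fid)
% Hypothetical helper: move the file position of fid back to the first
% byte of the record (line) containing the current position.
% Assumes records are terminated by LF (char 10), which also covers CRLF.
pos = ftell(fid);
while pos > 0
    fseek(fid, pos-1, 'bof');
    c = fread(fid, 1, 'uint8');   % the byte immediately before pos
    if c == 10                    % previous record ends here
        break
    end
    pos = pos - 1;
end
fseek(fid, pos, 'bof');           % leave the marker at the record start
end
```

Assuming the failed textscan leaves the marker somewhere inside the next day's header line, calling rewindToRecordStart(fid) followed by fgetl(fid) would return that whole header line, without guessing a byte count. The byte-by-byte scan is slow for long records but needs no knowledge of field widths.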

Answers (1)

Kirby Fears on 23 Dec 2016
Edited: dpb on 23 Dec 2016
dpb,
I made a generalized solution to this problem some time ago for the constant questions about parsing delimited text files. It uses textscan to read each row as a string, then splits each row at the given delimiter into a row of strings. If you request numeric or mixed output, the function attempts to convert each entry to a number and leaves unconverted any cell that cannot be interpreted as a number. This means the user does not need to specify a format string or even know the NxM dimensions of the file.
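The general idea can be sketched as follows. This is not the delimread source: the function name splitRowsSketch is hypothetical, it reads the file with fileread rather than textscan, and it treats the delimiter as a regular-expression pattern, so '\t' matches a tab.

```matlab
function out = splitRowsSketch(fname, delimPattern)
% Hypothetical sketch of the read-rows-then-split idea; not delimread itself.
% delimPattern is a regular expression, e.g. '\t' for tab or ',' for comma.
txt   = fileread(fname);
lines = regexp(txt, '\r?\n', 'split');        % one string per line
if ~isempty(lines) && isempty(lines{end})
    lines(end) = [];                          % drop piece after final newline
end
if isempty(lines)
    out = {};
    return
end
rows = cellfun(@(s) regexp(s, delimPattern, 'split'), lines, ...
               'UniformOutput', false);
ncol = max(cellfun(@numel, rows));
out  = cell(numel(rows), ncol);               % preallocate rows x widest row
for k = 1:numel(rows)
    r = rows{k};
    v = str2double(r);                        % NaN where not convertible
    for j = find(~isnan(v))
        r{j} = v(j);                          % keep numbers as doubles
    end
    out(k, 1:numel(r)) = r;                   % short rows leave [] padding
end
end
```

Fields such as '12.5 °F' stay as strings here because str2double returns NaN for them; a literal 'NaN' field also stays a string under this test.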
result = delimread('test.txt','\t',{'raw','mixed'});
In your example, none of the degree data is directly convertible to numbers; check out result.mixed when test.txt does contain mixed types. You can use document-specific reasoning to drop the rows with repeated headers afterward. Does this address all of the use cases you had in mind?
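A sketch of that kind of document-specific clean-up, on a small made-up cell array shaped like the weather file (only three columns, and the degree sign left out to keep the example ASCII). The header test assumes the first field of a header row looks like 'Dec 15'; sscanf picks up the leading number and ignores the unit text.

```matlab
% Hypothetical raw rows, already split into fields (not the real file).
raw = {'Dec 15',   'Temperature', 'Dew Point';
       '12:10 AM', '12.5 F',      '2 F';
       '12:25 AM', '12.1 F',      '3 F';
       'Dec 16',   'Temperature', 'Dew Point';
       '12:13 AM', '27.5 F',      '24 F'};

% A row is a repeated header if its first field looks like 'Mon dd'.
pat   = '^[A-Z][a-z]{2} \d{1,2}$';
isHdr = cellfun(@(s) ~isempty(regexp(s, pat, 'once')), raw(:,1));

% Carry each header's month/day down to the data rows that follow it.
blk      = cumsum(isHdr);            % block number for every row
labels   = raw(isHdr, 1);            % one label per block
dayOfRow = labels(blk(~isHdr));      % label for each data row

% Strip the units: read only the leading number of each field.
vals = cellfun(@(s) sscanf(s, '%f', 1), raw(~isHdr, 2:3));
```

Here dayOfRow pairs each data row with its date, which is the information the repeated header carries beyond the column names.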

7 Comments

I don't know that I could possibly answer the question of whether it could resolve "all" cases; don't rightfully think that possible, probably, but maybe "magic happens"... :)
But it apparently is revision-specific; I'm still at R2012b owing to restricted platform issues.
>> out=delimread('dec2016.dat','\t',{'num','text'});
Undefined function 'strsplit' for input arguments of type 'cell'.
Error in delimread/@(c)strsplit(c,delim,'CollapseDelimiters',false)'
Error in delimread (line 168)
outdata=cellfun(@(c)strsplit(c,delim,'CollapseDelimiters',false)',...
>>
I've not tried to see what it would take to get it to run under R2012b, nor have I read the code extensively. What do you do about allocating the output? It seems it might be costly to build arrays of unknown size while parsing line by line, maybe?
Kirby Fears on 23 Dec 2016
Edited: Kirby Fears on 23 Dec 2016
Note that I only asked if it addresses all of your use cases.
I'm sorry to hear that strsplit was not available in 2012b. I've only tested back to 2015a. You could possibly replace any 2012b-incompatible functions one at a time to see if it can be adapted to your version.
The output is preallocated based on the number of rows in the file and maximum number of columns used by any row. It runs quickly.
dpb on 24 Dec 2016
Well, can't really tell, use cases arise every time there's a new file format as to whether it would/wouldn't handle it... :)
Being the holiday weekend, and given that the above was "just playing" with data from the home weather station over a period of really, really unusual temperature and wind swings, it's not likely I'll be spending any great deal of time over the weekend testing/converting... :)
Mayhaps when the grandkids are gone away I might look into the function in some more depth.
Nothing meant as being denigrative at all...looks like a clever workaround that should solve many of the issues if not all...
dpb on 27 Dec 2016
Edited: dpb on 27 Dec 2016
"..strsplit was not available in 2012b."
Just for starters, to try to make any conversion a little less effort, what does the return from strsplit at that location look like, Kirby? I already have a home-rolled utility that returns fields from a string with a given delimiter, but for a quickie purpose it simply uses a character array to build the output into... perhaps just cellstr on it might be a kludge, depending on what strsplit returns for your function...
dpb,
Here's an example of strsplit.
strsplit('col1,col2,col3',',')
ans =
'col1' 'col2' 'col3'
The result is a 1 by 3 cell array, each containing a char array.
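For releases without strsplit, regexp with the 'split' option returns the same kind of 1-by-N cell array and does not collapse consecutive delimiters. A minimal sketch, assuming the delimiter is meant literally (hence regexptranslate to escape it); the variable names are illustrative only.

```matlab
% Stand-in for strsplit(str, delim, 'CollapseDelimiters', false).
% delim is taken literally; for a tab, pass sprintf('\t'), not '\t'.
delim = ',';
pat   = regexptranslate('escape', delim);

parts = regexp('col1,col2,col3', pat, 'split');   % 1-by-3 cell of strings
gaps  = regexp('a,,b', pat, 'split');             % empty middle field kept
```

The second call keeps the empty field between the two commas, so the column count of a row is preserved.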
Thanks, Kirby. Over the holidays I did download R2014b which will run (albeit fairly slowly but far better than I had expected) on this old hardware to which I'm currently limited so can 'spearmint from there.
I still think having a way to get textread back on track would be aGoodThing (tm) as a general facility.
Kirby Fears on 5 Jan 2017
Edited: Kirby Fears on 5 Jan 2017
I totally agree that a new utility (like revamped textread) should be included in future releases. The existing functions don't combine the 3 required aspects very well: (1) easy API, (2) capable of reading wide variety of quirky delimited files, and (3) fast.
textscan is fully flexible and fast, but the API is opaque for most users. It also tends to require post-read manipulations. readtable is easy and fast but not as flexible as textscan. Most users I work with choose xlsread to get (1) and (2) at the expense of speed. The delimread function is much faster than xlsread with similar usability and flexibility.
The delimread parsing logic could be used in a C/MEX-based function to make it faster while providing an easy API to read virtually any delimited file. Certainly aGoodThing (tm) for future MATLAB releases.

Asked: dpb on 23 Dec 2016
Edited: on 5 Jan 2017
