Strings from a text file to a matrix containing double precision floating numbers

Hi
I have a text file containing a text header, and rows containing numeric values, with varying numbers of values, characters and numeric formats:
# Bundle file v0.3
9 2532
6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002
9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001
194 144 45
5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000 163.0000 4 2122 173.0000 142.0000 0 911 148.5000 165.5000
2.4321163035e+000 -9.1469082482e-001 -6.6122261943e+000
219 194 76
I want to remove the header and store the numeric values in a matrix, with the shorter rows padded with NaNs to make up for the differing row lengths. At present, I am using this code:
% open file and save contents to cell array, c
fid = fopen('C:\transform\bundle.out','r');
c = textscan(fid,'%s','delimiter', '','whitespace','');
fclose(fid);
%create m x 1 cell C and remove the header
C = c{1};
C(1,:)=[];
% convert C to a matrix using cell2mat / cellfun
maxLength=max(cellfun(@(x)numel(x),C));
out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),C,'UniformOutput',false));
The problem with this approach is that it creates a character array in which each row is a single string, so I cannot use str2num or str2double to convert the values to individual doubles (they return [] / NaN because the row as a whole is not recognised as a number). That is, it produces:
'9 2532 ';
'6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002 ';
'9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001';
rather than:
'9' '2532';
'6.8302313857e+002' '-1.4826175815e-001' '8.1715222947e-002';
'9.3709731863e-001' '-2.8772865743e-001' '-1.9763814183e-001';
I can work around this by separating each row into a row vector (e.g. out1,...,outn), then using:
splitstring = textscan(out1,'%s');
splitstring = splitstring{1};
I can then use str2double and flipdim or similar to return rows of doubles, and use vertcat and pad with NaNs to get the desired matrix, but this makes for very unwieldy code. Can anyone suggest a simpler way of getting the desired output? Any suggestions would be appreciated.
Thomas

Accepted Answer

Thomas Seers on 16 Jan 2013
I have worked out the answer for those with a similar problem:
I use textscan and cellfun to split the strings, un-nest and rearrange the output using vertcat and cellfun/transpose, then convert the individual strings to doubles using cellfun/str2double:
fid = fopen('C:\transform\bundle.out','r');
c = textscan(fid,'%s','delimiter', '','whitespace','', 'HeaderLines', 1);
fclose(fid);
C = c{1};
C = cellfun(@(x) textscan(x,'%s','Delimiter', ' ')',C ,'UniformOutput',false);
Y = vertcat(C{:});
X = cellfun(@transpose,Y,'UniformOutput',false);
Z = cellfun(@str2double,X,'UniformOutput',false);
The padded matrix can then be built with cellfun/cell2mat, using the maximum row length (maxLength):
maxLength=max(cellfun(@(x)numel(x),Z));
out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),Z,'UniformOutput',false));
Note that this code pads the rows with zeros rather than NaNs.
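To pad with NaNs instead, as the question asked, nan can take the place of zeros in the padding step. The following is only a minimal sketch: the cell array Z is made-up sample data standing in for the Z produced above, and padRow is a helper name introduced here for illustration.

```matlab
% Z stands in for the cell array of numeric row vectors built above
% (sample values, assumed for illustration only).
Z = {[9 2532]; [683.02 -0.148 0.0817]; [194 144 45 7]};

% Length of the longest row.
maxLength = max(cellfun(@numel, Z));

% padRow (hypothetical helper) appends NaNs so every row has maxLength entries.
padRow = @(x) [x, nan(1, maxLength - numel(x))];

% Pad each row, then stack the equal-length rows into one matrix.
out = cell2mat(cellfun(padRow, Z, 'UniformOutput', false));
```

Because every padded row has the same length, cell2mat can stack them without further reshaping.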

More Answers (2)

per isakson on 15 Jan 2013
Edited: per isakson on 17 Jan 2013
If the file isn't huge (compared to available RAM and address space) and you have an idea of the maximum number of columns and rows, then I guess the simplest way is to loop over all rows.
M = nan( nrow, ncol ); % allocate memory
fid = fopen( ... );
str = fgetl( fid ); % header line
row = 0;
while not( feof( fid ) )
    row = row + 1;
    str = fgetl( fid );
    val = sscanf( str, '%f' ); % parse all numbers on this line
    M( row, 1:numel(val) ) = val;
end
fclose( fid );
And trim M. Something like this.
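A minimal sketch of the fill-and-trim step on made-up sizes (the 4-by-6 allocation and the second sample row are assumptions for illustration; the first row is a line from the question):

```matlab
% Deliberately over-allocate: more rows and columns than the data needs.
M = nan(4, 6);

% sscanf returns a column vector; MATLAB accepts it on the right-hand side
% of a row-slice assignment as long as the number of elements matches.
str = '6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002';
val = sscanf(str, '%f');
M(1, 1:numel(val)) = val;

% A second, shorter row.
M(2, 1:2) = [9 2532];

% Trim: drop columns, then rows, that contain only NaN.
M(:, all(isnan(M), 1)) = [];
M(all(isnan(M), 2), :) = [];
```

After trimming, M is 2-by-3 and the unused third entry of the short row is still NaN. One caveat: this would also drop a genuinely empty line in the middle of the data.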
.
[Edit: 2013-01-16]
Working code
Here is a comparison between three solutions. The first two, cssm and cssm1, follow my outline above. The last, OP, is the one proposed by the OP. I ran this script a few times.
%%read ragged text file
clc
tic, M1 = cssm; toc
tic, M2 = cssm1( 10000, 100 ); toc
tic, M3 = cssm1( 100000, 1000 ); toc
tic, M4 = OP(); toc
which returns
Elapsed time is 0.238691 seconds.
Elapsed time is 0.131869 seconds.
Elapsed time is 0.960397 seconds.
Elapsed time is 0.709025 seconds.
The output is
>> whos
Name Size Bytes Class Attributes
M1 2464x21 413952 double
M2 2464x21 413952 double
M3 2464x21 413952 double
M4 2464x21 413952 double
.
In cssm.m the required numbers of rows and columns are determined in two separate steps, each of which reads the file. Thus the function cssm reads the file three times.
With cssm1 the number of rows and columns are guessed. In one case the "guesses" are 4x the actual size and in the other 40x.
The function, OP, is OP's code made into a function and ZEROS replaced by NAN to honor the question.
With about 2500 rows, cssm is three times faster than the loop-free code (OP). cssm1 is five times faster than OP when allocating 4x4 times more memory than needed, and a bit slower than the loop-free code when allocating 40x40 times more memory.
Conclusions:
  • Loops are not always slow
  • Reading from the file cache is fast.
  • Code with loops is often easier to write and understand (IMO).
  • Don't hesitate to use the RAM if it is available
.
The files involved are
function M = cssm()
    fid = fopen( 'cssm.txt' );
    cup = onCleanup( @() fclose( fid ) );
    cac = textscan( fid, '%s', 'Delimiter', '\n', 'HeaderLines', 1 );
    nrow = numel( cac{:} );
    clear cup
    fid = fopen( 'cssm.txt' );
    cup = onCleanup( @() fclose( fid ) );
    [~] = fgetl( fid ); % header line
    ncol = 0;
    while not( feof( fid ) )
        ncol = max( ncol, numel( sscanf( fgetl(fid), '%f' ) ) );
    end
    clear cup
    M = cssm_( nrow, ncol );
end
function M = cssm_( nrow, ncol )
    M = nan( nrow, ncol ); % allocate memory
    fid = fopen( 'cssm.txt' );
    cup = onCleanup( @() fclose( fid ) );
    [~] = fgetl( fid ); % header line
    row = 0;
    while not( feof( fid ) )
        row = row + 1;
        val = sscanf( fgetl(fid), '%f' );
        M( row, 1:numel(val) ) = val;
    end
end
and
function M = cssm1( nrow, ncol )
    M = nan( nrow, ncol ); % allocate memory
    fid = fopen( 'cssm.txt' );
    cup = onCleanup( @() fclose( fid ) );
    [~] = fgetl( fid ); % header line
    row = 0;
    while not( feof( fid ) )
        row = row + 1;
        val = sscanf( fgetl(fid), '%f' );
        M( row, 1:numel(val) ) = val;
    end
    M( :, all( isnan( M ), 1 ) ) = []; % drop all-NaN columns
    M( all( isnan( M ), 2 ), : ) = []; % drop all-NaN rows
end
The text file, cssm.txt, contains 2465 lines: repetitions of OP's data.
  2 Comments
Thomas Seers on 16 Jan 2013
Thanks for your response
Unfortunately, the number of rows is unknown, as is the number of variables and characters in each row (i.e. the example in the original question). A for loop may work, though acting on the cell array might be more RAM friendly. I'll have a look at a possible solution.
per isakson on 16 Jan 2013
Edited: per isakson on 17 Jan 2013
I have added working code above to illustrate the approach I proposed.
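When neither the number of rows nor the number of columns is known in advance, one possible variation is to read all lines once, parse each with sscanf, and allocate only after the sizes are known. The sketch below is an assumption-laden illustration rather than the code timed above: it writes its own small temporary sample file, and the names fname, txt and rows are introduced here for the example.

```matlab
% Write a small ragged sample file (hypothetical temporary file for the demo).
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# Bundle file v0.3\n9 2532\n6.83e+002 -1.48e-001 8.17e-002\n194 144 45 7\n');
fclose(fid);

% Read every line after the header as one string per line.
fid = fopen(fname, 'r');
c = textscan(fid, '%s', 'Delimiter', '\n', 'HeaderLines', 1);
fclose(fid);
txt = c{1};

% Parse each line into a row vector of doubles.
rows = cellfun(@(s) sscanf(s, '%f').', txt, 'UniformOutput', false);

% Sizes are now known, so allocate exactly and fill; unused entries stay NaN.
ncol = max(cellfun(@numel, rows));
M = nan(numel(rows), ncol);
for k = 1:numel(rows)
    M(k, 1:numel(rows{k})) = rows{k};
end

delete(fname); % remove the temporary sample file
```

This keeps the whole file in memory as a cell array of strings, so it trades some RAM for not having to guess the dimensions or read the file more than once.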



Ryan Livingston on 15 Jan 2013
I will think more about the harder question of formatting the numeric data, but you could use the properties 'CommentStyle' and/or 'HeaderLines' to skip your header.
Missing numeric fields are indeed padded with NaNs by default so doing:
a = textscan(fid, '%f %f %f\n',1,'HeaderLines',1)
returns:
a =
[9] [2532] [NaN]
This is controlled by the property 'EmptyValue'. Getting the right format string and properties will do all of the padding for you.
Could you elaborate on the desired format of the output array? Are you viewing the text file as a matrix and you would like the dimensions to be number_of_lines - by - max_number_of_values (8 -by- 16 in this example) or something else?
  1 Comment
Thomas Seers on 16 Jan 2013
Hi
The desired output would be number of rows (unknown) by maximum number of values:
[9 2532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN;
6.83e+002 -1.48e-001 8.17e-002 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
9.37e-001 -2.87e-001 -1.97e-001 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
194 144 45 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000]
In this illustration the maximum number of values per row is 12; rows with fewer values are padded with NaN.
