How to extract column data from a csv file in vectors

6 views (last 30 days)
Hello, I have csv files, here the data are similar to the extract of a file attached as an example. I want to extract the data of some of the columns in vectors and if there are empty values to be completed by NaN. I use the following code, but it needs modification to do what I need. How should I modify the code? thanks-k.
fileID = fopen('EXTRAC1985.csv')
C=textscan(fileID,'%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s','delimiter',',','EmptyValue',NaN);
fclose(fileID)
dat1= C{1}(3:end)
dat2= C{10}(3:end)
the output is like: dat2 =
'9'
''
'9'
''
'9'
''
'9'
''
'10' etc

Accepted Answer

Guillaume
Guillaume on 18 Sep 2014
There are two problems:
  1. There are 25 columns in your file and you've only specified 18 in your textscan. You need to read all 25 columns otherwise textscan gets confused.
  2. EmptyValue does not apply to string fields which is what you've specified for all columns with %s. Note that you don't need 'EmptyValue', NaN since it's the default anyway.
You can either:
  • skip the file header and modify your textscan to read the data with the proper format:
fileID = fopen('EXTRAC1985.csv');
fgetl(fileID); %skip first line
C=textscan(fileID,'%s %s %f %f %s %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f','delimiter',','); %use %f to specify numbers. Note that there are 25 %, not 18.
fclose(fid);
dat1 = C{:, 2}; dat2 = c{:, 10};
  • keep as is and use str2double to change the read data from string to number. str2double will replace empty strings with NaNs:
fileID = fopen('EXTRAC1985.csv')
C=textscan(fileID,'%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s','delimiter',','); % 25 %s
fclose(fid);
dat2 = str2double(c{10}(2:end));

More Answers (1)

dpb
dpb on 18 Sep 2014
Edited: dpb on 18 Sep 2014
Don't say which columns are of interest, but you'll have much more success if you use format string for the types of data by column and match the number of fields to the number of columns in the file...there are a number of empty columns at the right it appears, but ignoring them in the format string causes the extra blank lines you've gotten.
>> fmt=['%s %s %f %f %s' repmat('%f',1,18)];
>> fid=fopen('extrac1985.csv');
>> c=textscan(fid,fmt,'delimiter',',','headerlines',1,'collectoutput',1);
>> whos c
Name Size Bytes Class Attributes
c 1x4 18192 cell
>> fid=fclose(fid);
textscan puts the sets of text and numeric together so the first column cell array is the first two text columns then the numeric id, the text and finally the remaining numeric array.
To select columns in the end, either--
a) read whole as above and then just set unwanted columns to [], or
b) modify the format string and skip unwanted columns by use of the '%*s' or '%*f' as appropriate in the format string to not return unwanted columns.
doc textscan
for more detail; look at the link to 'format options' for the gory details.
ADDENDUM
OK, to skip some columns I had a few minutes...get rid of the last number of empty columns and a few that are identical thruout--
>> fmt=['%s %*s %f %f %*s' ...
repmat('%*f',1,3) repmat('%f',1,10) repmat('%*f',1,5)];
>> c=textscan(fid,fmt,'delimiter',',','headerlines',1,'collectoutput',1);
>> whos c
Name Size Bytes Class Attributes
c 1x2 8600 cell
>> c{1}(1:10)
ans =
'01/01/1985 0:00'
'01/01/1985 1:00'
'01/01/1985 2:00'
'01/01/1985 3:00'
'01/01/1985 4:00'
'01/01/1985 5:00'
'01/01/1985 6:00'
'01/01/1985 7:00'
'01/01/1985 8:00'
'01/01/1985 9:00'
>> c{2}(1:10,:)
ans =
Columns 1 through 9
892702 1 360 9 NaN 17 NaN 9 9
892702 1 360 9 NaN 16 NaN 9 9
892702 1 360 9 NaN 14 NaN 9 9
892702 1 350 9 NaN 15 NaN 9 9
892702 1 360 10 NaN 15 NaN 9 9
892702 1 360 10 NaN 16 NaN 9 9
892702 1 10 9 NaN 16 NaN 9 9
892702 1 360 10 NaN 15 NaN 9 9
892702 1 360 9 NaN 15 NaN 9 9
892702 1 360 9 NaN 16 NaN 9 9
Columns 10 through 12
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
0 9 0
>>
  2 Comments
Katerina F
Katerina F on 19 Sep 2014
Hi, I am interested in keeping the columns A D F H J N. (I do not understand how the selection with the repmat works in this case, that is what you are doing with the line with the repmat, i.e. for example what the 1 and the 5 do in repmat('%*f',1,5)) etc. thanks,k
Note that I do not know the lenght of the data in the files (this file is just an example).
dpb
dpb on 19 Sep 2014
Edited: dpb on 19 Sep 2014
The repmat is simply replacing the explicit typing of N copies of the string with a counted number..."1,5" in this case is 1-row, 5 columns. See
doc repmat % for details
and/or simply try it at the command line and see what the result is.
...I do not know the length of the data in the files...
You mean rows or columns by "length" in this context? Number of rows is immaterial, the number of columns is more significant as the format string used controls the arrangement in the return cell array--having that mismatch caused all the extra blanks in your sample case you posted.
Mohammad's suggestion with the new table data structure may be convenient if you're using a recent-enough release (R2013+ I think; I am at R2012b here so don't have the full facility to see what it does with all the trailing empty columns).
It will, as far as I know, read the full array but it would be relatively easy as noted to select the columns wanted from that. OTOH, if you know the specific columns wanted, using the '%*' form of the format string for the unwanted columns will remove them from consideration at the git-go. You do need to know the actual number of columns total, however. If worst comes to worst, you can always read in the first line and count commas to ascertain that value.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!