You are now following this question
- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.
I'm trying to organize data so it can easily be averaged by date
1 view (last 30 days)
Show older comments
I have a cell of data where the first column is the full date. I'm being asked to organize it in a matrix so that if something like H(1,2,:), it would return all the data from the second day of January. The dates are in the format yyyymmddHHMM, if that helps. I'm kind of just looking for guidance on how to achieve something like this. Any help is appreciated
2 Comments
Susan Santiago
on 13 Oct 2018
I uploaded my workspace because there are many files and they're all in .dat which can't be uploaded here. And I think that is probably more clear anyway. The cell I'm concerned with is named C. Each row of C represents on data file.
Accepted Answer
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
"...something like H(1,2,:), it would return all the data from the second day of January."
Not very good in my opinion. How do you deal with the fact that different months have different number of days? By padding with NaNs?
It is much easier to put all your data in a timetable. You can then easily access specific days.
t = datetime(2000,1,1):days(1):datetime(2001,1,1);
TT = timetable(t,zeros(length(t),1))
You want to access data for a specific date? Easy:
TT('2001-1-1',:)
ans =
timetable
Time Var1
___________ ____
01-Jan-2001 0
32 Comments
Susan Santiago
on 13 Oct 2018
My professor specifically asked that it be done the way I specified so if possible, I'd like to do it in that way
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
Perhaps your professor has never heard about timetables. It is particularily awkward to put your data in the third dimension, making it much harder to access. Also, you have like 100 variables, how do you want to keep track of them?
What does these variables even mean? "TimeStampstart" "TimeStampEnd". Can you clean up your workspace and attacha new .mat file with explanations?
Susan Santiago
on 13 Oct 2018
It's the date, as I mentioned in my original question, it's the first column that I'm using for the time. I've never used timetables before but the main purpose for this is pretty much going to be for plotting like monthly averages for the different variables. Is timetable a good idea for that? And if so, any tips for making this cell into that format?
jonas
on 13 Oct 2018
Yes, timetable is ideal for that. Much better than what your prof. suggested. I'll help you with a nice timetable and some example plots, then you can show that to your professor. I'm sure he will appreciate you finding a better way to solve the problem, or at least he should.
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
% Convert time to datetime
t = datetime(num2str(FluxAmeriFluxFormat3{:,1}),'inputformat','yyyyMMddHHmm');
% Put data in timetable
TT = timetable(t,FluxAmeriFluxFormat3(:,3:end))
TT = splitvars(TT);
% Remove columns with categorical data
TT(:,varfun(@iscategorical,TT,'outputformat','uniform')) = [];
% Calculate monthly values
TTmonthly = retime(TT,'monthly','mean')
This gives you the monthly average for all variables. You only gave me one month worth of values, so I the output is just a single row. You can do lots of things with a timetable, interpolation and resampling is just one thing. It's also extremely easy to access the data, because the columns can be called by their variable names. For example
TT.CO2
will give you all values of CO2 and
TT('28-Aug-2018 00:30:00',:)
gives you all variables at that point in time. You want to plot all variables?
plot(TT.t,TT.Variables)
easy as that... and your data is already in datetime format so no need to format the axes.
Susan Santiago
on 13 Oct 2018
This is great but I wanted to do it for all the data, not just from one file. Any way to change this so it works for all the data in cell C? That's where all the data is. In the cells of the first column
Susan Santiago
on 13 Oct 2018
Also is there a way to put out the data like in the example I gave in my original question? Like what would be the indexing be if I wanted to just output all the data from one specific day?
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
"Also is there a way to put out the data like in the example I gave in my original question?"
I'm not going to code that, because it is (in my opinion) a poor solution to a simple problem.
"Like what would be the indexing be if I wanted to just output all the data from one specific day?"
That would be simple with a timetable, and extremely annoying with your professors methods, the reason being that the time is not given in days, but in 30-minute intervals. In datetime you can calculate the daily average in a second and access a specific date.
"This is great but I wanted to do it for all the data, not just from one file. Any way to change this so it works for all the data in cell C? That's where all the data is. In the cells of the first column"
That is easy. However, there is one problem, apart from the data being organized in a really awkward way. It is impossible to concatenate the cells to a single array because there is missing data. Take a look at this:
C=
41×3 cell array
{ 71×14 double} { 71×29 double} { 71×8 double}
{332×14 double} {332×29 double} {332×8 double}
{ 59×14 double} { 59×29 double} { 58×8 double}
↑↑ ↑↑ ↑↑
The cell at position (3;3) is missing one value. From this cell array, it is impossible to determine which value that is missing (i.e. the time that no value was recorded). How do you want to deal with the missing data?
After concatenating all cells vertically, I obtained the following sizes: 7892x14, 7880x29 and 7879x8. That means that 12 values are missing from the second set of columns and 13 values from the third set of columns, with no way of knowing the exact row where those values are missing.
For the reason mentioned above, it is simply impossible to concatenate a full array/table/whatever, because of missing values. If you were to solve that somehow, then you would do the following to create a timetable:
idc = cellfun(@(x)strcmp(class(x),'cell'),C(1,:))
C(:,idc) = [];
Ca=vertcat(C{:,1});
% Convert time to datetime
t = datetime(num2str(Ca(:,1)),'inputformat','yyyyMMddHHmm');
% Put data in timetable
TT = timetable(t,Ca)
TT = splitvars(TT);
% Calculate monthly values
TTmonthly = retime(TT,'monthly','mean')
I have only used the 14 first variables here, where there is no missing data.
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
One hint I can give you is that you will almost get the same number of rows on all variables if you delete each row with NaNs. I think the rows only differ by 1-2 in the end if you assume all rows with NaN should be deleted.
"yes, I do have raw data. I didn't notice this before but there's some data missing in the files, I don't know I think I should just do it the way my professor wanted."
Good luck with that. I guarantee that you will run into the exact same problem with missing data. An array must be be rectangular/cubic, it cannot have holes, just like a table.
Susan Santiago
on 13 Oct 2018
There are some pieces of data that are missing and show up as NaN when I import the data but I guess the NaN isn't transferred over when it's converted to a cell. I don't know, I think it might just be better to just try to do it the way my professor wants it
jonas
on 13 Oct 2018
Edited: jonas
on 13 Oct 2018
Yes I noticed that, as I wrote in the above comment. However, even if you remove all rows that are full of nans, the dimensions do not line up. Note that doing it "the way your professor wants it" is not going to solve the problem with missing data. I have given you a functional code. I guarantee that no one else will be able to solve this problem in another way, because there is simply missing information.
By the way, do you have a raw file like a .txt or .xls? I bet I would solve this if I could import the data myself.
Susan Santiago
on 14 Oct 2018
This is the first three files. I can add more if you want but I just didn't want it to be overwhelming I guess. Thank you for continuing to try to help me figure this out. I'm kind of freaking out over this haha
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
I will stick around until its resolved, if you want to use this approach. However, even if you want to revert to your other approach later, it would probably be wise to structure the data in a timetable first, so that you can calculate daily averages. I suggest importing the files with readtable, because you get everything in a single variable instead of a cell array.
Anyway, those files you gave me all have 55 columns, so they are easy to deal with. Do you have some more complicated files? In the first .mat file, there were multiple cells with different number of columns. Of course, this could also be due my import options... Are all files expected to have the same number of columns?
Will get some sleep now, but will check out this thread tomorrow.
Susan Santiago
on 14 Oct 2018
I have 42 files in total. They all have the same number of columns. The cell C have 5 columns but each of the cells in those columns have the same number of columns in them and they should all add up to 55 as well. They're separated because I imported them using collectoutput and some of the columns are full of NaNs. I hope this makes sense. This is the code I used to create the cell if it helps.
a = dir;
i=3;
n = 1;
while i<=43
d = a(i).name;
A = textscan(fopen(d),'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %s %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %s %f %f %f %f %f %f %f %f', 'Delimiter',',','Headerlines',1,'CollectOutput',true);
C(n,:) = A(1,:);
n = n+1;
i=i+1;
fclose('all');
end
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
This is much better, no annoying missing data. I hope it works for your MATLAB release. Otherwise we have to adjust the code (readtable is constantly updating).
% Write variable names and number of files
VarNames = sprintfc('Var%g',1:55);
FileID = dir('*.dat')
% Write files and store in table
T = table;
for j=1:length(FileID)
% Read file
opts = detectImportOptions(FileID(j).name);
Ts = readtable(FileID(j).name,opts,'ReadVariableNames',false);
% Set variable names
Ts.Properties.VariableNames=VarNames;
% Remove columns full of 'NAN'
idc = varfun(@(x)strcmp(class(x),'double'),Ts,'outputformat','uniform');
Ts(:,~idc)=[];
% Concatenate, (preallocate for speed)
T = [T;Ts]
end
% Convert time to datetime
t = datetime(num2str(T{:,1}),'inputformat','yyyyMMddHHmm');
% Put data in timetable
TT = timetable(t,T(:,2:end));
TT = splitvars(TT);
TT = sortrows(TT);
% Calculate monthly values
TTmonthly = retime(TT,'monthly','mean');
Susan Santiago
on 14 Oct 2018
Edited: Susan Santiago
on 14 Oct 2018
it's not working, I think it might be because I only sent you the first three files that you logically assumed all the files were named the same thing but that's actually not true, I should have made that clear, my mistake. That was why I used dir. They're named mostly the same thing but the last number changes from 0 up to 7. I should add that it works fine for the first three files but then it stops when it tries to get to _0(4) since that file doesn't exist.
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
Oh ok, its not too hard to fix that, you already did it when you made your first import. This should do it for you, replacing the first part of the script:
% Write variable names and number of files
VarNames = sprintfc('Var%g',1:55);
FileID = dir('*.dat')
% Write files and store in table
T = table;
for j=1:length(FileID)
% Read file
opts = detectImportOptions(FileID(j).name);
Ts = readtable(FileID(j).name,opts,'ReadVariableNames',false);
...
This is assuming all .dat files in your current dir are files that you want to import. I've updated the previous post. I'd be interested to know if you manage to compile all of the files. If not, then you could send me some of the more problematic files.
Susan Santiago
on 14 Oct 2018
I'm attaching all the files just to help you get a better idea of what's going on. I think there's still an issue. I don't know why but all the dates say April in the TT which definitely shouldn't be the case. Thanks so much
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
Ops, my bad. I've updated the code. It seems to be working now. Finally! :)
![](https://www.mathworks.com/matlabcentral/answers/uploaded_files/193166/image.jpeg)
Just add the appropriate variable names and you should be golden. Now, if you really want to revert to the "other format", then I'd first calculate the daily average.
TTdaily = retime(TT,'daily','mean');
then it should be fairly simple to reshape the table to an array. But as I've said a few times, in my honest opinion this timetable format is much much superior as you retain information about variable names and it's much easier to perform calculations and plot in timetable format.
Susan Santiago
on 14 Oct 2018
Edited: Susan Santiago
on 14 Oct 2018
Thank you so so much! Just a few more things. What is splitvars doing? when I tied running the code, it said it was undefined? I'm using Matlab R2017b idk if it's different for that version? Also could you just help me out with indexing a little bit like what would I write if I wanted to plot the average value for one variable over a month, in general? It seems like the dates are still out of order? I see that sortrows is used so idk why that is but when I run it, the dates are all over the place
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
Splitvars was introduced in 2018a. When a table is used as input to timetable it causes all variables to be concatenated in the same column. Splitvars simply splits the variables in separate columns. Sortrows is much older and should sort the table by dates in ascending order.. I will show you some indexing when your table is complete!
See attached m-file for new code. I have used table2timetable, introduced in 2016b, instead of splitvars. The variable TT should contain a sorted table having about 10k rows.
For monthly values you simple calculate the monthly average,
TTmonthly = retime(TT,'monthly','mean');
and then plot your desired variable
plot(TTmonthly.Properties.Rowtimes,TTmonthly.VariableName)
just replace "VariableName" with the actual variable names. You can set your own variable names by
TT.Properties.VariableNames = {var1,var2,var3...var51}
I opted to not import the variable names directly, because only some files had variable names as header.
Susan Santiago
on 14 Oct 2018
so I renamed the variables by their actual names but in the line "TT = table2timetable(T,'RowTimes',T.Var1);" I don't know what to change the T.Var1 to because it says it doesn't recognize the variable. And I know you've said this isn't the best way to do it but could you show me how to turn this into the array because while this is great I don't think it's really what my professor was looking for. I really appreciate all the help
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
What goes here
TT = table2timetable(T,'RowTimes',T.Var1)
↑
is the variable name of the column where the times are stored. Look at your table and replace the name with whatever it is called (probably t or time or something else you have named it). You can print the first row of your table and look what its called. Here is mine:
>> T(1,1)
ans =
table
Var1
____________________
02-Apr-2018 11:30:00
See that it says Var1 on top of the column? This is the name of the column.
Well, I will give it a try after you have:
- produced the correct table (becauase you are going to need it for the next step) and
- accepted the answer (because quite honestly the question was answered yesterday and now I'm just teaching you basic syntax).
Susan Santiago
on 14 Oct 2018
sorry, I completely forgot about accepting the answer. I renamed it T.TIMESTAMP_START since that's what I called the variable and I received this error "Error using table2timetable (line 74) Row times must be specified as variable name or index, or as datetime or duration vector."
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
You cannot name it "T.TIMESTAMP_START". Any column name must be a valid variable name and cannot contain the character ".". The name is actually "TIMESTAMP_START" without the T. When you have finally figured out how to make the table, then add these lines of code to get your weirdo matrix.
TTdaily = retime(TT,'daily','mean');
A = nan(12,365,50);
d = day(TTdaily.Time);
m = month(TTdaily.Time)
for i=1:numel(m);
A(m(i),d(i),:) = TTdaily{i,2:end};
end
A reference to
A(9,1,:)
gives you all variables (daily averages) at the 1st of september. This code will only work if all data is obtained within a single year. However, if the data is spanning several years, then this indexing method is borderline lunatic.
I will reply a bit less frequently now. I believe you have all the tools necessary to solve this on your own, so try yourself and ask if it doesn't work out, but don't expect instant reply.
Good luck!
PS: I've attached a .mat file with the organized data, so you have something to deliver if you cannot figure out how to generate it on your own machine ;)
Susan Santiago
on 14 Oct 2018
I should have been more clear, I named the variable "TIMESTAMP_START" so I put in TT = table2timetable(T,'RowTimes',T.TIMESTAMP_START) and that was what I had put in when I got the error that I reproduced here. I just want to be clear that I 100% understand why you prefer the table and I think it's a lot more clear but I just don't think it was what was supposed to do. You have really went above and beyond helping me with this problem and I am very grateful
jonas
on 14 Oct 2018
Edited: jonas
on 14 Oct 2018
Weird, did you double check that the variable name is correct? Just type T(1,1) after Matlab crash. Paste the output here and Ill check tomorrow.
Thats fine, if you have to delivery that then thats what you should do. Im just trying to teach good coding in general. I prefer to keep the timrtable in the main answer because others who find this thread in the future are probably better off opting for that approach.
Susan Santiago
on 15 Oct 2018
Edited: Susan Santiago
on 15 Oct 2018
Thanks! One last thing, is there any way to change my weirdo code so it's not just giving a daily average but all the results from the day? One of the main uses with this matrix is gonna be plotting the data. Thanks again. And if you don't mind, how would I get just one variable from matrix?
jonas
on 15 Oct 2018
Edited: jonas
on 15 Oct 2018
1. Yes and no. You have data every half hour if I remember correctly. You could make a fourth dimension of the matrix, and enter the "hour of day". Still, it would not work because you have half hours. You could make a fifth dimension called "minute of day"... you realize how absurd this method is becoming, especially since most 98% of minutes would be NaNs.
2. All variables are stored in the third dimension:
A(1,1,5)
outputs the "fifth" variable form the first of January. What is the fifth variable? You would have to compare with some kind of table every time you want to extract one variable.
I fully understand that you want to comply with your professors instructions. However, if you show him these two methods and explain the advantages of using tables (indexing by variable names, options for interpolation, easier access to specific dates, possibility of storing different classes, easier to plot as well as a variety of table-specific options that we have not even talked about) he/she would be crazy stubborn to opt for the array.
More Answers (1)
Peter Perkins
on 17 Oct 2018
"return all the data from the second day of January"
Imagine having this timetable:
>> tt = array2timetable(rand(100,2),'RowTimes',datetime(2018,1,1,0:8:792,0,0));
>> head(tt)
ans =
8×2 timetable
Time Var1 Var2
____________________ _______ ________
01-Jan-2018 00:00:00 0.85071 0.55903
01-Jan-2018 08:00:00 0.56056 0.8541
01-Jan-2018 16:00:00 0.92961 0.34788
02-Jan-2018 00:00:00 0.69667 0.44603
02-Jan-2018 08:00:00 0.58279 0.054239
02-Jan-2018 16:00:00 0.8154 0.17711
03-Jan-2018 00:00:00 0.87901 0.66281
03-Jan-2018 08:00:00 0.98891 0.33083
In recent versions of MATLAB (R2018a and later IIRC), you can do this:
>> tt(timerange('01-Jan-2018','day'),:)
ans =
3×2 timetable
Time Var1 Var2
____________________ _______ _______
01-Jan-2018 00:00:00 0.85071 0.55903
01-Jan-2018 08:00:00 0.56056 0.8541
01-Jan-2018 16:00:00 0.92961 0.34788
>> tt(timerange(datetime(2018,1,1),'day'),:)
ans =
3×2 timetable
Time Var1 Var2
____________________ _______ _______
01-Jan-2018 00:00:00 0.85071 0.55903
01-Jan-2018 08:00:00 0.56056 0.8541
01-Jan-2018 16:00:00 0.92961 0.34788
In earlier versions, you can do the same thing, with a bit more typing:
>> tt(timerange(datetime(2018,1,1),datetime(2018,1,2)),:)
ans =
3×2 timetable
Time Var1 Var2
____________________ _______ _______
01-Jan-2018 00:00:00 0.85071 0.55903
01-Jan-2018 08:00:00 0.56056 0.8541
01-Jan-2018 16:00:00 0.92961 0.34788
See Also
Categories
Find more on Dates and Time in Help Center and File Exchange
Tags
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom(English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)