MATLAB Answers

Storing and accessing large amounts of time-based data

Ben on 3 Oct 2016
Answered: Peter Perkins on 14 Sep 2018
General Intro
I am working on a project where the ultimate goal is to analyse data and produce insights from it. However, I'm having a little trouble with the data storage itself, before even starting any analysis. I am hoping someone with more experience who has a feel for this can give some insight.
Data
We are recording ~70 signals from several hydrogen fuel cell vehicles that are driving around. Most of the signals are at 1Hz, about 10 at 5Hz and a handful at slower (0.1Hz, 0.002Hz) frequencies. The cars are driven for anywhere from 5 minutes to 2 hours (roughly) any number of times per day (sometimes 0 for multiple days), but while they are being used the data is being continuously recorded. All signals are doubles.
I currently save everything to one large nested cell/struct structure (which is convenient for sharing with coworkers). This structure is inherited from the reading of the raw signal data. Signals are things like GPS, speed, steering wheel angle, fuel cell voltages, etc. The current data structure works well in some ways, but calculating the mean vehicle speed, for example, requires a 2-level nested for-loop (looping through cars, then trips [a trip is vehicle on -> vehicle off]) instead of a simple mean(ts) as can be done with timeseries objects. Part of the problem is that the range of analyses that could be performed is very broad, so it's not known in advance what the specific query functions will be.
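To illustrate what I mean, the current pattern looks roughly like this (the field names below are just placeholders for the sort of thing in my structure, not its actual layout):

  % nested loops over the structure just to get one statistic
  speeds = [];
  for c = 1:numel(data.cars)                              % loop over vehicles
      for t = 1:numel(data.cars(c).trips)                 % loop over trips (vehicle on -> vehicle off)
          speeds = [speeds; data.cars(c).trips(t).speed(:)]; %#ok<AGROW>
      end
  end
  meanSpeed = mean(speeds);

  % versus, with one concatenated timeseries (or timetable) per car:
  % meanSpeed = mean(ts);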
The dataset currently covers one year of data from a four year project, to give a sense of the scale. It's also possible more vehicles will be added, increasing the amount of data for analysis.
Desires
  • Relatively easy to use, for less MATLAB savvy coworkers
  • Filtering out trips on different conditions, such as: length, signal value, and date
  • Easy to perform statistical analysis
  • Ability to perform analysis on entire dataset easily (across different sample frequencies)
  • Units as a property of a signal
  • (Very optional) Saved on a server (I have both Windows and Linux servers available) so that users can dynamically query the data
Attempts
I have tried several different approaches to this so far:
  1. Single timeseries collection per car and per sample rate. Used append to make them into single timeseries whilst maintaining proper timestamps (I like how this works when plotting all data). This seems to work alright, unless the PC has limited amounts of RAM...
  2. Single timetable per car, with missing data from resampling filled by NaNs (a rough sketch of this follows the list). This is my current in-use method, but at 3,629,763; 2,853,868; and 1,707,074 rows of data (with 70+ signals), the tables use quite a lot of memory.
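Roughly, attempt 2 is built like this (a sketch only: it assumes each sample-rate group already exists as its own timetable, and the variable and signal names are placeholders):

  % pad slower signals with NaN on the union of all time stamps
  ttCar = synchronize(tt1Hz, tt5Hz, ttSlow, 'union', 'fillwithmissing');
  meanSpeed = mean(ttCar.Speed, 'omitnan');   % no per-trip loop needed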
Thoughts
I've been looking at support docs and everything and don't want to put a lot of work into pursuing the wrong solution.
  • Use a timeseries collection for each car and resample the data using ZoH (probably causes errors due to massive TSC size in memory)
  • Use a timetable, but this doesn't seem as applicable for use if you've only got numbers.
  • Use one of the above, but have each trip in a nested array instead of one big table (ease of use of one big table is nice, though).
  • Use one of the above, store it in HDF5 data set (not sure if possible) so data is only loaded as needed.
  • Use a datastore with the above?
  • Double down on "big data" and start learning map-reduce stuff
Edit
I am nowhere near running into limits on storage or memory; that is a future problem I will have, but it's not the important thing to solve now. As mentioned in the comments, there's much I can do to reduce size that has nothing to do with structure.
My MAIN QUESTION is about whether I should use timeseries, timetables or non-time-based arrays, and how I should organise and store these. I have many sets of data from each car (sets can be recorded minutes or hours apart), and I have combined them into a single table per car, as this makes finding things like the mean of all samples very quick, but massive tables aren't very kind to memory. So how do I store and organise the data such that I can easily perform analysis? Memory restrictions can be kept in mind, but they are definitely not a driving force in this at all. My computer handles what I've described just fine.

  2 Comments

dpb on 3 Oct 2016
Is the "~70 signals from several ... vehicles" the current total number of channels or is this per vehicle?
Is there sufficient precision in the measurements to justify double storage? It seems highly likely that single would suffice and would halve the actual memory; shorter integers reflecting the actual A/D bit resolution would save even more...
Ben on 3 Oct 2016
That's 70 per vehicle.
That is true, I could change about half the signals to a simpler data type, but I'm not really having problems with file sizes yet. That would just delay fixing the problem properly, but thanks.


Answers (4)

Kirby Fears on 3 Oct 2016
Edited: Kirby Fears on 3 Oct 2016
Hi Ben,
You have discussed your ideas and questions regarding the Matlab data structures that allow for easy access and manipulation of your data in memory, but how are you storing this data in the first place on your disk? Are you using a database technology like MongoDB or HBase that can be integrated into map-reduce calculations?
If you're not using a database while generating about 2 million rows times 70 signals per year for each car, you will soon have difficulty loading that data into memory in the first place. Imagine your data doubles or triples. What will you do to get it into memory?
Your querying should be done outside of memory through a database, not by indexing/searching through a complex structure that's been completely loaded into your memory from the start.
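For example, with the Database Toolbox and an ODBC/JDBC data source, a query could look something like this (the data source name, table and column names here are made up purely for illustration):

  conn = database('fleetdb', 'username', 'password');    % connect via ODBC/JDBC
  sql  = ['SELECT car, trip_id, ts, speed FROM samples ' ...
          'WHERE ts BETWEEN ''2016-06-01'' AND ''2016-06-30'''];
  T = fetch(conn, sql);       % only the matching rows ever come into memory
  close(conn);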
If you don't have access to databases managed by a professional database administrator, you might consider using Amazon Web Services (or a competitor) to set up data storage and big data processing without the need to own and maintain all that software & hardware in-house.
https://aws.amazon.com/emr/details/hadoop/
Once your data is stored and managed, it's much easier to talk about the data structures needed for your analyses since you'll probably be working with a lot less data in memory to begin with. Probably one of the data structures you have already used will work fine.
Hope this helps.

  8 Comments

Kirby Fears on 6 Oct 2016
Hi Ben,
The HDF5 format allows you to retrieve a subtree of the data, but as far as I know it does not give you a query language that can construct a specific dataset across different branches. For example, if data was stored in a nested format by car, day, then trip ID, you'd have no simple way to ask for all trips between 2pm and 4pm on Tuesdays or Thursdays. The only thing you can do is pull each trip into memory, check the time stamp, then choose which ones to keep.
Using a database is beneficial because you can write complex queries that will allow you to pull very specific data into Matlab's memory, not just lower nodes along your tree.
As you pointed out originally, there's no single best Matlab data structure for arbitrary time series analysis - it depends on what operations you want to perform. If you are pulling exactly the data you need into simple time series, you can avoid heavy cross-cutting manipulation happening in your local machine's memory, such as joining all the 2pm - 4pm trips together into a single time series.
If you were to go ahead with the HDF5 approach, I suggest putting as much query-related information as possible into the tree tags for flexible querying so you can simply stack the results together into a time series or table as needed.
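For instance, something along these lines (a sketch only: the file, group, attribute and helper function names are all made up):

  fname = 'fleet.h5';
  info  = h5info(fname, '/car01');                  % list the trips stored under one car
  for k = 1:numel(info.Groups)
      grp   = info.Groups(k).Name;                  % e.g. '/car01/trip_0042'
      start = h5readatt(fname, grp, 'start_time');  % query info kept as attributes on the group
      if tripMatchesFilter(start)                   % hypothetical filter on the metadata
          speed = h5read(fname, [grp '/speed']);    % only then load the signal data itself
      end
  end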
As for the database approach, anything that uses ODBC or JDBC drivers (aside from MySQL) could work. The speed of execution will depend on the database design and hardware as much as on the choice of database software. For example, it sounds like your data was stored in one large table in MySQL, which makes for very slow searches. I'm not familiar with MonetDB, but I think just about anything will be faster than the MySQL setup. The MonetDB documentation indicates it has ODBC and JDBC drivers. Was MySQL running on a user machine or on the server? If the server, would MonetDB run on that same server?
MongoDB is a well known technology for storing JSON in a document based framework. Each JSON record is separately stored so they can be searched for query tags, like car, day, timestamp, etc, before being returned to you. Mongo can be scaled extremely well, so the sky is the limit for response times if you have the right hardware. You could have 1 record for each trip the car takes, which describes trip start time, end time, what car it was, etc, and also lists all the observations with their timestamps inside.
The Mongo docs are here: https://docs.mongodb.com/manual/core/document/
As for the problem of users who don't know databases - I totally feel you on that one. The best approach for me has been to provide functions to them to use with greatly simplified function parameters. Your function can take care of actually communicating with the database and constructing the right Matlab datatype for the query they performed (some queries might make sense as a table or instead as a timeseries).
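Something like this, for instance (a sketch only: the function name, arguments, query and column names are hypothetical, and it assumes the timestamp column comes back as a datetime):

  function tt = getTrips(car, startTime, endTime)
      % coworkers call this; it hides the database entirely
      conn = database('fleetdb', 'username', 'password');
      sql  = sprintf(['SELECT ts, speed, fc_voltage FROM samples ' ...
                      'WHERE car = ''%s'' AND ts BETWEEN ''%s'' AND ''%s'''], ...
                     car, startTime, endTime);
      T = fetch(conn, sql);
      close(conn);
      tt = table2timetable(T, 'RowTimes', 'ts');   % hand back a ready-to-use timetable
  end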
This is all I have to add on this topic. Sorry I don't have more specific answers for you. Good luck, Ben.
ESala on 26 Jul 2018
Hi @Ben, @Kirby,
I had the same problem where I needed to access InfluxDB from MATLAB, so I made a small client library.
Ben on 26 Jul 2018
Hi @ESala, others, Thanks, that's great! I won't be using it anytime soon, but for reference for myself and others: Link on FileExchange and you can find it in the built-in Add-on installer.



Steven Lord on 26 Jul 2018
In release R2017a we added the capability to work with out-of-memory time-stamped data using a tall timetable. Consider storing your data in files that can be used to create a datastore, then create a tall array from that datastore, and convert or create a timetable from that tall array. See the "Extended Capabilities" section on the documentation page for timetable for more information.
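The workflow could look roughly like this (a sketch only: the file pattern and variable names are placeholders, and it assumes the timestamp column is imported as a datetime):

  ds = datastore('car01_*.csv');                      % one datastore over all log files for a car
  t  = tall(ds);                                      % tall table, read lazily in chunks
  tt = table2timetable(t, 'RowTimes', 'Timestamp');   % tall timetable
  m  = mean(tt.Speed, 'omitnan');                     % deferred calculation
  m  = gather(m);                                     % executes the deferred steps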

  3 Comments

Ben on 26 Jul 2018
I actually looked into tall tables after writing this and at the time they didn't support some of the functions I use (they still may not support everything). I shall further investigate when appropriate, thanks.
Steven Lord on 26 Jul 2018
There were some timetable methods that didn't work for tall timetable arrays in release R2017a. We added tall timetable support for a few more functions (including retime and synchronize in particular) in release R2017b.
If there are particular functions that you want to use on tall table or timetable arrays but they don't work, please let Technical Support know so they can add a request for those specific functions (with your use cases as motivation) to the enhancement database.
Ben on 26 Jul 2018
I will keep that in mind when I start to work on an implementation. Thanks.



Peter Perkins on 14 Sep 2018
Consider this:
>> x = rand(100,5);
>> c1 = mat2cell(x,100,[1 1 1 1 1])

c1 =

  1×5 cell array

    {100×1 double}    {100×1 double}    {100×1 double}    {100×1 double}    {100×1 double}

>> c2 = num2cell(x);
>> whos
  Name        Size       Bytes  Class     Attributes

  c1          1x5         4560  cell
  c2        100x5        60000  cell
  x         100x5         4000  double
Tables are like c1, which splits x into five 100x1 columns, as opposed to c2, which splits it into 500 separate scalars. Mostly people use tables for a medium-to-large number of observations of a small-to-medium number of variables.

  0 Comments



Peter Perkins on 6 Sep 2018
Ben, I'm coming to this post a bit late, and I don't think I have a silver bullet for you, but I wonder if I can get you to say more about, "Easy to perform statistical analysis" and "Ability to perform analysis on entire dataset easily (across different sample frequencies)". It would be helpful to hear a more detailed description of a concrete example or two of the kind of analysis you need to do.
You say, "as this simplifies finding things like the mean of all samples very quickly". It's often the case that people want to compute descriptive statistics, but for each trip, or each car, or grouped in some other way. Is that what you mean, or do you really mean across "all samples"?
Also, I guess I can understand the appeal of one timetable per car, but as you say, with multi-rate data there's a lot of NaN-filling (or ZOH, or interpolation) to be done, and so I'm wondering why you tried using multiple timeseries collections per vehicle to deal with multi-rate, but not multiple timetables per car?
Finally, I wonder if you could say what's making you conclude that a timetable "doesn't seem as applicable for use if you've only got numbers"?
Thanks for shedding any light, if you can.

  3 Comments

Ben on 11 Sep 2018
HA. I can't tell you what I was thinking when I wrote that... I can guess, however.
The statistical analysis that I've done is mostly simple stuff like the mean, count and standard deviation of signals. What I meant by "across different sample frequencies" is not clear to me. What I have done is group other signals according to one signal (such as current gear) and then perform mean etc. on each group.
Yes, that is what I meant. Now I just use mean(Timetable.Speed) where previously I'd need a loop to do the same. I meant that as the advantage of a single TT per car over a nested structure. I don't think there's any case where I want the mean over all cars; typically my statistics are per car, per quarter, per month, per shift lever type, etc.
IIRC I essentially did the timeseries with multiple collections per vehicle because a single collection couldn't handle multiple time rates, or just couldn't handle the volume (on my PC at least). It seems easier to me if they're all in one table, especially when I want to group a 5Hz signal by a 0.1Hz one (making this up as an example) or something.
I think at the time I meant regular arrays are more memory compact and quicker to access than timetables? I don't really know.
I've pretty much settled on timetables now. Though I pull data out of the TT to use grpstats. Retiming a TT from mixed to monthly bins actually covers a bit of the statistics we do.
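For reference, the sort of thing I mean is roughly this (variable and signal names are placeholders):

  monthly = retime(TT, 'monthly', 'mean');                        % monthly means of every signal
  S = grpstats(timetable2table(TT), 'Gear', {'mean','std'}, ...   % per-gear stats via grpstats
               'DataVars', 'Speed');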
Peter Perkins on 12 Sep 2018
OK, thanks. It sounds like grouped calculations in one form or another on timetables is working OK for you.
Depending on what kind of data you have, it may not be true that one numeric array uses less memory than a timetable with all numeric column variables. As long as the timetable is not very wide, there's very little extra memory needed to cut the numeric array into columns. Performance differences depend on what you are doing.
But it's not "wrong" to put all numeric variables into a table or a timetable if the variables are conceptually different things (perhaps different units) because in those cases you're unlikely to want to treat them all as one homogeneous matrix.
It all depends on how you want to think of your data.
Ben on 13 Sep 2018
Huh, TIL. I'm curious where the limit of "not very wide" lies.
Indeed. The way we think of our data makes Timetables a good option.
