MATLAB Answers

Storing and accessing large amounts of time-based data

Ben on 3 Oct 2016
Answered: Peter Perkins on 14 Sep 2018
General Intro
I am working on a project where the ultimate goal is to analyse data and produce insights from it. However, I'm having a little trouble with the data storage itself, before even starting any analysis. I am hoping someone with more experience who has a feel for this can give some insight.
Data
We are recording ~70 signals from several hydrogen fuel cell vehicles that are driving around. Most of the signals are at 1Hz, about 10 at 5Hz and a handful at slower (0.1Hz, 0.002Hz) frequencies. The cars are driven for anywhere from 5 minutes to 2 hours (roughly) any number of times per day (sometimes 0 for multiple days), but while they are being used the data is being continuously recorded. All signals are doubles.
I currently save everything to one large nested cell/struct structure (which is convenient for sharing with coworkers). This structure is inherited from the reading of the raw signal data. Signals are things like GPS, speed, steering wheel angle, fuel cell voltages, etc. The current data structure works well in some ways, but calculating the mean vehicle speed, for example, requires a 2-level nested for-loop (looping through cars, then trips [a trip is vehicle on -> vehicle off]) instead of a simple mean(ts) as can be done with timeseries objects. Part of the problem is that the range of analyses that could be performed is very broad, so it's not known in advance what the specific query functions will be.
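To illustrate what I mean, the current pattern looks roughly like this (the field names below are just placeholders for the sort of thing in my structure, not its actual layout):

  % nested loops over the structure just to get one statistic
  speeds = [];
  for c = 1:numel(data.cars)                              % loop over vehicles
      for t = 1:numel(data.cars(c).trips)                 % loop over trips (vehicle on -> vehicle off)
          speeds = [speeds; data.cars(c).trips(t).speed(:)]; %#ok<AGROW>
      end
  end
  meanSpeed = mean(speeds);

  % versus, with one concatenated timeseries (or timetable) per car:
  % meanSpeed = mean(ts);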
The dataset currently covers one year of data from a four year project, to give a sense of the scale. It's also possible more vehicles will be added, increasing the amount of data for analysis.
Desires
  • Relatively easy to use, for less MATLAB savvy coworkers
  • Filtering out trips on different conditions, such as: length, signal value, and date
  • Easy to perform statistical analysis
  • Ability to perform analysis on entire dataset easily (across different sample frequencies)
  • Units as a property of a signal
  • (Very optional) Saved on a server (I have both Windows and Linux servers available) so that users can dynamically query the data
Attempts
I have tried several different approaches to this so far:
  1. Single timeseries collection per car and per sample rate. Used append to make them into single timeseries whilst maintaining proper timestamps (I like how this works when plotting all data). This seems to work alright, unless the PC has limited amounts of RAM...
  2. Single timetable per car, with missing data from resampling filled by NaNs (a rough sketch of this follows the list). This is my current in-use method, but at 3,629,763; 2,853,868; and 1,707,074 rows of data (with 70+ signals), the tables use quite a lot of memory.
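Roughly, attempt 2 is built like this (a sketch only: it assumes each sample-rate group already exists as its own timetable, and the variable and signal names are placeholders):

  % pad slower signals with NaN on the union of all time stamps
  ttCar = synchronize(tt1Hz, tt5Hz, ttSlow, 'union', 'fillwithmissing');
  meanSpeed = mean(ttCar.Speed, 'omitnan');   % no per-trip loop needed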
Thoughts
I've been looking at support docs and everything and don't want to put a lot of work into pursuing the wrong solution.
  • Use a timeseries collection for each car and resample the data using ZoH (probably causes errors due to massive TSC size in memory)
  • Use a timetable, but this doesn't seem as applicable for use if you've only got numbers.
  • Use one of the above, but have each trip in a nested array instead of one big table (ease of use of one big table is nice, though).
  • Use one of the above, store it in HDF5 data set (not sure if possible) so data is only loaded as needed.
  • Use a datastore with the above?
  • Double down on "big data" and start learning map-reduce stuff
Edit
I am nowhere near running into limits on storage or memory; that is a future problem I will have, but it's not the important thing to solve now. As mentioned in the comments, there's much I can do to reduce size that has nothing to do with structure.
My MAIN QUESTION is about whether I should use timeseries, timetables or non-time-based arrays, and how I should organise and store these. I have many sets of data from each car (sets can be recorded minutes or hours apart), and I have combined them into a single table per car, as this makes finding things like the mean of all samples very quick, but massive tables aren't very kind to memory. So how do I store and organise the data such that I can easily perform analysis? Memory restrictions can be kept in mind, but they are definitely not a driving force in this at all. My computer handles what I've described just fine.

  2 Comments

dpb on 3 Oct 2016
Is the "~70 signals from several ... vehicles" the current total number of channels or is this per vehicle?
Is there sufficient precision in the measurements to justify double storage? It seems highly likely that single would suffice and would halve the actual memory; shorter integers reflecting the actual A/D bit resolution would save even more...
Ben on 3 Oct 2016
That's 70 per vehicle.
That is true, I could change about half the signals to a simpler data type, but I'm not really having problems with file sizes yet. That would just delay fixing the problem properly, but thanks.


Answers (4)

Kirby Fears on 3 Oct 2016
Edited: Kirby Fears on 3 Oct 2016
Hi Ben,
You have discussed your ideas and questions regarding the Matlab data structures that allow for easy access and manipulation of your data in memory, but how are you storing this data in the first place on your disk? Are you using a database technology like MongoDB or HBase that can be integrated into map-reduce calculations?
If you're not using a database while generating about 2 million rows times 70 signals per year for each car, you will soon have difficulty loading that data into memory in the first place. Imagine your data doubles or triples. What will you do to get it into memory?
Your querying should be done outside of memory through a database, not by indexing/searching through a complex structure that's been completely loaded into your memory from the start.
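For example, with the Database Toolbox and an ODBC/JDBC data source, a query could look something like this (the data source name, table and column names here are made up purely for illustration):

  conn = database('fleetdb', 'username', 'password');    % connect via ODBC/JDBC
  sql  = ['SELECT car, trip_id, ts, speed FROM samples ' ...
          'WHERE ts BETWEEN ''2016-06-01'' AND ''2016-06-30'''];
  T = fetch(conn, sql);       % only the matching rows ever come into memory
  close(conn);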
If you don't have access to databases managed by a professional database administrator, you might consider using Amazon Web Services (or a competitor) to set up data storage and big data processing without the need to own and maintain all that software & hardware in-house.
https://aws.amazon.com/emr/details/hadoop/
Once your data is stored and managed, it's much easier to talk about the data structures needed for your analyses since you'll probably be working with a lot less data in memory to begin with. Probably one of the data structures you have already used will work fine.
Hope this helps.

  8 Comments

Kirby Fears on 6 Oct 2016
Hi Ben,
The HDF5 format allows you to retrieve a subtree of the data, but as far as I know it does not give you a query language that can construct a specific dataset across different branches. For example, if data was stored in a nested format by car, day, then trip ID, you'd have no simple way to ask for all trips between 2pm and 4pm on Tuesdays or Thursdays. The only thing you can do is pull each trip into memory, check the time stamp, then choose which ones to keep.
Using a database is beneficial because you can write complex queries that will allow you to pull very specific data into Matlab's memory, not just lower nodes along your tree.
As you pointed out originally, there's no single best Matlab data structure for arbitrary time series analysis - it depends on what operations you want to perform. If you are pulling exactly the data you need into simple time series, you can avoid heavy cross-cutting manipulation happening in your local machine's memory, such as joining all the 2pm - 4pm trips together into a single time series.
If you were to go ahead with the HDF5 approach, I suggest putting as much query-related information as possible into the tree tags for flexible querying so you can simply stack the results together into a time series or table as needed.
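For instance, something along these lines (a sketch only: the file, group, attribute and helper function names are all made up):

  fname = 'fleet.h5';
  info  = h5info(fname, '/car01');                  % list the trips stored under one car
  for k = 1:numel(info.Groups)
      grp   = info.Groups(k).Name;                  % e.g. '/car01/trip_0042'
      start = h5readatt(fname, grp, 'start_time');  % query info kept as attributes on the group
      if tripMatchesFilter(start)                   % hypothetical filter on the metadata
          speed = h5read(fname, [grp '/speed']);    % only then load the signal data itself
      end
  end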
As for the database approach, anything that uses ODBC or JDBC drivers (aside from MySQL) could work. The speed of execution will depend on the database design and hardware as much as on the choice of database software. For example, it sounds like your data was stored in one large table in MySQL, which makes for very slow searches. I'm not familiar with MonetDB, but I think just about anything will be faster than the MySQL setup. The MonetDB documentation indicates it has ODBC and JDBC drivers. Was MySQL running on a user machine or on the server? If the server, would MonetDB run on that same server?
MongoDB is a well known technology for storing JSON in a document based framework. Each JSON record is separately stored so they can be searched for query tags, like car, day, timestamp, etc, before being returned to you. Mongo can be scaled extremely well, so the sky is the limit for response times if you have the right hardware. You could have 1 record for each trip the car takes, which describes trip start time, end time, what car it was, etc, and also lists all the observations with their timestamps inside.
The Mongo docs are here: https://docs.mongodb.com/manual/core/document/
As for the problem of users who don't know databases - I totally feel you on that one. The best approach for me has been to provide functions to them to use with greatly simplified function parameters. Your function can take care of actually communicating with the database and constructing the right Matlab datatype for the query they performed (some queries might make sense as a table or instead as a timeseries).
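Something like this, for instance (a sketch only: the function name, arguments, query and column names are hypothetical, and it assumes the timestamp column comes back as a datetime):

  function tt = getTrips(car, startTime, endTime)
      % coworkers call this; it hides the database entirely
      conn = database('fleetdb', 'username', 'password');
      sql  = sprintf(['SELECT ts, speed, fc_voltage FROM samples ' ...
                      'WHERE car = ''%s'' AND ts BETWEEN ''%s'' AND ''%s'''], ...
                     car, startTime, endTime);
      T = fetch(conn, sql);
      close(conn);
      tt = table2timetable(T, 'RowTimes', 'ts');   % hand back a ready-to-use timetable
  end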
This is all I have to add on this topic. Sorry I don't have more specific answers for you. Good luck, Ben.
ESala on 26 Jul 2018
Hi @Ben, @Kirby,
I had the same problem where I needed to access InfluxDB from MATLAB, so I made a small client library.
Ben on 26 Jul 2018
Hi @ESala, others, Thanks, that's great! I won't be using it anytime soon, but for reference for myself and others: Link on FileExchange and you can find it in the built-in Add-on installer.



Steven Lord on 26 Jul 2018
In release R2017a we added the capability to work with out-of-memory time-stamped data using a tall timetable. Consider storing your data in files that can be used to create a datastore, then create a tall array from that datastore, and convert or create a timetable from that tall array. See the "Extended Capabilities" section on the documentation page for timetable for more information.
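The workflow could look roughly like this (a sketch only: the file pattern and variable names are placeholders, and it assumes the timestamp column is imported as a datetime):

  ds = datastore('car01_*.csv');                      % one datastore over all log files for a car
  t  = tall(ds);                                      % tall table, read lazily in chunks
  tt = table2timetable(t, 'RowTimes', 'Timestamp');   % tall timetable
  m  = mean(tt.Speed, 'omitnan');                     % deferred calculation
  m  = gather(m);                                     % executes the deferred steps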

  3 Comments

Ben on 26 Jul 2018
I actually looked into tall tables after writing this and at the time they didn't support some of the functions I use (they still may not support everything). I shall further investigate when appropriate, thanks.
Steven Lord on 26 Jul 2018
There were some timetable methods that didn't work for tall timetable arrays in release R2017a. We added tall timetable support for a few more functions (including retime and synchronize in particular) in release R2017b.
If there are particular functions that you want to use on tall table or timetable arrays but they don't work, please let Technical Support know so they can add a request for those specific functions (with your use cases as motivation) to the enhancement database.
Ben on 26 Jul 2018
I will keep that in mind when I start to work on an implementation. Thanks.



Peter Perkins on 14 Sep 2018
Consider this:
>> x = rand(100,5);
>> c1 = mat2cell(x,100,[1 1 1 1 1])

c1 =

  1×5 cell array

    {100×1 double}    {100×1 double}    {100×1 double}    {100×1 double}    {100×1 double}

>> c2 = num2cell(x);
>> whos
  Name        Size       Bytes  Class     Attributes

  c1          1x5         4560  cell
  c2        100x5        60000  cell
  x         100x5         4000  double
Tables are like c1, which splits x into five 100x1 columns, as opposed to c2, which splits it into 500 separate scalars. Mostly people use tables for a medium-to-large number of observations of a small-to-medium number of variables.

  0 Comments



Peter Perkins on 6 Sep 2018
Ben, I'm coming to this post a bit late, and I don't think I have a silver bullet for you, but I wonder if I can get you to say more about, "Easy to perform statistical analysis" and "Ability to perform analysis on entire dataset easily (across different sample frequencies)". It would be helpful to hear a more detailed description of a concrete example or two of the kind of analysis you need to do.
You say, "as this simplifies finding things like the mean of all samples very quickly". It's often the case that people want to compute descriptive statistics, but for each trip, or each car, or grouped in some other way. Is that what you mean, or do you really mean across "all samples"?
Also, I guess I can understand the appeal of one timetable per car, but as you say, with multi-rate data there's a lot of NaN-filling (or ZOH, or interpolation) to be done, and so I'm wondering why you tried using multiple timeseries collections per vehicle to deal with multi-rate, but not multiple timetables per car?
Finally, I wonder if you could say what's making you conclude that a timetable "doesn't seem as applicable for use if you've only got numbers"?
Thanks for shedding any light, if you can.

  3 Comments

Ben on 11 Sep 2018
HA. I can't tell you what I was thinking when I wrote that... I can guess, however.
The statistical analysis that I've done is mostly simple stuff like the mean, count and standard deviation of signals. What I meant by "across different sample frequencies" is not clear to me. What I have done is group other signals according to one signal (such as current gear) and then perform mean etc. on each group.
Yes, that is what I meant. Now I just use mean(Timetable.Speed) where previously I'd need a loop to do the same. I meant that as the advantage of a single TT per car over a nested structure. I don't think there's any case where I want the mean over all cars; typically my statistics are per car, per quarter, per month, per shift lever type, etc.
IIRC I essentially did the timeseries with multiple collections per vehicle because a single collection couldn't handle multiple time rates, or just couldn't handle the volume (on my PC at least). It seems easier to me if they're all in one table, especially when I want to group a 5Hz signal by a 0.1Hz one (making this up as an example) or something.
I think at the time I meant regular arrays are more memory compact and quicker to access than timetables? I don't really know.
I've pretty much settled on timetables now. Though I pull data out of the TT to use grpstats. Retiming a TT from mixed to monthly bins actually covers a bit of the statistics we do.
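For reference, the sort of thing I mean is roughly this (variable and signal names are placeholders):

  monthly = retime(TT, 'monthly', 'mean');                        % monthly means of every signal
  S = grpstats(timetable2table(TT), 'Gear', {'mean','std'}, ...   % per-gear stats via grpstats
               'DataVars', 'Speed');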
Peter Perkins on 12 Sep 2018
OK, thanks. It sounds like grouped calculations in one form or another on timetables is working OK for you.
Depending on what kind of data you have, it may not be true that one numeric array uses less memory than a timetable with all numeric column variables. As long as the timetable is not very wide, there's very little extra memory needed to cut the numeric array into columns. Performance differences depend on what you are doing.
But it's not "wrong" to put all numeric variables into a table or a timetable if the variables are conceptually different things (perhaps different units) because in those cases you're unlikely to want to treat them all as one homogeneous matrix.
It all depends on how you want to think of your data.
Ben on 13 Sep 2018
Huh, TIL. I'm curious where the limit of "not very wide" lies.
Indeed. The way we think of our data makes Timetables a good option.
