Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving
Hi,
I'm currently dealing with a large number of .csv files stored in folders named by date. The aim is to splice together the data sharing the same .csv filename (across all folders) and save the result for further use.
First, I walk all the folders and use the unique function to derive a list of .csv filenames. Next, I splice the data with identical filenames together in date order.
data = [];
for i = 1:num_folder
    data = [data; readtable(tmp_file)]; % splicing: append this folder's table
end
save(strcat(pwd,'\',filename,'.mat'),'data') % Saving
Saving uses -v7.3 by default. (The file sizes range from hundreds of MB to several GB.)
The whole process runs quite slowly. The profile viewer shows that the save function consumes 70% of the time and the splicing the other 30%. I would love to know how to improve the time and space efficiency of my code.
[P.S.: The original .csv files have 60 columns. The majority of them are numbers (double). One is filled with 'true' or 'false'; one is filled with 'HH:MM:SS.TTT' and another with 'HH:MM:SS.TTTTTT'. (Since I read them with the readtable function, they are stored in table format.) I tried to 'simplify' the table by converting all elements to double (I thought it could save some space?) but failed to correctly retain all the microseconds (TTTTTT). textscan only gives me HH, MM, and SS.TTTT.]
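One restructuring that often helps with loops like the one above is to collect each folder's table in a cell array and concatenate once at the end, rather than growing `data` every iteration; and since the profiler blames save, the '-nocompression' flag (available for -v7.3 files since R2017b) trades file size for save speed. A rough, untested sketch, with `folders` standing in for your own list of folder paths:

```matlab
% Sketch: concatenate once instead of growing `data` inside the loop.
% `folders` and `filename` are placeholders for your own variables.
parts = cell(num_folder, 1);
for i = 1:num_folder
    tmp_file = fullfile(folders{i}, [filename '.csv']);
    parts{i} = readtable(tmp_file);          % read each folder's copy
end
data = vertcat(parts{:});                    % single concatenation, no repeated copying
% '-nocompression' (R2017b+) skips the compression pass that dominates save time
save(fullfile(pwd, [filename '.mat']), 'data', '-v7.3', '-nocompression')
```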
4 Comments
Need to see a short section of one of the text files to know the specifics of the time format. datetime can't store times alone, so you would need to convert to the duration class; it can store up to 9 digits of fractional seconds.
For save, heterogeneous data types such as the table create additional overhead, and the default compression also takes time.
If the numeric data aren't needed at the full 15 (±) digits of precision, you can halve their storage by using single precision.
You can also save a sizable amount of storage by converting the T/F variables to categorical; just checking, a table with 40 T/F values stored as cellstr() occupied 5498 bytes in memory; the same table content as a categorical variable was only 1246 bytes. There's a little overhead associated with a categorical array, but once the size gets to be more than a handful, the storage savings grow very rapidly.
The most concise (and fastest) storage, although less convenient, would be to write the data as native data types to a stream (aka 'binary') file; compression then would probably not gain much unless there are long runs of identical values in the data.
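To make the categorical comparison above concrete, a minimal sketch (exact byte counts will vary by release):

```matlab
% Compare memory for 40 T/F values stored as cellstr vs categorical
tf = repmat({'true'; 'false'}, 20, 1);   % 40 cellstr values
T1 = table(tf);
T2 = table(categorical(tf));
whos T1 T2                               % T2 should be markedly smaller
```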
Frank
on 1 Feb 2023
Accepted Answer
More Answers (1)
As usual, when things are not totally plain vanilla in some fashion, you need to make use of the extended features supplied in MATLAB. We needed only a half-dozen lines to see the file format; I made a shorter version of the original with the following code (which I post as it may give you some ideas for the "splicing" operations you spoke of):
fid=fopen('OI401.csv','r');
fod=fopen('OI401_1.csv','w');
for i=1:100, fwrite(fod,fgets(fid)); end
fclose('all')
to leave me with a big enough file to do something with but still not be huge...
Anyways, that done,
opt=detectImportOptions('OI401.csv'); % create basic import object
opt=setvartype(opt,{'tradingday','logDay'},"datetime"); % set datetime data type
opt=setvaropts(opt,{'tradingday','logDay'},"InputFormat",'yyyyDDDD'); % and the input format
opt=setvartype(opt,'isTrading','logical');
opt.SelectedVariableNames=opt.VariableNames([1:6 end-4:end]); % just pick a small subset for demo
tOI=readtable('OI401_1.csv',opt); % and read in with this help
The above returned
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
And, I didn't think to do it on the import option object, so to prove what we got,
>> ans.tradingday.Format='default'
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ ___________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
shows the dates were imported as datetime.
Now, as far as what to use for analyses, that'll be your call and will depend heavily upon what you intend to do with the data once you have it. There are all kinds of builtin analysis tools for tables that let you select variables by variable name that are quite convenient; whether they'll be of any direct use for your purposes will all depend upon what those purposes are...
14 Comments
dpb
on 1 Feb 2023
Back to the previous comments on memory -- given the data are financial, there would appear to be no need for double precision, so using single precision for the numeric data would seem worthwhile (as long as any toolbox you're planning on using has been extended to handle it; singles were red-haired stepchildren for a long time, and I'm not sure all toolboxes even yet fully support them).
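A sketch of that single-precision conversion, written as a plain loop so it also works on older releases (on R2018b+ you could use convertvars(tOI,@isnumeric,'single') instead); `tOI` stands in for the spliced table:

```matlab
% Convert every double variable in the table to single, halving its storage
for v = tOI.Properties.VariableNames
    if isa(tOI.(v{1}), 'double')
        tOI.(v{1}) = single(tOI.(v{1}));
    end
end
```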
Frank
on 2 Feb 2023
dpb
on 2 Feb 2023
Look at the posted section again; the two time-of-day fields are already converted to durations and include the full precision of the inputs, down to the microseconds. The datenum data type, being a double, has insufficient precision past milliseconds to store that resolution.
If the date format is actually year/month/day, then you need to read the doc for the datetime input format -- the abbreviations for it are NOT the same as those for datenum; lowercase "m" is a minute field, NOT month -- months are capital "M". That's where the minutes value is coming from; you should have gotten a warning about an apparent time field in the date position when you set that input format.
Is the input format for the date actually in yyyyMMdd and not the day of year? If so, then use
opt=setvaropts(opt,'tradingday',"InputFormat",'yyyyMMdd');
instead and you can compute the actual date/time of the observation as
tOI.Update=tOI.tradingday+tOI.updatetime;
Read the docs carefully...
Frank
on 2 Feb 2023
>> opt
opt =
DelimitedTextImportOptions with properties:
Format Properties:
Delimiter: {','}
...
Variable Import Properties: Set types by name using setvartype
VariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
VariableTypes: {'duration', 'datetime', 'double' ... and 62 more}
SelectedVariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
...
>>
Dumping the import object shows that R2020b recognized them as durations automagically...I'd recommend upgrading if R2017x doesn't; but you can work around it by importing as string instead and then manually converting to duration...there are quite a number of other niceties that have been introduced since then that could be helpful...
>> opt=setvartype(opt,'updatetime','string');
>> tOI=readtable('OI401.csv',opt);
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
______________ __________ _________ _______ _______ _______ _______________ _______________ ________ ______ _________
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:21.430146 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:31.430749 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:41.433568 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:51.431701 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:01.431491 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:11.436309 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:21.435230 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:31.438434 20230117 0.00 true
>> ans.updatetime=duration(ans.updatetime)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ _______ _______ _______ _______________ _______________ ________ ______ _________
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:21.430146 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:31.430749 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:41.433568 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:51.431701 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:01.431491 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:11.436309 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:21.435230 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:31.438434 20230117 0.00 true
>> ans.tradingday+ans.updatetime
ans =
8×1 datetime array
20230117
20230117
20230117
20230117
20230117
20230117
20230117
20230117
>> ans.Format='default';
>> ans.Format=[ans.Format '.SSS']
ans =
8×1 datetime array
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
>>
Gets you the absolute time vector in all its glory.
There's no point in trying to take datetime back to the big bang; there's no practical use for datetime(0,0,0) and certainly no need/use for it here.
dpb
on 2 Feb 2023
And I'd not worry about time comparisons with datetime unless (and until) profiling has proven them to be a performance bottleneck. I'll venture they won't be; you can do direct comparisons of equality or greater/less than with builtin operator support, and there's the specific isbetween as well, which can be helpful.
Use the MATLAB facilities as intended/supplied first; THEN, if the application is too slow, you can look into trying to optimize it. But premature optimization before you even know there is a problem is likely to be misguided, providing little actual improvement at best and making coding/debugging/performance worse than the straightforward implementation at worst.
Actually, I wasn't thinking -- the way, if R2017x readtable still doesn't know about importing the duration class directly, is to go ahead and let it use datetime, then use timeofday to get the time portion...
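A small sketch of the builtin datetime comparisons just mentioned, on made-up timestamps:

```matlab
% Direct comparisons and isbetween on datetime arrays
t  = datetime(2023,1,17) + hours(0:23)';     % hypothetical hourly timestamps
t0 = datetime(2023,1,17,9,0,0);
t1 = datetime(2023,1,17,17,0,0);
late    = t > t1;                            % builtin relational operator
session = t(isbetween(t, t0, t1));           % closed-interval selection
```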
Picking up from where we left off with the import object above but fixing up the update and log time variables, we do the following --
opt=setvartype(opt,{'updatetime','logTime'},'datetime');
opt=setvaropts(opt,{'updatetime','logTime'},"InputFormat",'HH:mm:ss.SSSSSS');
tOI=readtable('OI401.csv',opt);
tOI.updatetime=timeofday(tOI.updatetime);
tOI.logTime=timeofday(tOI.logTime);
I'm not sure there is a way to do the conversion in place on the two datetime variables together with higher-level addressing; all the other forms I could think of didn't like changing the class of the datetime to a duration.
Either way, there are at least two solutions although the easiest if possible is to upgrade to a later release that understands the duration type on import/reading of the table to begin with.
Frank
on 3 Feb 2023
dpb
on 3 Feb 2023
Well, of course simply moving bytes from one location to another is far faster than doing all the loading, data manipulation behind the scenes, and then rewriting it all back out again... that's obviously the "'round Robin Hood's barn" way to go at it for the purpose.
Frank
on 6 Feb 2023
Frank
on 6 Feb 2023
dpb
on 6 Feb 2023
I think you can accept more than one Answer if more than one was useful??? Not positive it will keep both; haven't explored that much. I do know that reputation points aren't taken away from an earlier respondent if a second is selected...
It's not like we get anything other than personal satisfaction out of it, anyway... :) My penchant is to try to teach as well as "just answer"; hence the examples of some less-often-used facilities I notice newcomers tend to overlook or not use...
Frank
on 9 Feb 2023