Statistics of datastore of tabular data
Hey all,
I have thousands of Parquet files. Each file has more than 50,000 rows of numerical data and more than 100 columns. My data can't fit in memory, so I use datastores to import and handle the data for a downstream machine learning workflow. I would like to know if it is possible to calculate some statistics (max, min, mean, std for each channel) of each file during the datastore creation process, which I can use afterwards to filter and select the relevant segments of data for my downstream analysis.
Thanks in advance
Accepted Answer
Abhas
on 26 Mar 2024
Hi Omar,
A datastore does not read the file contents when it is created, so the statistics cannot be computed at creation time itself. You can, however, attach a transform that calculates them (max, min, mean, std for each channel) for each chunk as it is read, and then use the results to filter and select relevant data segments for downstream analysis. The steps are:
- Create a Datastore: Initialize a 'datastore' for your Parquet files.
- Define Custom Function: Create a function to compute the desired statistics for each chunk of data.
- Apply Transformation: Use the 'transform' function to apply your custom statistics calculation to the datastore.
- Read and Aggregate Statistics: Iterate over the transformed datastore to read the statistics of each chunk and aggregate them globally.
- Use Statistics for Filtering: Leverage the aggregated statistics to filter and select relevant data segments.
Here's the MATLAB code to reflect the above steps:
% Step 1: Create Your Datastore
% 'ReadSize','file' makes each read return one whole file, so the statistics are per file
ds = parquetDatastore('path/to/your/parquet/files/*.parquet', 'ReadSize', 'file');
% Step 2: Define Your Custom Function
% Save this function as calculateStats.m, or move it to the end of the script
function statsTable = calculateStats(tbl)
    % Returns a one-row table with the min, max, mean and std of every variable.
    % varfun prefixes each name (e.g. min_Var1), so the columns end up as min_Var1_min, etc.
    statsTable = varfun(@min, tbl, 'OutputFormat', 'table');
    statsTable.Properties.VariableNames = strcat(statsTable.Properties.VariableNames, '_min');
    maxTable = varfun(@max, tbl, 'OutputFormat', 'table');
    maxTable.Properties.VariableNames = strcat(maxTable.Properties.VariableNames, '_max');
    statsTable = [statsTable, maxTable];
    meanTable = varfun(@mean, tbl, 'OutputFormat', 'table');
    meanTable.Properties.VariableNames = strcat(meanTable.Properties.VariableNames, '_mean');
    statsTable = [statsTable, meanTable];
    stdTable = varfun(@std, tbl, 'OutputFormat', 'table');
    stdTable.Properties.VariableNames = strcat(stdTable.Properties.VariableNames, '_std');
    statsTable = [statsTable, stdTable];
end
% Step 3: Apply the Transformation
% Keep ds unchanged so the raw data can still be read after filtering
statsDs = transform(ds, @calculateStats);
% Step 4: Read and Aggregate the Statistics
globalMin = inf; % Expands to one value per channel on the first update
while hasdata(statsDs)
    statsChunk = read(statsDs);
    minCols = endsWith(statsChunk.Properties.VariableNames, '_min');
    chunkMin = min(statsChunk{:, minCols}, [], 1); % Per-channel minimum of this chunk
    globalMin = min(globalMin, chunkMin);
    % Update the global max the same way. Mean and std cannot be combined like this:
    % they also need the row count of each chunk.
end
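The loop above only tracks the minimum, and the maximum can be handled the same way. The mean and standard deviation are different: averaging per-chunk means or stds gives the wrong result when chunks differ in size or level. One way to combine them is the pairwise update of count, mean, and sum of squared deviations, sketched here. This is not part of the original answer. `chunks` is a hypothetical in-memory stand-in for successive `read` results, and in the datastore workflow the transform function would also have to return the row count of each chunk (for example `height(tbl)`).

```matlab
% Sketch: combine per-chunk count, mean and std into global per-channel values.
% chunks is a hypothetical stand-in for the data returned by successive reads.
rng(0);
chunks = {randn(500, 3), 2 + randn(300, 3), 5 * randn(700, 3)};

n  = 0;             % rows seen so far
mu = zeros(1, 3);   % running per-channel mean
M2 = zeros(1, 3);   % running per-channel sum of squared deviations

for k = 1:numel(chunks)
    x   = chunks{k};
    nK  = size(x, 1);                  % rows in this chunk
    muK = mean(x, 1);                  % per-channel mean of this chunk
    M2K = var(x, 0, 1) * (nK - 1);     % per-channel sum of squared deviations

    delta = muK - mu;
    nNew  = n + nK;
    mu    = mu + delta * (nK / nNew);
    M2    = M2 + M2K + delta.^2 * (n * nK / nNew);
    n     = nNew;
end

globalMean = mu;
globalStd  = sqrt(M2 / (n - 1));       % same N-1 normalization as std's default
```

Only `n`, `mu` and `M2` are kept between chunks, so memory use does not grow with the number of files.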
At this point, you have the aggregated statistics (e.g., globalMin) which you can use to filter and select relevant segments of your data for further analysis.
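To select individual files rather than work with a single global value, it can help to keep one row of statistics per file. The sketch below is one possible way to do that, not the method from the answer itself. It writes three small Parquet files to a temporary folder as stand-ins for the real data, and it assumes that `'ReadSize','file'` yields one read per file in the same order as the `Files` property. The column names `ch1` and `ch2`, the inline `statsFcn`, and the threshold of 50 are all made up for illustration.

```matlab
% Write three small Parquet files to a temporary folder (stand-ins for real data).
folder = tempname;
mkdir(folder);
scales = [1 10 100];
for k = 1:numel(scales)
    T = table(scales(k) * (1:5)', (1:5)', 'VariableNames', {'ch1', 'ch2'});
    parquetwrite(fullfile(folder, sprintf('part%d.parquet', k)), T);
end

% One read per file, so row i of fileStats is assumed to describe files(i).
rawDs = parquetDatastore(fullfile(folder, '*.parquet'), 'ReadSize', 'file');
files = string(rawDs.Files(:));

% varfun names its outputs max_ch1, max_ch2, std_ch1, std_ch2.
statsFcn  = @(t) [varfun(@max, t), varfun(@std, t)];
fileStats = readall(transform(rawDs, statsFcn));   % one row per file
fileStats.File = files;

% Hypothetical criterion: keep files whose ch1 maximum is at least 50.
keep       = fileStats.max_ch1 >= 50;
selectedDs = parquetDatastore(files(keep), 'ReadSize', 'file');
```

`fileStats` has only one row per file, so it fits in memory even for thousands of files, and `selectedDs` can be passed to the downstream workflow in place of the full datastore.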
For a better understanding of working with datastores and transform in MATLAB, refer to the MATLAB documentation for parquetDatastore and transform.