fileEnsembleDatastore

Manage ensemble data in custom file format

Description

A fileEnsembleDatastore object is a datastore specialized for use in developing algorithms for condition monitoring and predictive maintenance using measured data.

An ensemble is a collection of member data stored in a collection of files. The fileEnsembleDatastore object specifies the data variables, independent variables, and condition variables in the ensemble. You provide functions that tell the fileEnsembleDatastore object how to read each type of variable from the collection of files. Therefore, you can use fileEnsembleDatastore to manage ensemble data stored in any file format or configuration of variables.

The data for a fileEnsembleDatastore object can be stored at any location supported by MATLAB® datastores, including remote locations, such as cloud storage using Amazon S3™ (Simple Storage Service), Windows Azure® Blob Storage, and Hadoop® Distributed File System (HDFS™).

For a detailed example illustrating the use of a file ensemble datastore, see File Ensemble Datastore With Measured Data. For general information about data ensembles in Predictive Maintenance Toolbox™, see Data Ensembles for Condition Monitoring and Predictive Maintenance.

Creation

Syntax

fensemble = fileEnsembleDatastore(location,extension)
fensemble = fileEnsembleDatastore(location,extension,Name,Value)

Description

fensemble = fileEnsembleDatastore(location,extension) creates a fileEnsembleDatastore object that points to data at the file path specified by location and having the specified file extension. Set properties of the object to specify the functions for reading from and writing to the ensemble datastore.

fensemble = fileEnsembleDatastore(location,extension,Name,Value) specifies additional properties of the object using one or more name-value pair arguments. For example, using 'ConditionVariables',["FaultCond";"ID"] specifies the condition variables when you create the object.
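As a minimal sketch of the name-value syntax, the following script builds a throwaway folder containing one MAT-file and creates a datastore over it. The folder name fensembleDemo, the file name member_01.mat, and the stored values are hypothetical and chosen only for illustration.

```matlab
% Create a temporary folder with one MAT-file (all names are hypothetical).
location = fullfile(tempdir,'fensembleDemo');
if ~isfolder(location)
    mkdir(location);
end
FaultCond = "healthy";
ID = 1;
save(fullfile(location,'member_01.mat'),'FaultCond','ID');

% Set the condition variables at creation using a name-value pair.
fensemble = fileEnsembleDatastore(location,'.mat', ...
    'ConditionVariables',["FaultCond";"ID"]);
disp(fensemble.ConditionVariables)
```

Setting the property afterward with fensemble.ConditionVariables = ["FaultCond";"ID"] should give the same configuration; the name-value form only saves a step.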

Input Arguments

location

Files or folders from which to read ensemble data, specified as a string, character vector, string array, or cell array of character vectors. If the files are not in the current folder, then location must contain full or relative paths.

If you specify a folder, then fileEnsembleDatastore uses all files in that folder with the extension specified by extension. Alternatively, specify an explicit list of files to include. You can also use the wildcard character (*) when specifying location. This character indicates that all matching files or all files in the matching folders are included in the datastore.

The file path can be any location supported by MATLAB datastores, including an IRI path pointing to a remote location, such as cloud storage using Amazon S3 (Simple Storage Service), Windows Azure Blob Storage, and Hadoop Distributed File System (HDFS). For more information about working with remote data in MATLAB, see Work with Remote Data (MATLAB).

Example: pwd + "\simResults"

Example: {'C:\dir\data\file1.xls','C:\dir\data\file2.xlsx'}

Example: "../dir/data/*.mat"

extension

File extension for files in the datastore, specified as a string or a character vector, such as ".mat" or '.csv'.

If the datastore contains files having more than one extension, specify them as a string vector, such as [".xls",".xlsx"]. The functions that you supply for the ReadFcn and WriteToMemberFcn properties must be able to interact with all specified file types.
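One way to meet the requirement that the read function handle every specified file type is to dispatch on the file extension. The sketch below is an assumption-laden illustration, not a toolbox function: the name readAnyFile is hypothetical, and it assumes that MAT-files store each variable separately and that text or spreadsheet files hold a single table whose column names match the requested variables.

```matlab
function data = readAnyFile(filename,variables)
% Hypothetical read function that supports several file types.
[~,~,ext] = fileparts(filename);
switch lower(char(ext))
    case '.mat'
        % Load only the requested variables into a structure
        s = load(filename,variables{:});
    case {'.csv','.xls','.xlsx'}
        % Read the whole table, then expose each column as a field
        s = table2struct(readtable(filename),'ToScalar',true);
    otherwise
        error('readAnyFile:unsupported', ...
            'Unsupported file extension "%s".',char(ext));
end

% Assemble one table row; wrap non-scalar values in a cell
data = table();
for ct = 1:numel(variables)
    val = s.(variables{ct});
    if numel(val) > 1
        val = {val};
    end
    data.(variables{ct}) = val;
end
end
```

With a function like this on the path, a datastore created with an extension list such as [".mat",".csv"] could use fensemble.ReadFcn = @readAnyFile.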

Properties

ReadFcn

Function for reading all selected variables from the ensemble, specified as a handle to a function you provide. You write a function that instructs the software how to read variables from a data file containing a member of your ensemble. The function has:

  • Two inputs: a file name (string) and the names of the variables (string vector) to load from the file

  • One output: a table row containing one table variable for each requested variable

When you specify ReadFcn, the software uses this function to read all selected variables from the ensemble, regardless of whether they are named in DataVariables, IndependentVariables, or ConditionVariables.

For example, suppose that you write the following function, readVars, for reading variables from your files. This function creates a table containing the variables in a data file that match those in the input string vector, variables.

function data = readVars(filename,variables)
data = table();
mfile = matfile(filename); % Allows partial loading
for ct=1:numel(variables)
    val = mfile.(variables{ct});
    if numel(val) > 1
        val = {val};
    end
    data.(variables{ct}) = val;
end
end

Save the function in a MATLAB file in the current folder or on the path. Then, if you create a fileEnsembleDatastore called fensemble, set ReadFcn as follows.

fensemble.ReadFcn = @readVars;

When you call read(fensemble), the software uses readVars to read all the variables in the SelectedVariables property of the ensemble datastore. You must set this property to read data from a fileEnsembleDatastore member. Otherwise, read generates an error.
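Before assigning a read function to ReadFcn, it can help to call it directly on one file and check the shape of its output. The helper below is a hypothetical sketch (checkReadFcn is not part of the toolbox); it only tests the contract described above, namely one table row containing every requested variable.

```matlab
function checkReadFcn(readFcn,filename,variables)
% Hypothetical helper: call a candidate read function on one file and
% confirm that it returns a single table row with every requested variable.
data = readFcn(filename,variables);
if ~istable(data) || height(data) ~= 1
    error('checkReadFcn:badOutput', ...
        'The read function must return a one-row table.');
end

% Compare requested names against the table variable names
requested = string(variables(:));
returned = string(data.Properties.VariableNames(:));
missing = setdiff(requested,returned);
if ~isempty(missing)
    error('checkReadFcn:missingVariables', ...
        'Missing variables: %s',char(strjoin(missing,', ')));
end
end
```

For example, checkReadFcn(@readVars,"member.mat",["Vibration";"FaultState"]) would exercise readVars on a file named member.mat, where the file and variable names are placeholders for your own data.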

WriteToMemberFcn

Function for writing data to the last-read member of the ensemble, specified as a handle to a function you provide. You write a function that instructs the software how to write variables to a data file containing a member of your ensemble. The function has:

  • Two inputs, a file name (string), and a data structure whose field names are the data variables to write, and whose values are the corresponding values

  • No outputs

For example, suppose that you write the following function, writeNewData, for writing data to your files. This function writes an input data structure to the specified data file.

function writeNewData(filename,data)
% Append each field of the structure data to the file as its own variable
save(filename, '-append', '-struct', 'data');
end

Save writeNewData in a MATLAB file in the current folder or on the path. Then, if you create a fileEnsembleDatastore called fensemble, set WriteToMemberFcn as follows.

fensemble.WriteToMemberFcn = @writeNewData;

When you call the writeToLastMemberRead command on fensemble, the software uses writeNewData to add the new data to the data file of the last-read ensemble member. You must set this property to add data to a fileEnsembleDatastore member. Otherwise, writeToLastMemberRead generates an error.
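The write function receives a structure whose field names are the variables to write. The following script uses only base MATLAB to illustrate what appending such a structure to a MAT-file does; the file name writeDemo.mat and the variables Vibration and gsMean are hypothetical.

```matlab
% Create a MAT-file that stands in for one ensemble member
filename = fullfile(tempdir,'writeDemo.mat');
Vibration = (1:5)';
save(filename,'Vibration');

% Structure of the kind passed to a write function: each field name is a
% variable to write, and each field value is the data for that variable
data = struct('gsMean',mean(Vibration));

% Append the fields as individual variables, keeping existing content
save(filename,'-append','-struct','data');

contents = load(filename);
disp(fieldnames(contents))   % lists both Vibration and gsMean
```

Because '-append' adds to the file rather than replacing it, the original measurements remain available alongside the derived value.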

DataVariables

Data variables in the ensemble, specified as a string array. Data variables are the main content of the members of an ensemble. Data variables can include measured data or derived data for analysis and development of predictive maintenance algorithms. For example, your data variables might include measured or simulated vibration signals and derived values such as mean vibration value or peak vibration frequency. In practice, your data variables, independent variables, and condition variables are all distinct sets of variables.

You can also specify DataVariables using a cell array of character vectors, such as {'Vibration';'Tacho'}, but the variable names are always stored as a string array, ["Vibration";"Tacho"]. If you specify a matrix of variable names, the matrix is flattened to a column vector.

IndependentVariables

Independent variables in the ensemble, specified as a string array. You typically use independent variables to order the members of an ensemble. Examples are timestamps, number of operating hours, or miles driven. Set this property to the names of such variables in your ensemble. In practice, your data variables, independent variables, and condition variables are all distinct sets of variables.

You can also specify IndependentVariables using a cell array of character vectors, such as {'Time';'Age'}, but the variable names are always stored as a string array, ["Time";"Age"]. If you specify a matrix of variable names, the matrix is flattened to a column vector.

ConditionVariables

Condition variables in the ensemble, specified as a string array. Use condition variables to label the members in an ensemble according to the fault condition or other operating condition under which the ensemble member was collected. In practice, your data variables, independent variables, and condition variables are all distinct sets of variables.

You can also specify ConditionVariables using a cell array of character vectors, such as {'GearFault';'Temperature'}, but the variable names are always stored as a string array, ["GearFault";"Temperature"]. If you specify a matrix of variable names, the matrix is flattened to a column vector.

SelectedVariables

Variables to read from the ensemble, specified as a string array. Use this property to specify which variables are extracted to the MATLAB workspace when you use the read command to read data from the current ensemble member. read returns a table row containing a table variable for each name specified in SelectedVariables. For example, suppose that you have an ensemble, fensemble, that contains six variables, and you want to read only two of them, Vibration and FaultState. Set the SelectedVariables property and call read:

fensemble.SelectedVariables = ["Vibration";"FaultState"];
data = read(fensemble)

SelectedVariables can be any combination of the variables in the DataVariables, ConditionVariables, and IndependentVariables properties. If SelectedVariables is empty, read generates an error.

You can specify SelectedVariables using a cell array of character vectors, such as {'Vibration';'Tacho'}, but the variable names are always stored as a string array, ["Vibration";"Tacho"]. If you specify a matrix of variable names, the matrix is flattened to a column vector.

ReadSize

Number of members to read from the ensemble datastore at once, specified as a positive integer that is smaller than the total number of members in the ensemble. By default, the read command returns a one-row table containing data from one ensemble member. To read data from multiple members in a single read operation, set this property to an integer value greater than one. For example, if ReadSize = 3, then read returns a three-row table where each row contains data from a different ensemble member. If fewer than ReadSize members are unread, then read returns a table with as many rows as there are remaining members.

The ensemble datastore property LastMemberRead contains the names of all files read during the most recent read operation. Thus, for instance, if ReadSize = 3, then a read operation sets LastMemberRead to a string vector containing three file names.

When you use writeToLastMemberRead, specify the data to write as a table with a number of rows equal to ReadSize. The writeToLastMemberRead command updates the members specified by LastMemberRead, writing one table row to each specified file.

Changing the ReadSize property also resets the ensemble to its unread state. For instance, suppose that you read some ensemble members one at a time (ReadSize = 1), and then change ReadSize to 3. The next read operation returns data from the first three ensemble members.
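The following script sketches batch reading and writing with ReadSize = 3. It is an illustration under stated assumptions: the folder readSizeDemo, the member files, the variables x and xMean, and the helpers demoRead and demoWrite are all hypothetical. It also assumes that, for the final partial batch, the table passed to writeToLastMemberRead needs one row per file listed in LastMemberRead.

```matlab
% Build five small member files in a temporary folder (hypothetical data)
folder = fullfile(tempdir,'readSizeDemo');
if ~isfolder(folder)
    mkdir(folder);
end
for k = 1:5
    x = k*ones(10,1);
    save(fullfile(folder,sprintf('member_%02d.mat',k)),'x');
end

% Configure the datastore
fensemble = fileEnsembleDatastore(folder,'.mat');
fensemble.ReadFcn = @demoRead;
fensemble.WriteToMemberFcn = @demoWrite;
fensemble.DataVariables = ["x";"xMean"];
fensemble.SelectedVariables = "x";
fensemble.ReadSize = 3;                % read up to three members per call

while hasdata(fensemble)
    data = read(fensemble);            % three rows, then the remaining two
    xMean = cellfun(@mean,data.x);     % one mean value per member (row)
    writeToLastMemberRead(fensemble,table(xMean));
end

function data = demoRead(filename,variables)
% Return one table row; wrap non-scalar values in a cell
s = load(filename);
data = table();
for ct = 1:numel(variables)
    val = s.(variables{ct});
    if numel(val) > 1
        val = {val};
    end
    data.(variables{ct}) = val;
end
end

function demoWrite(filename,data)
% Append each field of the structure to the member file
save(filename,'-append','-struct','data');
end
```

After the loop, each member file should contain an xMean variable next to x, written one table row per file.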

NumMembers

This property is read-only.

Number of members in the ensemble, specified as a positive integer.

LastMemberRead

This property is read-only.

File name of last ensemble member read into the MATLAB workspace, specified as a string. When you use the read command to read data from an ensemble datastore, the software determines which ensemble member to read next, and reads data from the corresponding file. The LastMemberRead property contains the path to the most recently read file. When the ensemble datastore has not yet been read, or has been reset, LastMemberRead is an empty string.

When you call writeToLastMemberRead to add data back to the ensemble datastore, that function writes to the file specified in LastMemberRead.

By default, read reads data from one ensemble member at a time (the ReadSize property of the ensemble datastore is 1). When ReadSize > 1, LastMemberRead is a string array containing the paths to all files read in the most recent read operation.

Files

This property is read-only.

List of files in the ensemble datastore, specified as a column string vector of length NumMembers. Each entry contains the full path to a file in the datastore. The files are in the order in which the read command reads ensemble members.

Example: ["C:\Data\Data_01.csv"; "C:\Data\Data_02.csv"; "C:\Data\Data_03.csv"]

Object Functions

The read and writeToLastMemberRead functions are specialized for Predictive Maintenance Toolbox ensemble data. Other functions, such as reset and hasdata, are identical to those used with datastore objects in MATLAB. To partition an ensemble datastore, use the partition(ds,n,index) syntax of the partition function.

read                   Read member data from an ensemble datastore
writeToLastMemberRead  Write data to member of an ensemble datastore
reset                  Reset datastore to initial state
hasdata                Determine if data is available to read
progress               Determine how much data has been read
numpartitions          Number of datastore partitions
partition              Partition a datastore
tall                   Create tall array

Examples

Create a file ensemble datastore for data stored in MATLAB® files, and configure it with functions that tell the software how to read from and write to the datastore.

For this example, you have two data files containing healthy operating data from a bearing system, baseline_01.mat and baseline_02.mat. You also have three data files containing faulty data from the same system, FaultData_01.mat, FaultData_02.mat, and FaultData_03.mat. In practice, you might have many more data files. Because of the volume of data, the unzip operation takes several minutes.

unzip fileEnsData.zip  % extract compressed files
location = pwd;
extension = '.mat';
fensemble = fileEnsembleDatastore(location,extension);

Before you can interact with data in the ensemble, you must create functions that tell the software how to process the data files to read variables into the MATLAB workspace and to write data back to the files. Save these functions to a location on the MATLAB path. For this example, use the following supplied functions:

  • readBearingData — Extract requested variables from a structure, bearing, and other variables stored in the file. This function also parses the filename for the fault status of the data. The function returns a table row containing one table variable for each requested variable.

  • writeBearingData — Take a structure and write its variables to a data file as individual stored variables.

addpath(fullfile(matlabroot,'examples','predmaint','main')) % Make sure functions are on path

fensemble.ReadFcn = @readBearingData;
fensemble.WriteToMemberFcn = @writeBearingData;

Finally, set properties of the ensemble to identify data variables, condition variables, and selected variables for reading. For this example, the variables in the data file are gs, sr, load, and rate. Suppose that you only need to read the fault label, gs, and sr. Set these variables as the selected variables.

fensemble.DataVariables = ["gs";"sr";"load";"rate"];
fensemble.ConditionVariables = ["label"];
fensemble.SelectedVariables = ["label";"gs";"sr"];

Examine the ensemble. The functions and the variable names are assigned to the appropriate properties.

fensemble
fensemble = 
  fileEnsembleDatastore with properties:

                 ReadFcn: @readBearingData
        WriteToMemberFcn: @writeBearingData
           DataVariables: [4x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: "label"
       SelectedVariables: [3x1 string]
                ReadSize: 1
              NumMembers: 5
          LastMemberRead: [0x0 string]
                   Files: [5x1 string]

The functions that you assigned tell the read and writeToLastMemberRead commands how to interact with the data files that make up the ensemble. For example, when you call the read command, it uses readBearingData to read all the variables in fensemble.SelectedVariables. For a more detailed example, see File Ensemble Datastore With Measured Data.

rmpath(fullfile(matlabroot,'examples','predmaint','main')) % Reset path

Create a file ensemble datastore for data stored in MATLAB files, and configure it with functions that tell the software how to read from and write to the datastore. (For more details about configuring file ensemble datastores, see File Ensemble Datastore With Measured Data.) Because of the volume of data, the unzip operation takes a few minutes.

% Create ensemble datastore that points to datafiles in current folder
unzip fileEnsData.zip  % extract compressed files
location = pwd;
extension = '.mat';
fensemble = fileEnsembleDatastore(location,extension);

% Specify data and condition variables
fensemble.DataVariables = ["gs";"sr";"load";"rate"];
fensemble.ConditionVariables = "label";

% Configure with functions for reading and writing variable data
addpath(fullfile(matlabroot,'examples','predmaint','main')) % Make sure functions are on path
fensemble.ReadFcn = @readBearingData;
fensemble.WriteToMemberFcn = @writeBearingData; 

The functions tell the read and writeToLastMemberRead commands how to interact with the data files that make up the ensemble. Thus, when you call the read command, it uses readBearingData to read all the variables in fensemble.SelectedVariables. For this example, readBearingData extracts requested variables from a structure, bearing, and other variables stored in the file. It also parses the filename for the fault status of the data.

Specify variables to read, and read them from the first member of the ensemble.

fensemble.SelectedVariables = ["gs";"load";"label"];
data = read(fensemble)
data=1×3 table
     label             gs            load
    ________    _________________    ____

    "Faulty"    [146484x1 double]     0  

You can now process the data from the member as needed. For this example, compute the average value of the signal stored in the variable gs. Extract the data from the table returned by read.

gsdata = data.gs{1};
gsmean = mean(gsdata);

You can write the mean value gsmean back to the data file as a new variable. To do so, first expand the list of data variables in the ensemble to include a variable for the new value. Call the new variable gsMean.

fensemble.DataVariables = [fensemble.DataVariables;"gsMean"]
fensemble = 
  fileEnsembleDatastore with properties:

                 ReadFcn: @readBearingData
        WriteToMemberFcn: @writeBearingData
           DataVariables: [5x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: "label"
       SelectedVariables: [3x1 string]
                ReadSize: 1
              NumMembers: 5
          LastMemberRead: "/tmp/Bdoc19a_1099451_63947/tp955f982d/predmaint-ex34165887/FaultData_01.mat"
                   Files: [5x1 string]

Next, write the derived mean value to the file corresponding to the last-read ensemble member. (See Data Ensembles for Condition Monitoring and Predictive Maintenance.) When you call writeToLastMemberRead, it converts the data to a structure and calls fensemble.WriteToMemberFcn to write the data to the file.

writeToLastMemberRead(fensemble,'gsMean',gsmean);

Calling read again advances the last-read-member indicator to the next file in the ensemble and reads the data from that file.

data = read(fensemble)
data=1×3 table
     label             gs            load
    ________    _________________    ____

    "Faulty"    [146484x1 double]     50 

You can confirm that this data is from a different member by examining the load variable in the table. Here, its value is 50, while in the previously read member, it was 0.

You can repeat the processing steps to compute and append the mean for this ensemble member. In practice, it is more useful to automate the process of reading, processing, and writing data. To do so, reset the ensemble to a state in which no data has been read. Then loop through the ensemble and perform the read, process, and write steps for each member.

reset(fensemble)
while hasdata(fensemble)
    data = read(fensemble);
    gsdata = data.gs{1};
    gsmean = mean(gsdata);
    writeToLastMemberRead(fensemble,'gsMean',gsmean);
end

The hasdata command returns false when every member of the ensemble has been read. Now, each data file in the ensemble includes the gsMean variable derived from the data gs in that file. You can use techniques like this loop to extract and process data from your ensemble files as you develop a predictive-maintenance algorithm. For an example illustrating in more detail the use of a file ensemble datastore in the algorithm-development process, see Rolling Element Bearing Fault Diagnosis. The example also shows how to use Parallel Computing Toolbox™ to speed up the processing of large data ensembles.

To confirm that the derived variable is present in the file ensemble datastore, read it from the first and second ensemble members. To do so, reset the ensemble again, and add the new variable to the selected variables. In practice, after you have computed derived values, it can be useful to read only those values without rereading the unprocessed data, which can take significant space in memory. For this example, read selected variables that include the new variable, gsMean, but do not include the unprocessed data, gs.

reset(fensemble)
fensemble.SelectedVariables = ["label";"load";"gsMean"];
data1 = read(fensemble)
data1=1×3 table
     label      load    gsMean
    ________    ____    ______

    "Faulty"     0      -0.23 

data2 = read(fensemble)
data2=1×3 table
     label      load     gsMean 
    ________    ____    ________

    "Faulty"     50     -0.22352

rmpath(fullfile(matlabroot,'examples','predmaint','main')) % Reset path

Compatibility Considerations

Not recommended starting in R2018b

Introduced in R2018a