
Fault Detection and Remaining Useful Life Estimation Using Categorical Data

Machine data collected from various sensors during run-to-failure operation often includes information such as the manufacturer's code, the location of the machine, or the experience level of the people operating it. This information can improve the accuracy of predicting which machines are faulty. Such variables are represented as categorical variables and can serve as predictors, alongside the measured sensor data, to help identify which machines will need maintenance. This example shows how to perform fault classification and remaining useful life estimation using categorical variables, such as the team and machine provider in this data set, as features.

Data Set

The data set [1] contains sensor records of 999 machines made by four different providers, with slight variation among their models. The machines were operated by three different teams over a certain period. Note that this is a simulated data set. In total, there are seven variables per machine:

  1. Lifetime (Numeric): Number of weeks the machine has been active.

  2. Broken (Boolean): Machine status.

  3. PressureInd (Numeric): Pressure index. A sudden drop can indicate a leak.

  4. MoistureInd (Numeric): Moisture index (relative humidity). Excessive humidity can create mold and damage the equipment.

  5. TemperatureInd (Numeric): Temperature index.

  6. Team (Categorical): Team using the machine, represented as a string.

  7. Provider (Categorical): Machine manufacturer name, represented as a string.

The strings in the Team and Provider data represent categorical variables that contain non-numeric data. In general, categorical variables have the following forms:

  • String or char data types: Often used for nominal categorical variables, where the value does not have any ranking or order

  • Integer or enumerated data types: Often used for ordinal categorical variables, where the values have a natural order or ranking

  • Boolean data type: Can take only two values, true or false

In addition, MATLAB® provides a special data type, categorical, that can be used for computations that are designed specifically for categorical data. The categorical command converts an array of values to a categorical array.
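For example, the following minimal sketch (with hypothetical values) converts a string array to a categorical array and inspects its categories:

teams = ["TeamA";"TeamC";"TeamA";"TeamB"];    %Hypothetical string data
teamsCat = categorical(teams);    %Convert to a categorical array
categories(teamsCat)    %List the unique categories
summary(teamsCat)    %Count the observations in each category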

Load the data.

load('simulatedData.mat');

Plot histograms of the variables in the data set to check how the variables are distributed. The histograms help you understand the distribution of values and identify outliers or unusual patterns in the data set. They show that the data in pressureInd, moistureInd, and temperatureInd is normally distributed, while both categorical variables, team and provider, are well balanced.

figure; tiledlayout(1,3)
nexttile; histogram(simulatedData.pressureInd); title('Pressure Index');
nexttile; histogram(simulatedData.moistureInd); title('Moisture Index');
nexttile; histogram(simulatedData.temperatureInd); title('Temperature Index');

[Figure: three histograms titled Pressure Index, Moisture Index, and Temperature Index]

Create histograms of the categorical variables.

figure; tiledlayout(1,2)
nexttile; histogram(categorical(simulatedData.team)); title('Team Name');  
nexttile; histogram(categorical(simulatedData.provider)); title('Machine Manufacturer');

[Figure: two categorical histograms titled Team Name and Machine Manufacturer]

The next step is to convert the categorical variables into a format where they can be used by the machine learning model.

Prepare Categorical Variables

To use categorical variables as predictors in machine learning models, convert them to numeric representations. The categorical variables in this data set have a string data type, so first convert them to categorical arrays using the categorical command.

Once you have converted the strings to categorical arrays, you can convert the arrays into a set of binary variables. The software uses a one-hot encoding technique to perform the conversion, with one variable for each category. This format allows the model to treat each category as a separate input. For more information about categorical variables and operations that can be performed on them, see Dummy Variables.
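As a minimal illustration with hypothetical values, one-hot encoding turns each category into its own binary column:

c = categorical(["A";"B";"A";"C"]);    %Hypothetical categorical data
dummyvar(c)    %Each row has a 1 in the column of its category
%Returns, in order of the categories A, B, C:
%     1     0     0
%     0     1     0
%     1     0     0
%     0     0     1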

Use the dummyvar function to convert the values in the team and provider variables to a numeric representation via one-hot encoding. Then add the encoded variables to the rest of the variables in a table.

opTeam = categorical(simulatedData.team);
opTeamEncoded = dummyvar(opTeam);
operatingTeam = array2table(opTeamEncoded,'VariableNames',categories(opTeam));

providers = categorical(simulatedData.provider);
providersEncoded = dummyvar(providers);
providerNames = array2table(providersEncoded,'VariableNames',categories(providers));

dataTable = [simulatedData(:,{'lifetime','broken','pressureInd','moistureInd','temperatureInd'}), operatingTeam, providerNames];
head(dataTable)
    lifetime    broken    pressureInd    moistureInd    temperatureInd    TeamA    TeamB    TeamC    Provider1    Provider2    Provider3    Provider4
    ________    ______    ___________    ___________    ______________    _____    _____    _____    _________    _________    _________    _________

       56         0         92.179         104.23           96.517          1        0        0          0            0            0            1    
       81         1         72.076         103.07           87.271          0        0        1          0            0            0            1    
       60         0         96.272         77.801            112.2          1        0        0          1            0            0            0    
       86         1         94.406         108.49           72.025          0        0        1          0            1            0            0    
       34         0         97.753         99.413           103.76          0        1        0          1            0            0            0    
       30         0         87.679         115.71           89.792          1        0        0          1            0            0            0    
       68         0         94.614         85.702           142.83          0        1        0          0            1            0            0    
       65         1         96.483         93.047           98.316          0        1        0          0            0            1            0    

Partition Data Set into Training and Testing Sets

Partitioning the data set into subsets is essential in machine learning to evaluate a model fairly and to prevent overfitting. Partitioning can be accomplished through methods such as holdout or k-fold cross-validation. Use a 20% holdout in the cvpartition function to divide the data set into separate training and testing sets. In practice, the holdout proportion can vary: a common practice is to use around 70-80% of the data for training and the remaining 20-30% for testing, and you can adjust these percentages based on the characteristics of the data set and the problem domain.

Alternatively, you can use k-fold cross-validation in cvpartition instead of holdout. This example uses holdout so that a testing set, which the model never sees during training, is kept for later evaluation.
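For reference, a minimal sketch of what the k-fold alternative would look like with cvpartition (this example does not use it):

cvKFold = cvpartition(size(dataTable,1),'KFold',5);    %5-fold partition (not used below)
%training(cvKFold,k) and test(cvKFold,k) return logical indices for fold k = 1,...,5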

After partitioning, separate out the predictors and response columns from both the training and testing sets.

rng('default')    %For reproducibility

partition = cvpartition(size(dataTable,1),'Holdout',0.20); 
trainIndices = training(partition); 
testIndices =  test(partition); 
 
TrainData = dataTable(trainIndices,:);    %Training set
TestData = dataTable(testIndices,:);    %Testing set
 
Xtrain = TrainData(:,~strcmpi(TrainData.Properties.VariableNames, 'broken'));    %Predictor data from training set
Ytrain = TrainData(:,'broken');    %Response data from training set
 
Xtest = TestData(:,~strcmpi(TestData.Properties.VariableNames, 'broken'));    %Predictor data from testing set
Ytest = TestData(:,'broken');    %Response data from testing set

Train Model

Several functions are available for training a classification model, such as fitctree, fitcsvm, and fitcknn. This example uses the fitctree function to create a binary classification tree from the training predictors in Xtrain and the corresponding responses in Ytrain. This model is chosen for its efficiency and interpretability.

treeMdl = fitctree(Xtrain,Ytrain); 
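For comparison, you could train one of the alternative models mentioned above in the same way. The following is a minimal sketch using fitcknn; it is not used in the rest of this example:

knnMdl = fitcknn(Xtrain,Ytrain);    %k-nearest neighbor alternative
knnError = loss(knnMdl,Xtest,Ytest)    %Misclassification rate on the testing set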

Typically, to better assess the performance and generalization ability of a model on unseen data, you can apply cross-validation. In cross-validation, the data is partitioned into subsets; the model is trained on all but one subset, and its performance is evaluated on the held-out subset. This process is repeated multiple times to obtain a reliable performance estimate.

Create a cross-validated model, partitionedModel. Computing the 5-fold cross-validation misclassification error is a common choice that strikes a balance between variance reduction and computational efficiency. By default, crossval ensures that the class proportions in each fold remain approximately the same as the class proportions in the response variable Ytrain.

partitionedModel = crossval(treeMdl,'KFold',5); 
validationAccuracy = 1-kfoldLoss(partitionedModel) 
validationAccuracy = 0.9675

Testing

The loss function evaluates the performance of the decision tree model by quantifying the discrepancy between the model's predicted outputs and the true target values in the testing data. Here, mdlError is the misclassification rate on the testing set; subtracting it from 1 gives the accuracy. The goal is to minimize this error, which indicates better model performance.

mdlError = loss(treeMdl,Xtest,Ytest) 
mdlError = 0.0348
testAccuracyWithCategoricalVars = 1-mdlError
testAccuracyWithCategoricalVars = 0.9652
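Beyond a single accuracy number, you can examine the per-class results with a confusion chart. A minimal sketch:

predictedBroken = predict(treeMdl,Xtest);    %Predicted labels for the testing set
figure; confusionchart(Ytest.broken,predictedBroken);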

Importance of Categorical Variables

To understand how the classification model performs with and without the categorical variables, repeat the above steps to train another classification decision tree model, this time without using the categorical variables as features. Then compare the accuracies of the two models:

Xtrain_ = TrainData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});    %No categorical variables used
Ytrain_ = TrainData(:,{'broken'}); 
 
Xtest_ = TestData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});    %No categorical variables used
Ytest_ = TestData(:,{'broken'}); 
 
treeMdl_NoCatVars = fitctree(Xtrain_,Ytrain_);    %Training

partitionedModel_NoCategorical = crossval(treeMdl_NoCatVars,'KFold',5);    %Validation
validationAccuracy_NoCategorical = 1-kfoldLoss(partitionedModel_NoCategorical)    %Validation
validationAccuracy_NoCategorical = 0.9238
testAccuracyWithoutCategoricalVars = 1-loss(treeMdl_NoCatVars,Xtest_,Ytest_)    %Testing
testAccuracyWithoutCategoricalVars = 0.9312

The test accuracy drops from about 96.5% to about 93.1% when the categorical variables are ignored. This suggests that, in this scenario, including the categorical variables contributes to higher accuracy and better performance.
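One way to probe which predictors drive this difference, including the encoded categorical columns, is to compute the split-based predictor importance of the trained tree. A minimal sketch:

imp = predictorImportance(treeMdl);    %Importance estimate for each predictor
figure; bar(imp);
set(gca,'XTick',1:numel(imp),'XTickLabel',treeMdl.PredictorNames);
xtickangle(45); ylabel('Importance'); title('Predictor Importance Estimates');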

Fit Covariate Survival Model to Data

In this section, fit a covariate survival model to the data set to predict the remaining useful life (RUL) of a machine. Covariate survival models are useful when the only available data are the failure times and associated covariates for an ensemble of similar components, such as multiple machines manufactured to the same specifications. Covariates are environmental or explanatory variables, such as the component manufacturer or the operating conditions. Assuming that the broken status of a machine indicates its end of life, a covariateSurvivalModel estimates the remaining useful life of a component using a proportional hazard survival model. Note that in this case you can use the non-numeric team and provider data directly, without performing additional encoding; the model encodes these variables itself, using the encoding method that you specify ('binary' in this case).

clearvars -except simulatedData

mdl = covariateSurvivalModel('LifeTimeVariable',"lifetime", 'LifeTimeUnit',"days", ...
    'DataVariables',["pressureInd","moistureInd","temperatureInd", "team", "provider"], ...
    'EncodedVariables', ["team", "provider"], 'CensorVariable', "broken");

mdl.EncodingMethod = 'binary';

Split simulatedData into fitting data and test data. Define test data as rows 4 and 5 in the simulatedData table.

Ctrain = simulatedData;
Ctrain(4:5,:) = [];

Ctest = simulatedData(4:5, ~strcmpi(simulatedData.Properties.VariableNames, 'broken'))
Ctest=2×6 table
    lifetime    pressureInd    moistureInd    temperatureInd      team         provider   
    ________    ___________    ___________    ______________    _________    _____________

       86         94.406         108.49           72.025        {'TeamC'}    {'Provider2'}
       34         97.753         99.413           103.76        {'TeamB'}    {'Provider1'}

Fit the covariate survival model with the training data.

fit(mdl, Ctrain)    
Successful convergence: Norm of gradient less than OPTIONS.TolFun

Once the model is fit, verify it against the test data. In the original data set, the response for row 4 is broken (1), and the response for row 5 is not broken (0).

predictRUL(mdl, Ctest(1,:))
ans = duration
   -44.405 days

predictRUL(mdl, Ctest(2,:))
ans = duration
   10.997 days

The output of the predictRUL function is in days for this example, indicating the estimated remaining useful life of the machine. A positive value indicates the estimated number of days until failure, while a negative value indicates that the machine is past its estimated end-of-life time. Therefore, the model successfully estimates the RUL for both test data points. Note that the data set used in this example is not very large; training with a larger data set would make the resulting model more robust and improve prediction accuracy.
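predictRUL can also return additional outputs, such as a confidence interval on the estimate. A minimal sketch:

[estRUL,ciRUL] = predictRUL(mdl,Ctest(2,:));    %RUL estimate and its confidence interval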

References

[1] Data set created by http://walkerrowe.com/
