Fault Detection and Remaining Useful Life Estimation Using Categorical Data

Machine data collected by sensors during run-to-failure experiments often includes information such as the manufacturer's code, the machine's location, or the experience level of the people operating the machine. This information can improve the accuracy of predicting which machines are likely to fail. Such attributes are represented as categorical variables and can serve as predictors alongside measured sensor data to help identify which machines will need maintenance. This example shows how to perform fault classification and remaining useful life estimation using categorical variables, such as team and machine provider, as features in this data set.

Data Set

The data set [1] contains sensor records of 999 machines made by four different providers, with slight variation among their models. The machines were operated by three different teams over a certain period. Note that this is a simulated data set. In total, there are seven variables per machine:

  1. Lifetime (Numeric): Number of weeks the machine has been active.

  2. Broken (Boolean): Machine status.

  3. PressureInd (Numeric): Pressure index. A sudden drop can indicate a leak.

  4. MoistureInd (Numeric): Moisture index (relative humidity). Excessive humidity can create mold and damage the equipment.

  5. TemperatureInd (Numeric): Temperature index.

  6. Team (Categorical): Team using the machine, represented as a string.

  7. Provider (Categorical): Machine manufacturer name, represented as a string.

The strings in the Team and Provider data represent categorical variables that contain non-numeric data. In general, categorical variables have the following forms:

  • String or char data types: Often used for nominal categorical variables, where the value does not have any ranking or order

  • Integer or enumerated data types: Often used for ordinal categorical variables, where the values have a natural order or ranking

  • Boolean data type: Can take only two values, true or false

In addition, MATLAB® provides a special data type, categorical, that can be used for computations that are designed specifically for categorical data. The categorical command converts an array of values to a categorical array.
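For instance, a nominal categorical array can be created directly from strings, while an ordinal one is created by passing an ordered category list with the 'Ordinal' flag. This brief sketch uses made-up values, not the example's data set:

```matlab
% Nominal categorical array: the categories have no inherent order.
teams = categorical(["TeamA" "TeamB" "TeamA" "TeamC"]);
categories(teams)    % TeamA, TeamB, TeamC

% Ordinal categorical array: small < medium < large.
sizes = categorical(["small" "large" "medium"], ...
    ["small" "medium" "large"],'Ordinal',true);
sizes(1) < sizes(2)    % true, because small < large
```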

Load the data.
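The loading step depends on how the data set is stored. Assuming the records are in a CSV file (the file name below is a placeholder, not the actual file from the example), loading might look like:

```matlab
% Hypothetical file name -- substitute the actual location of the data set.
% 'TextType','string' imports the team and provider columns as strings.
simulatedData = readtable("maintenance_data.csv",'TextType',"string");
head(simulatedData)
```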


Plot histograms of the variables in the data set to check how they are distributed. Histograms help you understand the distribution of values and identify outliers or unusual patterns in the data set. They show that the data in pressureInd, moistureInd, and temperatureInd is normally distributed, while both categorical variables, team and provider, are well balanced.

figure; tiledlayout(1,3)
nexttile; histogram(simulatedData.pressureInd); title('Pressure Index');
nexttile; histogram(simulatedData.moistureInd); title('Moisture Index');
nexttile; histogram(simulatedData.temperatureInd); title('Temperature Index');

Figure contains 3 axes objects. Axes object 1 with title Pressure Index contains an object of type histogram. Axes object 2 with title Moisture Index contains an object of type histogram. Axes object 3 with title Temperature Index contains an object of type histogram.

Create histograms of the categorical variables.

figure; tiledlayout(1,2)
nexttile; histogram(categorical(simulatedData.team)); title('Team Name');
nexttile; histogram(categorical(simulatedData.provider)); title('Machine Manufacturer');

Figure contains 2 axes objects. Axes object 1 with title Team Name contains an object of type categoricalhistogram. Axes object 2 with title Machine Manufacturer contains an object of type categoricalhistogram.

The next step is to convert the categorical variables into a format where they can be used by the machine learning model.

Prepare Categorical Variables

To use categorical variables as predictors in machine learning models, convert them to numeric representations. The categorical variables in this data set have a data type of string, so first convert them to categorical arrays using the categorical command.

Once you have converted the strings to categorical arrays, you can convert the arrays into a set of binary variables. The software uses a one-hot encoding technique to perform the conversion, with one variable for each category. This format allows the model to treat each category as a separate input. For more information about categorical variables and operations that can be performed on them, see Dummy Variables.

Use the dummyvar function to convert the values in the team and provider variables to numerical representation via one-hot encoding. Add the encoded variables to the rest of the variables in a table.

opTeam = categorical(simulatedData.team);
opTeamEncoded = dummyvar(opTeam);
operatingTeam = array2table(opTeamEncoded,'VariableNames',categories(opTeam));

providers = categorical(simulatedData.provider);
providersEncoded = dummyvar(providers);
providerNames = array2table(providersEncoded,'VariableNames',categories(providers));

dataTable = [simulatedData(:,{'lifetime','broken','pressureInd','moistureInd','temperatureInd'}), operatingTeam, providerNames];
head(dataTable)
    lifetime    broken    pressureInd    moistureInd    temperatureInd    TeamA    TeamB    TeamC    Provider1    Provider2    Provider3    Provider4
    ________    ______    ___________    ___________    ______________    _____    _____    _____    _________    _________    _________    _________

       56         0         92.179         104.23           96.517          1        0        0          0            0            0            1    
       81         1         72.076         103.07           87.271          0        0        1          0            0            0            1    
       60         0         96.272         77.801            112.2          1        0        0          1            0            0            0    
       86         1         94.406         108.49           72.025          0        0        1          0            1            0            0    
       34         0         97.753         99.413           103.76          0        1        0          1            0            0            0    
       30         0         87.679         115.71           89.792          1        0        0          1            0            0            0    
       68         0         94.614         85.702           142.83          0        1        0          0            1            0            0    
       65         1         96.483         93.047           98.316          0        1        0          0            0            1            0    

Partition Data Set into Training and Testing Sets

Partitioning the data set into subsets is essential in machine learning to evaluate a model and prevent overfitting. Partitioning can be accomplished through methods such as holdout or k-fold cross-validation. Use a 20% holdout in the cvpartition function to divide the data set into separate training and testing sets. In practice, the choice of holdout proportion can vary: a common practice is to use around 70-80% of the data for training and the remaining 20-30% for testing. These percentages can be adjusted based on the specific characteristics of the data set and problem domain.

Alternatively, you can use 'KFold' instead of 'Holdout' in cvpartition. Holdout is used here so that a testing data set, unseen by the model, is kept for later evaluation.
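For reference, a k-fold partition of the same data could be created as follows (a sketch only, not used in this example):

```matlab
% Alternative sketch: 5-fold cross-validation partition of dataTable.
cvp = cvpartition(size(dataTable,1),'KFold',5);
trainIdxFold1 = training(cvp,1);    % logical index of fold-1 training rows
testIdxFold1 = test(cvp,1);         % logical index of fold-1 test rows
```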

After partitioning, separate out the predictors and response columns from both the training and testing sets.

rng('default')    % For reproducibility

partition = cvpartition(size(dataTable,1),'Holdout',0.20); 
trainIndices = training(partition); 
testIndices = test(partition); 
TrainData = dataTable(trainIndices,:);    % Training set
TestData = dataTable(testIndices,:);    % Testing set
Xtrain = TrainData(:,~strcmpi(TrainData.Properties.VariableNames,'broken'));    % Predictor data from training set
Ytrain = TrainData(:,'broken');    % Response data from training set
Xtest = TestData(:,~strcmpi(TestData.Properties.VariableNames,'broken'));    % Predictor data from testing set
Ytest = TestData(:,'broken');    % Response data from testing set

Train Model

To choose a machine learning model, there are several options, such as fitctree, fitcsvm, and fitcknn. In this example, the fitctree function is used to create a binary classification tree from the training data in Xtrain and corresponding responses in Ytrain. This model is chosen because of its efficiency and interpretability.

treeMdl = fitctree(Xtrain,Ytrain); 

Typically, to better assess the performance and generalization ability of a model on unseen data, you can apply cross-validation. In cross-validation, the data is partitioned into subsets; the model is trained on all but one subset and evaluated on the held-out subset. This process is repeated so that each subset serves as the validation set once, yielding reliable performance estimates.

Create a partitioned model partitionedModel. It is common to compute the 5-fold cross-validation misclassification error to strike a balance between variance reduction and computational efficiency. By default, crossval ensures that the class proportions in each fold remain approximately the same as the class proportions in the response variable Ytrain.

partitionedModel = crossval(treeMdl,'KFold',5); 
validationAccuracy = 1-kfoldLoss(partitionedModel) 
validationAccuracy = 0.9675


The loss function evaluates the performance of the decision tree model. It quantifies the discrepancy between the predicted outputs of the model and the true target values. mdlError represents the misclassification error on the testing set; subtracting it from 1 gives the accuracy.

The goal is to minimize the error, indicating better model performance.

mdlError = loss(treeMdl,Xtest,Ytest) 
mdlError = 0.0348
testAccuracyWithCategoricalVars = 1-mdlError
testAccuracyWithCategoricalVars = 0.9652

Importance of Categorical Variables

To understand the difference in the performance of the classification model with and without the categorical variables, repeat the above steps to train another classification decision tree model without using categorical variables as features. Compare the accuracies of both models:

Xtrain_ = TrainData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});    %No categorical variables used
Ytrain_ = TrainData(:,{'broken'}); 
Xtest_ = TestData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});    %No categorical variables used
Ytest_ = TestData(:,{'broken'}); 
treeMdl_NoCatVars = fitctree(Xtrain_,Ytrain_);    %Training

partitionedModel_NoCategorical = crossval(treeMdl_NoCatVars,'KFold',5);    %Validation
validationAccuracy_NoCategorical = 1-kfoldLoss(partitionedModel_NoCategorical)    %Validation
validationAccuracy_NoCategorical = 0.9238
testAccuracyWithoutCategoricalVars = 1-loss(treeMdl_NoCatVars,Xtest_,Ytest_)    %Testing
testAccuracyWithoutCategoricalVars = 0.9312

The test accuracy dropped from 96.5% to about 93% when the categorical variables were ignored. This suggests that, in this scenario, including the categorical variables contributed to better performance.
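One way to probe this further (an optional step, not part of the original workflow) is to inspect the predictor importance estimates of the fitted tree. The dummy-variable columns created from team and provider appear as individual predictors:

```matlab
% Sketch: estimate and plot predictor importance for the tree trained with
% the one-hot encoded categorical predictors. Assumes treeMdl and Xtrain
% from the earlier steps exist in the workspace.
imp = predictorImportance(treeMdl);
figure
bar(imp)
xticks(1:numel(imp))
xticklabels(Xtrain.Properties.VariableNames)
ylabel('Importance estimate')
title('Predictor Importance')
```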

Fit Covariate Survival Model to Data

In this section, fit a covariate survival model to the data set to predict the remaining useful life (RUL) of a machine. Covariate survival models are useful when the only data are the failure times and associated covariates for an ensemble of similar components, such as multiple machines manufactured to the same specifications. Covariates are environmental or explanatory variables, such as the component manufacturer or operating conditions. Assuming that the broken status of a machine indicates end of life, a covariateSurvivalModel estimates the remaining useful life (RUL) of a component using a proportional hazard survival model. Note that for this case, the non-numeric data related to team and provider names can be used directly without performing additional encoding. The model encodes them using the specified option, one-hot encoding in this case.

clearvars -except simulatedData

mdl = covariateSurvivalModel('LifeTimeVariable',"lifetime",'LifeTimeUnit',"days", ...
    'DataVariables',["pressureInd","moistureInd","temperatureInd","team","provider"], ...
    'EncodedVariables',["team","provider"],'CensorVariable',"broken");

mdl.EncodingMethod = 'binary';

Split simulatedData into fitting data and test data. Define test data as rows 4 and 5 in the simulatedData table.

Ctrain = simulatedData;
Ctrain(4:5,:) = [];

Ctest = simulatedData(4:5, ~strcmpi(simulatedData.Properties.VariableNames, 'broken'))
Ctest=2×6 table
    lifetime    pressureInd    moistureInd    temperatureInd      team         provider   
    ________    ___________    ___________    ______________    _________    _____________

       86         94.406         108.49           72.025        {'TeamC'}    {'Provider2'}
       34         97.753         99.413           103.76        {'TeamB'}    {'Provider1'}

Fit the covariate survival model with the training data.

fit(mdl, Ctrain)    
Successful convergence: Norm of gradient less than OPTIONS.TolFun

Once the model is fit, verify it against the test data. In the original data set, the machine in row 4 is broken and the machine in row 5 is not broken.

predictRUL(mdl, Ctest(1,:))
ans = duration
   -44.405 days

predictRUL(mdl, Ctest(2,:))
ans = duration
   10.997 days

The output of the predictRUL function is in days for this example, indicating the estimated remaining useful life of the machines. A positive value indicates the estimated number of days until failure, and a negative value indicates that the machine is past its estimated end of life. Therefore, the model estimates the RUL successfully for both test data points. Note that the data set used in this example is not very large; training on a larger data set would make the resulting model more robust and improve prediction accuracy.


[1] Dataset created by

See Also


Related Topics