# Fault Detection and Remaining Useful Life Estimation Using Categorical Data

Machine data collected from various sensors during a machine's run to failure often includes information such as the manufacturer's code, the location of the machine, or the experience level of the people handling it. This information can be used to improve the accuracy of predicting faulty machines. These variables are represented as *categorical* variables and can be used as predictors along with other measured sensor data to help identify which machines will need maintenance. This example illustrates how to perform fault classification and remaining useful life estimation using categorical variables, such as `team` and machine `provider`, as features in this data set.

### Data Set

The data set [1] contains sensor records of 999 machines made by four different providers with slight variation among their models. The sensors were used by three different teams over a certain period. Note that this is a simulated data set. In total, there are seven variables per machine:

- `Lifetime` (Numeric): Number of weeks the machine has been active
- `Broken` (Boolean): Machine status
- `PressureInd` (Numeric): Pressure index. A sudden drop can indicate a leak.
- `MoistureInd` (Numeric): Moisture index (relative humidity). Excessive humidity can create mold and damage the equipment.
- `TemperatureInd` (Numeric): Temperature index
- `Team` (Categorical): Team using the machine, represented as a string
- `Provider` (Categorical): Machine manufacturer name, represented as a string

The strings in the `Team` and `Provider` data represent categorical variables that contain non-numeric data. In general, categorical variables take the following forms:

- String or char data types: Often used for nominal categorical variables, where the values do not have any ranking or order
- Integer or enumerated data types: Often used for ordinal categorical variables, where the values have a natural order or ranking
- Boolean data type: Can take only two values, true or false

In addition, MATLAB® provides a special data type, `categorical`, that you can use in computations designed specifically for categorical data. The `categorical` function converts an array of values to a `categorical` array.
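For example, a short sketch of this conversion (the values here are illustrative, not taken from the data set):

```matlab
% Convert a string array of team names to a categorical array
teams = ["TeamA"; "TeamC"; "TeamA"; "TeamB"];   % illustrative values
teamsCat = categorical(teams);
categories(teamsCat)   % lists the category names: TeamA, TeamB, TeamC
summary(teamsCat)      % shows the count of elements in each category
```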

Load the data.

`load('simulatedData.mat');`

Plot histograms of the variables in the data set to check how the variables are distributed. The histograms help you understand the distribution of values and identify outliers or unusual patterns in the data set. They show that the data in `pressureInd`, `moistureInd`, and `temperatureInd` is normally distributed, while both of the categorical variables, `team` and `provider`, are well balanced.

```
figure;
tiledlayout(1,3)
nexttile;
histogram(simulatedData.pressureInd);
title('Pressure Index');
nexttile;
histogram(simulatedData.moistureInd);
title('Moisture Index');
nexttile;
histogram(simulatedData.temperatureInd);
title('Temperature Index');
```

Create histograms of the categorical variables.

```
figure;
tiledlayout(1,2)
nexttile;
histogram(categorical(simulatedData.team));
title('Team Name');
nexttile;
histogram(categorical(simulatedData.provider));
title('Machine Manufacturer');
```

The next step is to convert the categorical variables into a format where they can be used by the machine learning model.

### Prepare Categorical Variables

To use categorical variables as predictors in machine learning models, convert them to numeric representations. The categorical variables in this data set have a data type of `string`. First convert the strings to `categorical` arrays, and then convert those arrays into a set of binary variables using one-hot encoding, with one variable for each category. This format allows the model to treat each category as a separate input. For more information about categorical variables and the operations that can be performed on them, see Dummy Variables.

Use the `dummyvar` function to convert the values in the `team` and `provider` variables to a numeric representation via one-hot encoding. Then add the encoded variables to the rest of the variables in a table.

```
opTeam = categorical(simulatedData.team);
opTeamEncoded = dummyvar(opTeam);
operatingTeam = array2table(opTeamEncoded,'VariableNames',categories(opTeam));
providers = categorical(simulatedData.provider);
providersEncoded = dummyvar(providers);
providerNames = array2table(providersEncoded,'VariableNames',categories(providers));
dataTable = [simulatedData(:,{'lifetime','broken','pressureInd','moistureInd','temperatureInd'}), ...
    operatingTeam, providerNames];
head(dataTable)
```

```
    lifetime    broken    pressureInd    moistureInd    temperatureInd    TeamA    TeamB    TeamC    Provider1    Provider2    Provider3    Provider4
    ________    ______    ___________    ___________    ______________    _____    _____    _____    _________    _________    _________    _________
       56         0          92.179         104.23          96.517          1        0        0          0            0            0            1
       81         1          72.076         103.07          87.271          0        0        1          0            0            0            1
       60         0          96.272         77.801          112.2           1        0        0          1            0            0            0
       86         1          94.406         108.49          72.025          0        0        1          0            1            0            0
       34         0          97.753         99.413          103.76          0        1        0          1            0            0            0
       30         0          87.679         115.71          89.792          1        0        0          1            0            0            0
       68         0          94.614         85.702          142.83          0        1        0          0            1            0            0
       65         1          96.483         93.047          98.316          0        1        0          0            0            1            0
```

### Partition Dataset into Training Set and Testing Set

Partitioning the data set into subsets is essential in machine learning and model evaluation to prevent overfitting. Partitioning can be accomplished through methods such as holdout or k-fold cross-validation. Use a 20% holdout in the `cvpartition` function to divide the data set into separate training and testing sets. In practice, the choice of holdout proportion can vary. A common practice is to use around 70-80% of the data for training and the remaining 20-30% for validation, but these percentages can be adjusted based on the specific characteristics of the data set and problem domain.

Alternatively, you can specify `'KFold'` instead of `'Holdout'` in `cvpartition`. This example uses holdout so that a testing data set, unseen by the model, is kept aside for later.
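For reference, a k-fold partition can be created as in the following sketch (shown for comparison only; the rest of this example uses the holdout split):

```matlab
% Alternative: 5-fold cross-validation partition instead of holdout
cvp = cvpartition(size(dataTable,1),'KFold',5);
% For each fold i, training(cvp,i) and test(cvp,i) return logical
% index vectors selecting that fold's training and test rows
```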

After partitioning, separate out the predictors and response columns from both the training and testing sets.

```
rng('default') %For reproducibility
partition = cvpartition(size(dataTable,1),'Holdout',0.20);
trainIndices = training(partition);
testIndices = test(partition);
TrainData = dataTable(trainIndices,:); %Training set
TestData = dataTable(testIndices,:); %Testing set
Xtrain = TrainData(:,~strcmpi(TrainData.Properties.VariableNames,'broken')); %Predictors from training set
Ytrain = TrainData(:,'broken'); %Response from training set
Xtest = TestData(:,~strcmpi(TrainData.Properties.VariableNames,'broken')); %Predictors from testing set
Ytest = TestData(:,'broken'); %Response from testing set
```

### Train Model

To choose a machine learning model, there are several options, such as `fitctree`, `fitcsvm`, and `fitcknn`. In this example, the `fitctree` function creates a binary classification tree from the training data in `Xtrain` and the corresponding responses in `Ytrain`. This model is chosen for its efficiency and interpretability.

`treeMdl = fitctree(Xtrain,Ytrain);`

Typically, to better assess the performance and generalization ability of a model on unseen data, you can apply cross-validation. In cross-validation, the data is partitioned into subsets; the model is trained on all but one subset, and its performance is evaluated on the remaining subset. This process is repeated multiple times to obtain reliable performance estimates.

Create a partitioned model, `partitionedModel`. It is common to compute the 5-fold cross-validation misclassification error to strike a balance between variance reduction and computational efficiency. By default, `crossval` ensures that the class proportions in each fold remain approximately the same as the class proportions in the response variable `Ytrain`.

```
partitionedModel = crossval(treeMdl,'KFold',5);
validationAccuracy = 1-kfoldLoss(partitionedModel)
```

validationAccuracy = 0.9675

### Testing

The `loss` function evaluates the performance of the decision tree model. It quantifies the discrepancy between the predicted outputs of the model and the true target values in the testing data. `mdlError` represents the misclassification error on the testing set; subtracting it from 1 gives the accuracy. The goal is to minimize the error, indicating better model performance.

`mdlError = loss(treeMdl,Xtest,Ytest)`

mdlError = 0.0348

`testAccuracyWithCategoricalVars = 1-mdlError`

testAccuracyWithCategoricalVars = 0.9652

### Importance of Categorical Variables

To understand the difference in the performance of the classification model with and without the categorical variables, repeat the above steps to train another classification decision tree model without using categorical variables as features. Compare the accuracies of both models:

```
Xtrain_ = TrainData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'}); %No categorical variables
Ytrain_ = TrainData(:,{'broken'});
Xtest_ = TestData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'}); %No categorical variables
Ytest_ = TestData(:,{'broken'});
treeMdl_NoCatVars = fitctree(Xtrain_,Ytrain_); %Training
partitionedModel_NoCategorical = crossval(treeMdl_NoCatVars,'KFold',5); %Validation
validationAccuracy_NoCategorical = 1-kfoldLoss(partitionedModel_NoCategorical)
```

validationAccuracy_NoCategorical = 0.9238

`testAccuracyWithoutCategoricalVars = 1-loss(treeMdl_NoCatVars,Xtest_,Ytest_) %Testing`

testAccuracyWithoutCategoricalVars = 0.9312

The test accuracy drops from 96.5% to 93.1% when the categorical variables are ignored. This suggests that, in this scenario, including the categorical variables contributed to better model performance.

### Fit Covariate Survival Model to Data

In this section, fit a covariate survival model to the data set to predict the remaining useful life (RUL) of a machine. Covariate survival models are useful when the only available data are the failure times and associated covariates for an ensemble of similar components, such as multiple machines manufactured to the same specifications. Covariates are environmental or explanatory variables, such as the component manufacturer or operating conditions. Assuming that the `broken` status of a machine indicates its end of life, a `covariateSurvivalModel` estimates the RUL of a component using a proportional hazards survival model. Note that in this case, the non-numeric `team` and `provider` data can be used directly without additional encoding; the model encodes these variables using the specified option, one-hot encoding in this case.

```
clearvars -except simulatedData
mdl = covariateSurvivalModel('LifeTimeVariable',"lifetime",'LifeTimeUnit',"days", ...
    'DataVariables',["pressureInd","moistureInd","temperatureInd","team","provider"], ...
    'EncodedVariables',["team","provider"],"censorVariable","broken");
mdl.EncodingMethod = 'binary';
```

Split `simulatedData` into fitting data and test data. Define the test data as rows 4 and 5 of the `simulatedData` table.

```
Ctrain = simulatedData;
Ctrain(4:5,:) = [];
Ctest = simulatedData(4:5, ~strcmpi(simulatedData.Properties.VariableNames, 'broken'))
```

```
Ctest = 2×6 table
    lifetime    pressureInd    moistureInd    temperatureInd      team         provider
    ________    ___________    ___________    ______________    _________    _____________
       86          94.406         108.49          72.025        {'TeamC'}    {'Provider2'}
       34          97.753         99.413         103.76         {'TeamB'}    {'Provider1'}
```

Fit the covariate survival model with the training data.

`fit(mdl, Ctrain)`

Successful convergence: Norm of gradient less than OPTIONS.TolFun

Once the model is fit, verify it against the test data. The test data set response for row 4 is `'broken'`, and for row 5 it is `'not broken'`.

`predictRUL(mdl, Ctest(1,:))`

`ans = `*duration*
-44.405 days

`predictRUL(mdl, Ctest(2,:))`

`ans = `*duration*
10.997 days

The output of the `predictRUL` function is in days for this example, indicating the estimated remaining useful life of each machine. A positive value indicates the estimated number of days until failure, and a negative value indicates that the machine is past its estimated end-of-life time. Therefore, the model estimates the RUL successfully for both test data points. Note that the data set used in this example is not very large; training on a larger data set would make the resulting model more robust and improve prediction accuracy.

#### References

[1] Dataset created by http://walkerrowe.com/

## See Also

`categorical` | `dummyvar` | `predictRUL` | `covariateSurvivalModel` | `cvpartition` | `fitctree` | `fitcknn` | `fitcsvm`