# Method to forecast categorical variable from numerous numerical predictors?

Valkmi on 20 Jun 2017
Commented: Greg Heath on 22 Jun 2017
Hey!
After a few years of using MATLAB I have stumbled upon the mightiest challenge I have yet faced. Suppose you have access to the following data (measurements are taken every second, from 1.1.2017 to 20.6.2017):
- Binary data (0 or 1): 0 for a normal situation and 1 whenever a failure occurs. So far, 171 failures have occurred among millions of data points, so the vector consists mostly of 0s with just a few 1s.
- Process data (temperature, speed, moisture, etc.) from all the processes that I think might cause a failure during production.
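To put the imbalance described above into numbers, here is a rough sketch. It assumes one sample per second with no gaps between the two dates, which the question does not confirm; the variable names are illustrative only.

```matlab
% Assumption: one sample per second, no gaps, between the two dates given.
tStart = datetime(2017, 1, 1);
tEnd   = datetime(2017, 6, 20);

nSamples  = seconds(tEnd - tStart);   % duration converted to a number of seconds
nFailures = 171;                      % failure count stated in the question

failureRate = nFailures / nSamples;   % fraction of samples labelled 1

% Accuracy of a trivial model that always predicts "no failure" (0)
baselineAccuracy = 1 - failureRate;

fprintf('Samples: %d\n', nSamples);
fprintf('Failure rate: %.2e\n', failureRate);
fprintf('Always-zero accuracy: %.6f\n', baselineAccuracy);
```

Under that assumption, a model that never predicts a failure would still score almost perfect accuracy, so plain accuracy is probably not a useful measure here.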
The problem is to create a model or algorithm that predicts when a failure might happen and why it happens. So far I have visualized the data to look for correlations between failures and process data, removed obvious outliers, and tried some feature selection algorithms such as sequentialfs. I also tried creating some forecasting algorithms. None of this has worked, and I think I know why:
- Too many process parameters to visually analyze thoroughly.
- The failure might be caused by changes in the past. For example, a temperature change at the beginning of the process might cause a failure ten seconds later at the end of the production process.
- The failure might be caused by a combination of changes in the process parameters. For example, a moisture change at the beginning of the process followed twenty seconds later by an increase in speed might cause a failure.
- Knowledge of the process does not help: it is so complicated that the failure might be caused by any of the hundreds of process parameters or their combinations.
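One way the delayed and combined effects listed above could be exposed to a classifier is to append time-shifted copies of each signal as extra predictor columns. The following is only a minimal sketch on synthetic data; the lags of 0, 10 and 20 samples and the names X, lags and Xlag are hypothetical choices, not taken from the actual process.

```matlab
rng(0);                                % reproducible synthetic data
nSamples = 1000;                       % stand-in for the real record length
nVars    = 3;                          % e.g. temperature, speed, moisture
X = randn(nSamples, nVars);            % rows = seconds, columns = signals

lags   = [0 10 20];                    % hypothetical delays in samples (seconds)
maxLag = max(lags);
nRows  = nSamples - maxLag;            % rows for which every lag is available

Xlag = zeros(nRows, nVars * numel(lags));
for k = 1:numel(lags)
    % Row i of Xlag describes time t = maxLag + i;
    % this block holds the signal values at time t - lags(k).
    rows = (maxLag + 1 - lags(k)) : (nSamples - lags(k));
    cols = (k - 1) * nVars + (1:nVars);
    Xlag(:, cols) = X(rows, :);
end

% The failure vector would be trimmed the same way, e.g.
% yTrim = y(maxLag + 1 : end);
```

Each row of Xlag then holds the current values together with the values 10 and 20 seconds earlier, so a model or a feature selection routine such as sequentialfs can pick up delayed relationships without being told the delay in advance.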
What would be the best method to start solving this problem? Naïve Bayes did not work, and neural networks are, as far as I know, not suited to categorical predictions and are hard to interpret. The overall complexity of the process makes this a very hard puzzle.
I can't share the data.
BR
Greg Heath on 22 Jun 2017
You are wrong:
The neural network function PATTERNNET is designed for multiclass classification with output targets that are columns of ones and zeros.
LOGSIG is the output function when the classes are not mutually exclusive (a target column can contain more than one "1")
and
SOFTMAX is the output function when the classes are mutually exclusive (the target columns are columns of the unit matrix).
See the documentation
help patternnet
doc patternnet
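A minimal sketch of how PATTERNNET could be applied to a two-class normal/failure problem is shown below. It requires the Neural Network (Deep Learning) Toolbox. The data, the failure rule and the hidden layer size of 10 are all made up for illustration and are not taken from the question.

```matlab
rng(0);                                    % reproducible synthetic example
nSamples = 500;
x = randn(3, nSamples);                    % predictors: one COLUMN per sample

% Hypothetical failure rule, used only to generate labels for this demo
isFailure = (x(1, :) + 0.5 * x(2, :)) > 1.5;

% Targets: each column is a column of the 2x2 unit matrix
% row 1 = normal, row 2 = failure
t = double([~isFailure; isFailure]);

net = patternnet(10);                      % 10 hidden neurons (arbitrary choice)
net.trainParam.showWindow = false;         % suppress the training GUI
net = train(net, x, t);

y = net(x);                                % class scores, one column per sample
[~, predicted] = max(y, [], 1);            % 1 = normal, 2 = failure

% Output transfer function actually used by this release
disp(net.layers{end}.transferFcn);

% Fraction of true failures that were detected (more telling than accuracy)
failureRecall = sum(predicted == 2 & isFailure) / sum(isFailure);
fprintf('Failure recall on the training data: %.2f\n', failureRecall);
```

With real data the failure class would be far rarer than in this demo, so the result should be judged on a held-out set and on failure recall rather than on overall accuracy.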
Hope this helps.
Thank you for formally accepting my answer
Greg

Ankita Nargundkar on 22 Jun 2017
This webinar might be a good place to start: link
