Statistics and Machine Learning Toolbox
Analyze and model data using statistics and machine learning
Statistics and Machine Learning Toolbox™ provides functions and apps to describe, analyze, and model data. You can use descriptive statistics, visualizations, and clustering for exploratory data analysis; fit probability distributions to data; generate random numbers for Monte Carlo simulations; and perform hypothesis tests. Regression and classification algorithms let you draw inferences from data and build predictive models, either interactively with the Classification Learner and Regression Learner apps, or programmatically using AutoML.
For multidimensional data analysis and feature extraction, the toolbox provides principal component analysis (PCA), regularization, dimensionality reduction, and feature selection methods that let you identify variables with the best predictive power.
The toolbox provides supervised, semi-supervised, and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted decision trees, k-means, and other clustering methods. You can apply interpretability techniques such as partial dependence plots and LIME, and automatically generate C/C++ code for embedded deployment. Many toolbox algorithms can be used on data sets that are too big to be stored in memory.
Get Started:
Visualizations
Visually explore data using probability plots, box plots, histograms, quantile-quantile plots, and advanced plots for multivariate analysis, such as dendrograms, biplots, and Andrews plots.
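For example, a minimal sketch using boxplot, histogram, and qqplot with the carsmall sample data set that ships with MATLAB (the choice of variables is illustrative):

    load carsmall                    % sample data: MPG, Origin, Weight, ...
    histogram(MPG)                   % distribution of fuel economy
    figure, boxplot(MPG, Origin)     % box plots grouped by country of origin
    figure, qqplot(MPG)              % quantile-quantile plot against a normal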
Descriptive Statistics
Understand and describe potentially large sets of data quickly using a few highly relevant numbers.
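A brief sketch of common summaries, again on the carsmall sample data (the statistics chosen are illustrative):

    load carsmall
    m = mean(MPG, 'omitnan')               % central tendency
    s = std(MPG, 'omitnan')                % spread
    q = quantile(MPG, [0.25 0.5 0.75])     % quartiles (NaNs are ignored)
    groupMeans = grpstats(MPG, Origin)     % mean MPG per country of origin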
Cluster Analysis
Discover patterns by grouping data using k-means, k-medoids, DBSCAN, hierarchical and spectral clustering, and Gaussian mixture and hidden Markov models.
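For illustration, a minimal sketch that clusters the fisheriris sample data three ways (the epsilon and minpts values for dbscan are assumptions, not tuned settings):

    load fisheriris
    X = meas;                        % 150-by-4 numeric matrix
    idxK = kmeans(X, 3);             % k-means with three clusters
    idxD = dbscan(X, 0.6, 5);        % density-based clustering (epsilon, minpts)
    Z = linkage(X, 'ward');          % hierarchical clustering tree
    idxH = cluster(Z, 'Maxclust', 3);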
Feature Extraction
Extract features from data using unsupervised learning techniques such as sparse filtering and reconstruction ICA. You can also use specialized techniques to extract features from images, signals, text, and numeric data.
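A minimal sketch of unsupervised feature extraction with rica and sparsefilt (fisheriris is used only as convenient numeric data; q is an assumed feature count):

    load fisheriris
    q = 2;                           % number of features to learn (assumption)
    MdlR = rica(meas, q);            % reconstruction ICA
    MdlS = sparsefilt(meas, q);      % sparse filtering
    Znew = transform(MdlR, meas);    % extracted feature representation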
Feature Selection
Automatically identify the subset of features that provide the best predictive power in modeling the data. Feature selection methods include stepwise regression, sequential feature selection, regularization, and ensemble methods.
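For example, ranking predictors with the minimum-redundancy maximum-relevance (MRMR) criterion, available in recent releases (sequentialfs and lasso follow similar patterns):

    load fisheriris
    [idx, scores] = fscmrmr(meas, species)   % rank predictors by MRMR relevance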
Feature Transformation and Dimensionality Reduction
Reduce dimensionality by transforming existing (non-categorical) features into new predictor variables where less descriptive features can be dropped. Feature transformation methods include PCA, factor analysis, and nonnegative matrix factorization.
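A minimal PCA sketch on the hald sample data set:

    load hald                               % ingredients: 13-by-4 predictor matrix
    [coeff, score, ~, ~, explained] = pca(ingredients);
    cumsum(explained)                       % cumulative variance explained (%)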
Train, Validate, and Tune Predictive Models
Compare various machine learning algorithms, including shallow neural networks; select features; adjust hyperparameters; and evaluate the performance of popular classification and regression algorithms. Build and automatically optimize predictive models with interactive apps, and incrementally improve models with streaming data. Reduce the need for labeled data by applying semi-supervised learning.
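For example, a sketch that trains a multiclass SVM and estimates its generalization error by cross-validation (the 5-fold setting is an assumption):

    load fisheriris
    Mdl = fitcecoc(meas, species);          % multiclass SVM via ECOC
    CVMdl = crossval(Mdl, 'KFold', 5);      % 5-fold cross-validation
    err = kfoldLoss(CVMdl)                  % estimated misclassification rate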
Model Interpretability
Enhance the interpretability of black-box machine learning models by using inherently interpretable models such as generalized additive models (GAMs), or by applying established interpretability methods including partial dependence plots, individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), and Shapley values.
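A minimal sketch (lime requires a recent toolbox release; the query point and number of important predictors are assumptions):

    load fisheriris
    Mdl = fitcensemble(meas, species);         % tree ensemble as the black-box model
    plotPartialDependence(Mdl, 1, 'setosa')    % marginal effect of predictor 1 on one class
    explainer = lime(Mdl);                     % local surrogate explainer
    explainer = fit(explainer, meas(1,:), 2);  % explain one observation with 2 predictors
    figure, plot(explainer)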
Automated Machine Learning (AutoML)
Improve model performance by automatically tuning hyperparameters, generating and selecting features and models, and addressing data set imbalances with cost matrices.
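For illustration, a sketch that lets fitcauto search over model types and hyperparameters on the ionosphere sample data (the evaluation budget is an assumption):

    load ionosphere                  % X: 351-by-34 predictors, Y: 'g'/'b' labels
    Mdl = fitcauto(X, Y, 'HyperparameterOptimizationOptions', ...
        struct('MaxObjectiveEvaluations', 30));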
Linear and Nonlinear Regression
Model the behavior of complex systems with multiple predictors or response variables by choosing from many linear and nonlinear regression algorithms. Fit multilevel or hierarchical linear, nonlinear, and generalized linear mixed-effects models with nested and/or crossed random effects to perform longitudinal or panel analyses, repeated measures modeling, and growth modeling.
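A minimal linear-regression sketch on the carsmall sample data (fitnlm, fitglm, fitlme, and fitglme follow the same table-based workflow for the other model families named above):

    load carsmall
    tbl = table(Weight, Horsepower, MPG);
    lm = fitlm(tbl, 'MPG ~ Weight + Horsepower')   % multiple linear regression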
Nonparametric Regression
Generate an accurate fit without specifying a model that describes the relationship between predictors and response by using SVMs, random forests, shallow neural networks, Gaussian processes, and Gaussian kernels.
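For example, a Gaussian process fit to synthetic data (the signal and noise level are assumptions made for illustration):

    rng(1)                            % reproducibility
    x = linspace(0, 10, 100)';
    y = sin(x) + 0.2*randn(size(x));  % noisy synthetic observations
    gpr = fitrgp(x, y);               % GP regression with default kernel
    plot(x, y, '.', x, predict(gpr, x), '-')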
Analysis of Variance (ANOVA)
Assign sample variance to different sources and determine whether the variation arises within or among different population groups. Use one-way, two-way, multiway, multivariate, and nonparametric ANOVA, as well as analysis of covariance (ANOCOVA) and repeated measures analysis of variance (RANOVA).
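A one-way ANOVA sketch on the carsmall sample data:

    load carsmall
    [p, tbl, stats] = anova1(MPG, Origin);   % does mean MPG differ by origin?
    multcompare(stats)                       % post-hoc pairwise comparisons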
Probability Distributions
Fit continuous and discrete distributions, use statistical plots to evaluate goodness-of-fit, and compute probability density functions and cumulative distribution functions for more than 40 different distributions.
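For example, fitting a normal distribution to one variable of fisheriris and checking the fit (the choice of distribution is illustrative):

    load fisheriris
    pd = fitdist(meas(:,1), 'Normal');        % maximum-likelihood fit
    x = linspace(4, 8, 100);
    plot(x, pdf(pd, x))                       % fitted density
    [h, pval] = kstest(meas(:,1), 'CDF', pd)  % goodness-of-fit check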
Random Number Generation
Generate pseudorandom and quasi-random number streams from either a fitted or a constructed probability distribution.
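A minimal sketch of both stream types (the gamma parameters and sample sizes are assumptions):

    pd = makedist('Gamma', 'a', 2, 'b', 1);   % constructed distribution
    r = random(pd, 1000, 1);                  % pseudorandom draws
    qs = haltonset(2);                        % 2-D quasi-random (Halton) stream
    u = net(qs, 1000);                        % first 1000 low-discrepancy points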
Hypothesis Testing
Perform t-tests, distribution tests (chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov), and nonparametric tests for one, paired, or independent samples. Test for autocorrelation and randomness, and compare distributions (two-sample Kolmogorov-Smirnov).
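For illustration, two-sample tests on synthetic data (the shift of 0.3 is an assumption):

    rng default
    x = randn(100, 1);
    y = randn(100, 1) + 0.3;
    [h1, p1] = ttest2(x, y)          % two-sample t-test for equal means
    [h2, p2] = kstest2(x, y)         % two-sample Kolmogorov-Smirnov test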
Design of Experiments (DOE)
Define, analyze, and visualize a customized DOE. Create and test practical plans for how to manipulate data inputs in tandem to generate information about their effects on data outputs.
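A small sketch of two-level factorial designs (three factors chosen for illustration):

    dFull = ff2n(3)                  % full 2^3 factorial design (0/1 coding)
    dFrac = fracfact('a b ab')       % fractional design from generator strings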
Statistical Process Control (SPC)
Monitor and improve products or processes by evaluating process variability. Create control charts, estimate process capability, and perform gage repeatability and reproducibility studies.
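For example, an Xbar control chart and a capability study on the parts sample data (the specification limits are assumptions):

    load parts                                % runout: 36 samples of 4 measurements
    st = controlchart(runout);                % Xbar chart (default for replicated data)
    cp = capability(runout(:), [-0.5 0.5])    % Cp, Cpk against assumed spec limits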
Reliability and Survival Analysis
Visualize and analyze time-to-failure data with and without censoring by performing Cox proportional hazards regression and fitting probability distributions. Compute empirical hazard, survivor, and cumulative distribution functions, as well as kernel density estimates.
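A minimal sketch, assuming the readmissiontimes sample data set available in recent toolbox releases:

    load readmissiontimes             % ReadmissionTime, Censored, Age, ...
    [f, x] = ecdf(ReadmissionTime, 'Censoring', Censored, 'Function', 'survivor');
    stairs(x, f)                      % empirical survivor curve
    b = coxphfit(Age, ReadmissionTime, 'Censoring', Censored)  % Cox PH coefficient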
Analyze Big Data with Tall Arrays
Use tall arrays and tables with many classification, regression, and clustering algorithms to train models on data sets that do not fit in memory, without changing your code.
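For example, the same fitlm call used on in-memory tables works on a tall table backed by a datastore (airlinesmall.csv ships with MATLAB; the variable selection is illustrative):

    ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
        'SelectedVariableNames', {'Distance', 'DepDelay', 'ArrDelay'});
    t = tall(ds);                                     % out-of-memory tall table
    lm = fitlm(t, 'ArrDelay ~ Distance + DepDelay')   % same syntax as in-memory data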
Parallel Computation
Speed up statistical computations and model training with parallelization.
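A minimal sketch, assuming Parallel Computing Toolbox™ and an available worker pool (the ensemble size is an assumption):

    load fisheriris
    opts = statset('UseParallel', true);
    Mdl = TreeBagger(200, meas, species, 'Options', opts);   % trees grown in parallel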
Cloud and Distributed Computing
Use cloud instances to speed up statistical and machine learning computations. Perform the complete machine learning workflow in MATLAB Online™.
Code Generation
Generate portable and readable C or C++ code for inference of classification and regression algorithms, descriptive statistics, and probability distributions using MATLAB Coder™. Generate C/C++ prediction code with reduced precision using Fixed Point Designer™, and update parameters of deployed models without regenerating the prediction code.
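For illustration, the typical save/load pattern for code generation (the entry-point file name predictSVM.m is assumed):

    load fisheriris
    Mdl = fitcsvm(meas(51:end,:), species(51:end));  % binary SVM on two classes
    saveLearnerForCoder(Mdl, 'svmModel');            % serialize the model for codegen
    % Entry-point function predictSVM.m (name assumed):
    %   Mdl = loadLearnerForCoder('svmModel');
    %   label = predict(Mdl, X);
    % Then: codegen predictSVM -args {coder.typeof(0, [1 4])}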
Integration with Simulink
Integrate machine learning models with Simulink models for deployment to embedded hardware or for system simulation, verification, and validation.
Integrate with Applications and Enterprise Systems
Deploy statistical and machine learning models as standalone, MapReduce, or Spark™ applications; as web apps; or as Microsoft® Excel® add-ins using MATLAB Compiler™. Build C/C++ shared libraries, Microsoft .NET assemblies, Java® classes, and Python® packages using MATLAB Compiler SDK™.