Documentation

ppca

Probabilistic principal component analysis

Description

example

[coeff,score,pcvar] = ppca(Y,K) returns the principal component coefficients for the n-by-p data matrix Y based on a probabilistic principal component analysis (PPCA). It also returns the principal component scores, which are the representations of Y in the principal component space, and the principal component variances, which are the eigenvalues of the covariance matrix of Y, in pcvar.

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. Rows of score correspond to observations, and columns correspond to components. Rows of Y correspond to observations and columns correspond to variables.

Probabilistic principal component analysis might be preferable to other algorithms that handle missing data, such as the alternating least squares algorithm when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. An expectation-maximization algorithm is used for both complete and missing data.

example

[coeff,score,pcvar] = ppca(Y,K,Name,Value) returns the principal component coefficients, scores, and variances using additional options for computation and handling of special data types, specified by one or more Name,Value pair arguments.

For example, you can introduce initial values for the residual variance, v, or change the termination criteria.

example

[coeff,score,pcvar,mu] = ppca(___) also returns the estimated mean of each variable in Y. You can use any of the input arguments in the previous syntaxes.

example

[coeff,score,pcvar,mu,v,S] = ppca(___) also returns the isotropic residual variance in v and the final results at convergence in structure S.

Examples

collapse all

Load the sample data.

The double matrix meas consists of four types of measurements on the flowers, which, respectively, are the length and width of sepals and petals.

Introduce missing values randomly.

y = meas;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.20;
y(ix) = NaN;

Now, approximately 20% of the data is missing, indicated by NaN.

Perform probabilistic principal component analysis and request the component coefficients and variances.

[coeff,score,pcvar,mu] = ppca(y,3);
coeff
coeff = 4×3

0.3562    0.6709   -0.5518
-0.0765    0.7120    0.6332
0.8592   -0.1597    0.0596
0.3592   -0.1318    0.5395

pcvar
pcvar = 3×1

4.0914
0.2125
0.0617

Perform principal component analysis using the alternating least squares algorithm and request the component coefficients and variances.

[coeff2,score2,pcvar2,mu2] = pca(y,'algorithm','als',...
'NumComponents',3);
coeff2
coeff2 = 4×3

0.3376    0.4952    0.7406
-0.0731    0.8609   -0.4476
0.8657   -0.1168   -0.1233
0.3623   -0.0086   -0.4857

pcvar2
pcvar2 = 3×1

4.0733
0.2652
0.1222

The coefficients and the variances of the first two principal components are similar.

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors.

subspace(coeff,coeff2)
ans = 0.0884

The angle between the two spaces is pretty small. This indicates that these two results are close to each other.

Load the sample data set.

Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.

Perform probabilistic principal component analysis and display the first three principal components.

[coeff,score,pcvar] = ppca(X(:,3:15),3);
Warning: Maximum number of iterations 1000 reached.

Change the termination tolerance for the cost function to 0.01.

opt = statset('ppca');
opt.TolFun = 0.01;

Perform probabilistic principal component analysis.

[coeff,score,pcvar] = ppca(X(:,3:15),3,'Options',opt);
Warning: Maximum number of iterations 1000 reached.

ppca now terminates before the maximum number of iterations is reached because it meets the tolerance for the cost function.

Load the sample data.

y = ingredients;

The ingredients data has 13 observations for 4 variables.

Introduce missing values to the data.

y(2:16:end) = NaN;

Every 16th value is NaN. This corresponds to 7.69% of the data.

Find the first three principal components of data using PPCA and display the reconstructed observations.

[coeff,score,pcvar,mu,v,S] = ppca(y,3);
Warning: Maximum number of iterations 1000 reached.
S.Recon
ans = 13×4

6.8536   25.8700    5.8389   59.8730
1.0433   28.9710   14.9654   51.9738
11.5770   56.5067    8.6352   20.5076
11.0835   31.0722    8.0920   47.0748
7.0679   52.2556    6.0748   33.0598
11.0486   55.0430    9.0534   22.0423
2.8493   70.8691   16.8339    5.8656
1.0333   31.0281   19.6907   44.0306
2.0400   54.0354   18.0440   22.0349
20.7822   46.8091    3.7603   25.8081
⋮

You can also reconstruct the observations using the principal components and the estimated mean.

t = score*coeff' + repmat(mu,13,1);

Here, ingredients is a real-valued matrix of predictor variables.

Perform the probabilistic principal components analysis and display coefficients.

[coeff,score,pcvariance,mu,v,S] = ppca(ingredients,3);
Warning: Maximum number of iterations 1000 reached.
coeff
coeff = 4×3

-0.0693   -0.6459    0.5673
-0.6786   -0.0184   -0.5440
0.0308    0.7552    0.4036
0.7306   -0.1102   -0.4684

Display the algorithm results at convergence of the PPCA.

S
S = struct with fields:
W: [4x3 double]
Xexp: [13x3 double]
Recon: [13x4 double]
v: 0.2372
NumIter: 1000
RMSResid: 0.2340
nloglk: 149.3388

Display the matrix W.

S.W
ans = 4×3

0.5624    2.0279    5.4075
4.8320  -10.3894    5.9202
-3.7521   -3.0555   -4.1552
-1.5144   11.7122   -7.2564

Orthogonalizing W recovers the coefficients.

orth(S.W)
ans = 4×3

-0.0693    0.6459    0.5673
-0.6786    0.0184   -0.5440
0.0308   -0.7552    0.4036
0.7306    0.1102   -0.4684

Input Arguments

collapse all

Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of Y correspond to observations and columns correspond to variables.

Data Types: single | double

Number of principal components to return, specified as an integer value less than the rank of data. The maximum possible rank is min(n,p), where n is the number of observations and p is the number of variables. However, if the data is correlated, the rank might be smaller than min(n,p).

ppca orders the components based on their variance.

If K is min(n,p), ppca sets K equal to min(n,p) – 1, and 'W0' is truncated to min(p,n) – 1 columns if you specify a p-by-p W0 matrix.

For example, you can request only the first three components, based on the component variance as follows.

Example: coeff = ppca(Y,3)

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'W0',init,'Options',opt specifies that the initial values for 'W0' are in matrix init and ppca uses the options defined by opt.

Initial value of W in the probabilistic principal component analysis algorithm, specified as a comma-separated pair consisting of 'W0' and a p-by-k matrix.

Data Types: single | double

Initial value of residual variance, specified as the comma-separated pair consisting of 'v0' and a positive scalar value.

Data Types: single | double

Options for the iterations, specified as a comma-separated pair 'Options' and a structure created by the statset function. ppca uses the following fields in the options structure.

 'Display' Level of display output. Choices are 'off', 'final', and 'iter'. 'MaxIter' Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the MaxIter value is regarded as convergence. 'TolFun' Positive integer stating the termination tolerance for the cost function. The default is 1e-6. 'TolX' Positive integer stating the convergence threshold for the relative change in the elements of W. The default is 1e-6.

You can change the values of these fields and specify the new structure in ppca using the 'Options' name-value pair argument.

Example: opt = statset('ppca'); opt.MaxIter = 2000; coeff = ppca(Y,3,'Options',opt);

Data Types: struct

Output Arguments

collapse all

Principal component coefficients, returned as a p-by-k matrix. Each column of coeff contains coefficients for one principal component. The columns are in the order of descending component variance, pcvar.

Principal component scores, returned as an n-by-k matrix. Rows of score correspond to observations, and columns correspond to components.

Principal component variances, which are the eigenvalues of the covariance matrix of Y, returned as a column vector.

Estimated mean of each variable in Y, returned as a row vector.

Isotropic residual variance, returned as a scalar value.

Final results at convergence, returned as a structure containing the following fields.

 W W at convergence. Xexp Conditional expectation of the estimated latent variable x. Recon Reconstructed observations using k principal components. This is a low dimension approximation of the input data Y, and is equal to mu + score*coeff'. v Residual variance. RMSResid Root mean square of residuals. NumIter Number of iteration counts. nloglk Negative loglikelihood function value.

collapse all

Probabilistic Principal Component Analysis

Probabilistic principal component analysis (PPCA) is a method to estimate the principal axes when any data vector has one or more missing values.

PPCA is based on an isotropic error model. It seeks to relate a p-dimensional observation vector y to a corresponding k-dimensional vector of latent (or unobserved) variable x, which is normal with mean zero and covariance I(k). The relationship is

${y}^{T}=W\ast {x}^{T}+\mu +\epsilon ,$

where y is the row vector of observed variable, x is the row vector of latent variables, and ε is the isotropic error term. ε is Gaussian with mean zero and covariance of v*I(k), where v is the residual variance. Here, k needs to be smaller than the rank for the residual variance to be greater than 0 (v>0). Standard principal component analysis, where the residual variance is zero, is the limiting case of PPCA. The observed variables, y, are conditionally independent given the values of the latent variables, x. So, the latent variables explain the correlations between the observation variables and the error explains the variability unique to a particular yi. The p-by-k matrix W relates the latent and observation variables, and the vector μ permits the model to have a nonzero mean. PPCA assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values.

Under this model,

$y~N\left(\mu ,W*{W}^{T}+v*I\left(k\right)\right).$

There is no closed-form analytical solution for W and v, so their estimates are determined by iterative maximization of the corresponding loglikelihood using an expectation-maximization (EM) algorithm. This EM algorithm handles missing values by treating them as additional latent variables. At convergence, the columns of W spans the subspace, but they are not orthonormal. ppca obtains the orthonormal coefficients, coeff, for the components by orthogonalization of W.

 Tipping, M. E., and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 61, No.3, 1999, pp. 611–622.

 Roweis, S. “EM Algorithms for PCA and SPCA.” In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Vol.10 (NIPS 1997), Cambridge, MA, USA: MIT Press, 1998, pp. 626–632.

 Ilin, A., and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values.” J. Mach. Learn. Res.. Vol. 11, August, 2010, pp. 1957–2000.