Principal Component Analysis example in MATLAB

KaMu on 25 Jun 2014
Commented: Image Analyst on 26 Aug 2024
I think there is something wrong here. I am applying PCA using the Statistics Toolbox. I have data, XData, whose values range from 1 to 0.9, with 512 dimensions. I am using PCA to reduce the dimensionality. I was following the example at: http://www.mathworks.com/help/stats/feature-transformation.html#f75476
I applied: [coeff,score,latent] = pca(XData);
Then, to transform the coefficients so that they are orthonormal:
coefforth = inv(diag(std(XData)))*wcoeff;
When I test the data using: cscores = zscore(XData)*coefforth;
I can see that cscores and score are different. Note that I didn't need to weight my data.
I have also tried with a new data set :

Answers (1)

arushi on 26 Aug 2024
Hi Kamu,
It seems like you're trying to perform Principal Component Analysis (PCA) on your data using MATLAB and are encountering issues with transforming the coefficients to be orthonormal. Here are some things you may check:
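  • Data Standardization: With a plain pca(XData) call, score is computed from data that is centered but not scaled. zscore(XData) also divides each column by its standard deviation, so scores computed manually from zscore(XData) will only match score if pca itself worked on standardized data (for example, with 'VariableWeights','variance').
  • Coefficient Transformation: The transformation inv(diag(std(XData)))*wcoeff is only needed when pca is called with variable weights, as in the documentation example. Without weights, coeff is already orthonormal.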
  • Variable Naming: Ensure that wcoeff is correctly defined if you are using it separately. It seems you intended to use coeff.
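Here is a minimal sketch of the checks above. It uses a small random matrix X as a hypothetical stand-in for XData; the seed, the sizes, and the error variable names are assumptions for illustration only. It first checks that coeff from a plain pca call is orthonormal and reproduces score from the centered data, then repeats the weighted workflow from the documentation example, which is the case where the inv(diag(std(...)))*wcoeff step applies:

```matlab
% Hypothetical stand-in for XData: 100 observations, 5 variables
% with deliberately different spreads.
rng(0);                                   % fixed seed so the run is repeatable
X = rand(100, 5) * diag([1 2 3 4 5]);

% Case 1: plain pca (centered, not weighted).
[coeff, score] = pca(X);
Xc = X - mean(X, 1);                      % implicit expansion, R2016b or later
orthErr  = norm(coeff'*coeff - eye(size(coeff, 2)), 'fro');  % coeff orthonormal?
errPlain = norm(Xc*coeff - score, 'fro');                    % score = Xc*coeff?

% Case 2: weighted pca, as in the documentation example.
[wcoeff, wscore] = pca(X, 'VariableWeights', 'variance');
coefforth   = diag(std(X)) \ wcoeff;      % same result as inv(diag(std(X)))*wcoeff
cscores     = zscore(X) * coefforth;
errWeighted = norm(cscores - wscore, 'fro');

% Mixing the two: standardized data with unweighted coefficients.
errMixed = norm(zscore(X)*coeff - score, 'fro');

fprintf('orthErr = %.2g, errPlain = %.2g\n', orthErr, errPlain);
fprintf('errWeighted = %.2g, errMixed = %.2g\n', errWeighted, errMixed);
```

If the first three numbers come out near machine precision and errMixed does not, the mismatch in the question most likely comes from combining zscore with an unweighted pca call.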
Hope this helps.
  1 Comment
Image Analyst on 26 Aug 2024
It looks like pca has an option 'Centered', and the option is turned on by default. This will center the data columns about the mean for each column. You can turn it off if you want. It does not appear to do scaling so if your data had wildly different value ranges, then you'd want to scale them all to the same range, like 0-1, using rescale after you manually center them by subtracting the mean. If you're manually centering and scaling the data, you'd want the 'Centered' option to be off since it's already centered.
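As a rough illustration of the 'Centered' option, the sketch below compares the default call with manual centering plus 'Centered', false. It then compares one possible way of scaling (zscore, used here as an assumption in place of rescale) with the 'VariableWeights','variance' option described in the help text. The matrix X, the seed, and the variable names are made up for the example, and abs is used in the comparisons only to guard against possible sign flips of the components:

```matlab
% Hypothetical data: 50 observations, 4 variables with very different ranges.
rng(1);
X = rand(50, 4) * diag([1 10 100 1000]);

% Default call: pca centers the columns itself and returns the means in mu.
[coeffDefault, scoreDefault, ~, ~, ~, mu] = pca(X);

% Manual centering, then tell pca not to center again.
Xc = X - mean(X, 1);                      % implicit expansion, R2016b or later
[coeffManual, scoreManual] = pca(Xc, 'Centered', false);

muErr       = norm(mu - mean(X, 1));
centeredErr = norm(abs(scoreDefault) - abs(scoreManual), 'fro');

% Scaling: standardize manually, or let pca weight by inverse variance.
[~, scoreZ] = pca(zscore(X));
[~, scoreW] = pca(X, 'VariableWeights', 'variance');
scaledErr   = norm(abs(scoreZ) - abs(scoreW), 'fro');

fprintf('muErr = %.2g, centeredErr = %.2g, scaledErr = %.2g\n', ...
        muErr, centeredErr, scaledErr);
```

Note that latent is not compared here: with 'Centered', false the degrees of freedom change, so the variances are expected to differ slightly between the two calls.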
help pca

PCA Principal Component Analysis (PCA) on raw data.

  COEFF = PCA(X) returns the principal component coefficients for the N by P data matrix X. Rows of X correspond to observations and columns to variables. Each column of COEFF contains coefficients for one principal component. The columns are in descending order in terms of component variance (LATENT). PCA, by default, centers the data and uses the singular value decomposition algorithm. For the non-default options, use the name/value pair arguments.

  [COEFF, SCORE] = PCA(X) returns the principal component score, which is the representation of X in the principal component space. Rows of SCORE correspond to observations, columns to components. The centered data can be reconstructed by SCORE*COEFF'.

  [COEFF, SCORE, LATENT] = PCA(X) returns the principal component variances, i.e., the eigenvalues of the covariance matrix of X, in LATENT.

  [COEFF, SCORE, LATENT, TSQUARED] = PCA(X) returns Hotelling's T-squared statistic for each observation in X. PCA uses all principal components to compute the TSQUARED (computes in the full space) even when fewer components are requested (see the 'NumComponents' option below). For TSQUARED in the reduced space, use MAHAL(SCORE,SCORE).

  [COEFF, SCORE, LATENT, TSQUARED, EXPLAINED] = PCA(X) returns a vector containing the percentage of the total variance explained by each principal component.

  [COEFF, SCORE, LATENT, TSQUARED, EXPLAINED, MU] = PCA(X) returns the estimated mean, MU, when 'Centered' is set to true; and all zeros when set to false.

  [...] = PCA(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies optional parameter name/value pairs to control the computation and handling of special data types. Parameters are:

  'Algorithm' - Algorithm that PCA uses to perform the principal component analysis. Choices are:
      'svd' - Singular Value Decomposition of X (the default).
      'eig' - Eigenvalue Decomposition of the covariance matrix. It is faster than SVD when N is greater than P, but less accurate because the condition number of the covariance is the square of the condition number of X.
      'als' - Alternating Least Squares (ALS) algorithm which finds the best rank-K approximation by factoring X into an N-by-K left factor matrix and a P-by-K right factor matrix, where K is the number of principal components. The factorization uses an iterative method starting with random initial values. ALS algorithm is designed to better handle missing values. It deals with missing values without listwise deletion (see {'Rows', 'complete'}).

  'Centered' - Indicator for centering the columns of X. Choices are:
      true - The default. PCA centers X by subtracting off column means before computing SVD or EIG. If X contains NaN missing values, NANMEAN is used to find the mean with any data available.
      false - PCA does not center the data. In this case, the original data X can be reconstructed by X = SCORE*COEFF'.

  'Economy' - Indicator for economy size output, when D the degrees of freedom is smaller than P. D, is equal to M-1, if data is centered and M otherwise. M is the number of rows without any NaNs if you use 'Rows', 'complete'; or the number of rows without any NaNs in the column pair that has the maximum number of rows without NaNs if you use 'Rows', 'pairwise'. When D < P, SCORE(:,D+1:P) and LATENT(D+1:P) are necessarily zero, and the columns of COEFF(:,D+1:P) define directions that are orthogonal to X. Choices are:
      true - This is the default. PCA returns only the first D elements of LATENT and the corresponding columns of COEFF and SCORE. This can be significantly faster when P is much larger than D. NOTE: PCA always returns economy size outputs if 'als' algorithm is specified.
      false - PCA returns all elements of LATENT. Columns of COEFF and SCORE corresponding to zero elements in LATENT are zeros.

  'NumComponents' - The number of components desired, specified as a scalar integer K satisfying 0 < K <= P. When specified, PCA returns the first K columns of COEFF and SCORE.

  'Rows' - Action to take when the data matrix X contains NaN values. If 'Algorithm' option is set to 'als', this option is ignored as ALS algorithm deals with missing values without removing them. Choices are:
      'complete' - The default action. Observations with NaN values are removed before calculation. Rows of NaNs are inserted back into SCORE at the corresponding location.
      'pairwise' - If specified, PCA switches 'Algorithm' to 'eig'. This option only applies when 'eig' method is used. The (I,J) element of the covariance matrix is computed using rows with no NaN values in columns I or J of X. Please note that the resulting covariance matrix may not be positive definite. In that case, PCA terminates with an error message.
      'all' - X is expected to have no missing values. All data are used, and execution will be terminated if NaN is found.

  'Weights' - Observation weights, a vector of length N containing all positive elements.

  'VariableWeights' - Variable weights. Choices are:
      - a vector of length P containing all positive elements.
      - the string 'variance'. The variable weights are the inverse of sample variance. If 'Centered' is set true at the same time, the data matrix X is centered and standardized. In this case, PCA returns the principal components based on the correlation matrix.

  The following parameter name/value pairs specify additional options when alternating least squares ('als') algorithm is used.

  'Coeff0' - Initial value for COEFF, a P-by-K matrix. The default is a random matrix.

  'Score0' - Initial value for SCORE, an N-by-K matrix. The default is a matrix of random values.

  'Options' - An options structure as created by the STATSET function. PCA uses the following fields:
      'Display' - Level of display output. Choices are 'off' (the default), 'final', and 'iter'.
      'MaxIter' - Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching MaxIter is regarded as convergence.
      'TolFun' - Positive number giving the termination tolerance for the cost function. The default is 1e-6.
      'TolX' - Positive number giving the convergence threshold for relative change in the elements of L and R. The default is 1e-6.

  Example:
      load hald;
      [coeff, score, latent, tsquared, explained] = pca(ingredients);

  See also PPCA, PCACOV, PCARES, BIPLOT, BARTTEST, CANONCORR, FACTORAN, ROTATEFACTORS.

  Documentation for pca: doc pca
  Other uses of pca: gpuArray/pca, tall/pca
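Since the original goal was dimension reduction, here is a small sketch of how the EXPLAINED output and the 'NumComponents' option from the help text might be combined. The synthetic data, the sizes, and the 95% threshold are arbitrary assumptions for illustration, not a recommendation for the 512-dimensional XData:

```matlab
% Hypothetical data: 200 observations in 20 dimensions that mostly
% live in a 3-dimensional subspace, plus a little noise.
rng(2);
N = 200; P = 20; latentDim = 3;
X = randn(N, latentDim) * randn(latentDim, P) + 0.01 * randn(N, P);

% Full decomposition; 'explained' holds the percentage of variance per component.
[~, ~, ~, ~, explained, mu] = pca(X);

% Smallest number of components reaching at least 95% of the total variance
% (the 95% threshold is an arbitrary choice for this sketch).
k = find(cumsum(explained) >= 95, 1);

% Keep only the first k components.
[coeffK, scoreK] = pca(X, 'NumComponents', k);

% Approximate reconstruction of the original data from the reduced scores.
Xhat   = scoreK * coeffK' + mu;           % implicit expansion, R2016b or later
relErr = norm(X - Xhat, 'fro') / norm(X - mu, 'fro');

fprintf('kept %d of %d components, relative reconstruction error %.3g\n', ...
        k, P, relErr);
```

Here scoreK is the reduced N-by-k representation. New observations could be projected the same way with (Xnew - mu)*coeffK, assuming they have the same P variables.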
