- https://www.mathworks.com/help/stats/pca.html#:~:text=Description-,on,-Default.%20pca
- https://www.mathworks.com/help/matlab/ref/cov.html#:~:text=is%20defined%20as-,cov,),-where%20%CE%BCA
question about how the function pca() calculates the covariance matrix internally
I was puzzled by the output of pca() with and without mean centering. I am using MATLAB R2024a.
pca.m uses the internal function c = ncnancov(x,Rows,centered), which seems to provide the covariance matrix of x.
However:
1) it uses the formula for the population covariance, i.e. it calculates x'*x/n not x'*x/(n-1) - what is the rationale behind that?
2) it does not mean center x. This is surprising because without mean centering x the formula x'*x/n (or x'*x/(n-1) for that matter) does NOT provide the covariance matrix
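For reference, the gap between the uncentered product and the covariance matrix can be written out explicitly. This is only a sketch: the symbols $X$ (the n-by-p data matrix), $\mu$ (the column vector of column means), $X_c$ (the centered data) and $S$ are introduced here for illustration and do not appear in pca.m.

```latex
\begin{align*}
X_c &= X - \mathbf{1}\mu^{\top}, \qquad \mu = \tfrac{1}{n} X^{\top}\mathbf{1} \\
X^{\top}X &= X_c^{\top}X_c + n\,\mu\mu^{\top}
  \qquad \text{(cross terms vanish because } X_c^{\top}\mathbf{1} = 0\text{)} \\
\frac{X^{\top}X}{n-1} &= S + \frac{n}{n-1}\,\mu\mu^{\top},
  \qquad S = \frac{X_c^{\top}X_c}{n-1}
\end{align*}
```

So x'*x divided by n or by n-1 equals the covariance matrix only when all column means are zero.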
The second point causes the call [coeff,score,latent]=pca(D, 'Algorithm','eig','Centered','off') to produce different coeff and latent from the call [coeff,score,latent]=pca(D, 'Algorithm','eig'). The scores will obviously be different, but coeff and latent should not be affected by mean centering, as can be shown by comparing the output of:
load('Data_Table8p1.mat');
Dm = D-mean(D);
[coeff,eigValues] = eig(cov(D));
[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort
coeff = coeff(:, idx);
score = D/coeff'; % get scores of the original (not mean centered) data
with:
[coeff_m,eigValues_m] = eig(cov(Dm));
[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort
coeff_m = coeff_m(:, idx);
score_m = Dm/coeff_m'; % get scores of mean centered data
Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!
Answers (1)
Divyam
on 18 Jul 2024
Hi Florian, the "pca" and "cov" functions both perform mean centering by default, as described in the pca and cov documentation pages linked at the top of this page.
The example in the question leads to the same coefficients because both "cov" calls center the data internally, so "coeff" and "coeff_m" are identical. To illustrate this, I wrote code that calculates the covariance without mean centering and ran it on your data; the coefficients are different in this scenario. The code is added below for your reference:
% Not using the "cov" function
[N,M] = size(D);
cov_matrix = (1/(N-1)) * (D' * D);
[coeffFinal, eigValuesFinal] = eig(cov_matrix);
[eigValuesFinal, idx] = sort(diag(eigValuesFinal), 'descend');
coeffFinal = coeffFinal(:, idx);
Here is the output of the code: [output image not reproduced]
4 Comments
Divyam
on 22 Jul 2024
Hi @Florian Meirer, the data set used for PCA is very small and sparse (as is evident in your plot), so using the population covariance matrix is not helpful here. You are correct to use a sample covariance matrix. For this specific case, running "pca" with mean centering will unequivocally lead to correct results. In the code you will find that when mean centering is turned on, the sample covariance matrix is used to compute the results, which is exactly what you are doing in your non-"pca" code.
% In "ncnancov"
% Line 542
d = d + centered; % Here d becomes 1 when mean centering is on
% Line 551
c = x'*x/(n-d) % This becomes the result of sample covariance matrix
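A minimal sketch of what those two lines imply, using base MATLAB only. The matrix D below is synthetic stand-in data (the original Data_Table8p1.mat is not available here), and the variable names are illustrative rather than taken from pca.m. It assumes, as the quoted snippet suggests, that the divisor is n-1 when centering is on and n when it is off.

```matlab
% Synthetic stand-in for D with clearly nonzero column means (assumption:
% any n-by-p numeric matrix behaves the same way).
rng(0);                              % reproducible random numbers
D  = randn(50, 3) + [5 -2 1];        % implicit expansion, R2016b or later
n  = size(D, 1);
mu = mean(D, 1);                     % 1-by-p row vector of column means
Dm = D - mu;                         % mean-centered data

% Centering on: centered data, divisor n-1 (sample covariance).
S_centered = (Dm' * Dm) / (n - 1);

% Centering off: raw data, divisor n (an uncentered second-moment matrix).
M_uncentered = (D' * D) / n;

% cov() centers internally, so it matches S_centered for both D and Dm.
err_cov  = norm(S_centered - cov(D),  'fro');
err_covm = norm(cov(D)     - cov(Dm), 'fro');

% The two matrices differ by the outer product of the means:
% D'*D = Dm'*Dm + n*(mu'*mu)
err_identity = norm(D'*D - (Dm'*Dm + n*(mu'*mu)), 'fro');

% Eigen-decompositions of the two matrices generally disagree.
latent_centered   = sort(eig(S_centered),   'descend');
latent_uncentered = sort(eig(M_uncentered), 'descend');

fprintf('cov(D) vs centered formula:  %.2e\n', err_cov);
fprintf('cov(D) vs cov(Dm):           %.2e\n', err_covm);
fprintf('outer-product identity:      %.2e\n', err_identity);
disp([latent_centered, latent_uncentered]);
```

The first three printed values should be at rounding-error level, while the two eigenvalue columns differ because the uncentered matrix also carries the contribution of the column means.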