question about how the function pca() calculates the covariance matrix internally

2 views (last 30 days)
I was puzzled by the output of pca() when using mean centering or not. I am using Matlab 2024a.
pca.m uses the internal function c = ncnancov(x,Rows,centered) which seems to provide the covariance matrix of x
however,
1) it uses the formula for the population covariance, i.e. it calculates x'*x/n not x'*x/(n-1) - what is the rationale behind that?
2) it does not mean center x. This is surprising because without mean centering x the formula x'*x/n (or x'*x/(n-1) for that matter) does NOT provide the covariance matrix
The second point causes the call [coeff,score,latent]=pca(D, 'Algorithm','eig’,'Centered','off') to produce different coeff, and latent from the call [coeff,score,latent]=pca(D, 'Algorithm','eig’). The scores will obviosuly be different but coeff and latent should not be affected by mean centering as can be shown by comparing the output of:
load('Data_Table8p1.mat');
Dm = D-mean(D);
[coeff,eigValues] = eig(cov(D));
[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort
coeff = coeff(:, idx);
score = D/coeff'; % get scores of mean centered data
with:
[coeff_m,eigValues_m] = eig(cov(Dm));
[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort
coeff_m = coeff_m(:, idx);
score_m = Dm/coeff_m'; % get scores of mean centered data
Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!

Answers (1)

Divyam
Divyam on 18 Jul 2024
Hi Florian, the "pca" and the "cov" functions perform "mean centering" by default as mentioned here:
The example in the question leads to the same coefficients since both the "cov" calls return the same "coeff" and "coeff_m" as the data "D" is being mean centered by default. To illustrate this, I have written a code for calculating the covariance without mean centering and ran it on your data, the coefficients are different in this scenario. The code is added below for your reference:
% Not using the "cov" function
[N,M] = size(D);
cov_matrix = (1/(N-1)) * (D' * D);
[coeffFinal, eigValuesFinal] = eig(cov_matrix);
[eigValuesFinal, idx] = sort(diag(eigValuesFinal), 'descend');
coeffFinal = coeffFinal(:, idx);
Here is the output of the code:
  4 Comments
Florian Meirer
Florian Meirer on 19 Jul 2024
Edited: Florian Meirer on 19 Jul 2024
many thanks for your reply - this is really useful and interesting and you are fully correct that eig(X'*X/(N-1)) (or eig(X'*X/N)) provides an orthonormal basis set that can be used to express the data. However, this is not the correct way of doing PCA, because then it is not guaranteed that the eigenvectors describe the data set in terms of its variance, which is the whole point of doing PCA. I have pasted code below you can run with the data set I provided (use only 2 dimensions of the data set to see the 2D plot, but you can use any two dimensions). There you can see that using the eigenvalue decomposition of X'*X/N leads to completely incorrect results, that is, a rotation of the basis set that does not describe the data in terms of its largest, second largest, third largest, etc variance and in some cases it even changes PC1 into PC2 and vice versa. The eigenvalues used to determine the CVE are extremly different, suggesting, for example, that the data set would be very sufficiently described by only 1 PC while actually 2 are needed to achieve that level of CVE. That is, because these eigenvalues do not describe the data set in terms of its variance.
Please let me know what you think, I really appreciate this conversation!
To run the code:
testPCA(D(:,2:3));
or testPCA(D(:,[1 3])) or testPCA(D(:,1:2)) etc
function testPCA(D)
% use only 2D data for th 2D plot
% 1) Using the covariance matrix
[m,~]=size(D);
[eigVect,EV] = eig(cov(D));
[EV, idx] = sort(diag(EV), 'descend');
eigVect = eigVect(:, idx);
% 2) Using the way pca() determines the eigenvectors and eigenvalues:
[eigVect1,EV1] = eig(D'*D/m);
[EV1, idx] = sort(diag(EV1), 'descend');
eigVect1 = eigVect1(:, idx);
% plot the eigenvectors and the data:
mD = mean(D);
x_0 = mD(1,1);
y_0 = mD(1,2);
d = sqrt(EV);
d1 = sqrt(EV1);
figure();
subplot(1,2,1);
plot(D(:,1),D(:,2),'o');
hold on;
quiver(x_0,y_0,eigVect(1,1),eigVect(2,1),d(1),'k','LineWidth',5);
quiver(x_0,y_0,eigVect(1,2),eigVect(2,2),d(2),'r','LineWidth',5);
title('using eig(cov(D))');
hold off;
axis equal;
subplot(1,2,2);
plot(D(:,1),D(:,2),'o');
hold on;
quiver(x_0,y_0,eigVect1(1,1),eigVect1(2,1),d1(1),'k','LineWidth',5);
quiver(x_0,y_0,eigVect1(1,2),eigVect1(2,2),d1(2),'r','LineWidth',5);
title('using eig(D''*D/m)');
hold off;
axis equal;
Divyam
Divyam on 22 Jul 2024
Hi @Florian Meirer, the data used for PCA is very small and sparse (as evident in your plot) and thus using population covariance matrix is not helpful here. You are correct in using a sample covariance matrix. For this specific case, running "pca" with mean centering will unequivocably lead to correct results. In the code you will find that when you turn mean centering on, the sample covariance matrix is used to compute the results, which is exactly what you are doing in your non 'pca' code.
% In "ncnancov"
% Line 542
d = d + centered; % Here d becomes 1 when mean centering is on
% Line 551
c = x'*x/(n-d) % This becomes the result of sample covariance matrix

Sign in to comment.

Categories

Find more on Dimensionality Reduction and Feature Extraction in Help Center and File Exchange

Products


Release

R2024a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!