Find the K most orthogonal vectors in a set of vectors

3 views (last 30 days)
Hello All,
The context of this particular search is a step in tuning a spectral clustering routine a la Ng et al 2002. The purpose of this search is to give a initialization point for k-means clustering in higher dimensional space. In particular, well separated data ought to sit in K tight clusters on the surface of a hypersphere, so the purpose of this search is to find the locations of these cluster centroids with which to initialize k-means clustering of the same data.
I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple times] K largest eigenvectors of the affinity matrix).
Ng et al suggest "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero. This was quoted as computationally cheap (perhaps it is for clustering fewer points into say O(10^1) clusters or perhaps they meant it in the sense that it requires fewer iterations of k-means clustering once initialized), but my CPU is dragging ass at it.
So far I've tried 2 approaches: Approach #1 - Compute everything then search
% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
Approach #2 - Simultaneous computation and search
%try a cheaper one?
centroidIdx = randi(length(Y)); %initialize on a random row of Y
for k = 1:K-1
dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %compute inner product
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
[~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
centroidIdx(k+1) = im; %reassign
end
Neither of these approaches seems cheap to me. Anyone else take a stab at this before? Any suggestions?

Accepted Answer

Matt J
Matt J on 19 May 2016
Edited: Matt J on 19 May 2016
Your computation of dotProductY could be more efficient. The most vectorized may of computing it, I believe is
dotProductY=abs(Y*Y.');
I expect that would have been the main bottleneck.
  2 Comments
Matt J
Matt J on 19 May 2016
Edited: Matt J on 22 May 2016
This part
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
also looks like it could be incrementalized as follows
im=randi(size(Y,1));
centroidIdx(1)=im;
temp=dotProductY(:,im);
for k = 2:K
[~,im] = min(temp); %find next best centroid
centroidIdx(k) = im; %reassign
temp=temp+dotProductY(:,im); %update temp
temp(im) = inf; %dont pick the same row twice
end
Peter Cook
Peter Cook on 8 Jun 2016
Thanks for the help, I can't believe I had that boneheaded dotProductY computation in there. The algorithm runtime is still quite slow, but that, I am accepting, is to be expected for most clustering algorithms with this amount of data.

Sign in to comment.

More Answers (0)

Categories

Find more on Statistics and Machine Learning Toolbox in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!