I'm trying to write a function that estimates a k-nearest-neighbours pdf in one dimension. I've gone through it several times already and can't figure out what is wrong. The plot shows that my 'pdf' is clearly not what it should be: there is a peak on top of one sample, and the region where the samples are densest is flat. Any advice and corrections are appreciated! Here is my code; the test data 't122' is a 1x10 vector, i.e. ten 1D samples:
x = [0.553766713954610,0.683388501459509,0.274115313899635,0.586217332036812,0.531876523985898,0.369231170369473,0.456640797769432,0.534262446653865,0.857839693972576,0.776943702988488];
d = size(x,1);
d2 = size(x,2);
% k samples inside the Parzen window
k = 3; % sqrt(N) is a good guess for optimal k
% plotting the samples and the estimated pdf
xAxis = linspace(0,1,100);
plot(xAxis,nnPdf(xAxis,x,k));
title('t122 on the real line with nn-estimated pdf');
hold on;
plot(x,0,'o','MarkerSize',25);
legend(sprintf('%d nearest neighbours pdf',k),'t122');
And here is the function:
% k nearest neighbours 1D pdf-estimator function nnPdf()
% inputs:
% x0 = interval for the pdf
% x = data for which the pdf is estimated
% k = number of samples in every Parzen window
% output:
% V = 1D-pdf estimated with k nearest neighbours
function V = nnPdf(x0,x,k)
    v = zeros(length(x0),size(x,2)); % for distances to all samples
    V = zeros(length(x0),1); % for distance needed to include k samples
    if k > size(x,2)
        disp('*Invalid value for k: not so many samples in the data.');
        return
    end
    standardize(x);
    for i = 1:length(x0)
        for j = 1:size(x,2)
            % distance from interval point to all samples
            v(i,j) = abs(x0(i)-x(j));
        end
        % sorted distances so v_ik is the distance for reaching to the
        % kth sample from the point x0_i
        sort(v,2);
        % window size V at point x0_i based on the distance (volume in 1D)
        V(i) = (k/size(x,2)) * 1/v(i,k);
    end
end
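For comparison, here is a minimal sketch of how the same estimator could be written, under some assumptions rather than as a definitive fix. MATLAB's sort returns a sorted copy and does not modify its argument, so the result has to be assigned; without that, v(i,k) is simply the distance to the k-th sample in storage order, which would explain a peak sitting on top of one sample. The sketch also assumes the usual 1D k-NN form p(x0) = k / (N * 2 * r_k), where 2 * r_k is the length of the window reaching the k-th nearest sample. The name nnPdfSketch is hypothetical, and the standardize call is left out because its return value was discarded and rescaling the data would no longer match the plotting axis.

```matlab
% Hypothetical variant of nnPdf; the function name is an assumption.
% x0 = points at which the pdf is evaluated
% x  = 1D samples (row or column vector)
% k  = number of nearest neighbours
function p = nnPdfSketch(x0, x, k)
    n = numel(x);
    if k > n
        error('nnPdfSketch:invalidK', ...
              'k = %d exceeds the number of samples (%d).', k, n);
    end
    % Pairwise distances: one row per evaluation point, one column per sample.
    % Implicit expansion needs R2016b or later; bsxfun works on older versions.
    dists = abs(x0(:) - x(:).');
    % sort returns a new array, so the sorted result must be assigned.
    dists = sort(dists, 2);
    % Distance from each evaluation point to its k-th nearest sample.
    rk = dists(:, k);
    % In 1D the window [x0 - rk, x0 + rk] has length 2*rk.
    p = k ./ (n * 2 * rk);
end
```

It can be dropped into the script above as plot(xAxis, nnPdfSketch(xAxis, x, k)). Note that a k-NN estimate is generally spiky and does not integrate to exactly one, and with k = 1 the distance rk can be zero at a sample point, so some roughness in the curve is expected even when the code is correct.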
And the outcome of my nnPdf function: