What are the units of the bandwidth parameter for mvksdensity?

I have been using the built-in mvksdensity function. For this function the user specifies a 'Bandwidth' as a smoothing parameter (higher bandwidth = more smoothing).
This MW staff answer to a previous question states that the bandwidth "is essentially a smoothing parameter and applied on a dimension of data, so it covers all the data points". This implies that the smoothing kernel encompasses all of the data. However, running mvksdensity on the same dataset with the same evaluation points while increasing the bandwidth leads to greater computation time, which suggests that something in the kernel changes other than just the smoothing parameter. For example:
x = rand(10000,2)*100; % random data
[a,b] = meshgrid(1:1:100,1:1:100); % evaluation points
pts = [a(:) b(:)];
tic
f1 = mvksdensity(x,pts,'Bandwidth',2); % small bandwidth
toc
Elapsed time is 0.890474 seconds.
tic
f2 = mvksdensity(x,pts,'Bandwidth',10); % larger bandwidth but same evaluation points and data
toc
Elapsed time is 2.034411 seconds.
My guess would be that, for speed, the kernel size depends on the bandwidth, and internally mvksdensity ignores data points that are too far from an evaluation point to contribute to the final value. This would also explain why mvksdensity is slower when using a user-defined kernel: in that case mvksdensity has to pass all of the distances to the user-defined kernel and thus actually covers all of the data points.
Does this make sense? Is there any documentation explaining this?

Answers (1)

Garmit Pant on 26 Sep 2023
Hello
I understand that you are trying to use the MATLAB function ‘mvksdensity’ to perform kernel smoothing density estimation and are trying to understand the effect of the ‘Bandwidth’ input argument on the computation time.
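The kernel density estimator function can be represented as:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
(Source: Wikipedia)
Here, h is the smoothing factor called bandwidth, K is the kernel function, n is the number of data points and x_i is the i-th data point.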
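To make the role of h concrete, here is a minimal sketch that evaluates a product Gaussian kernel estimate directly, without any toolbox. The helper name kde1 and the example values are assumptions for illustration, and this is not a claim about how mvksdensity is implemented internally. It also illustrates that h is expressed in the same units as the corresponding data column: rescaling the data, the evaluation point and the bandwidth by a common factor c leaves the kernel arguments unchanged and only divides the density by c^d.

```matlab
% Direct evaluation of a product Gaussian kernel density estimate at one point.
% kde1 is a hypothetical helper for illustration, not part of any toolbox.
% x: n-by-d data, p: 1-by-d evaluation point, h: 1-by-d bandwidths.
% Uses implicit expansion (MATLAB R2016b or later).
kde1 = @(x, p, h) sum(prod(exp(-0.5*((p - x)./h).^2) / sqrt(2*pi), 2)) ...
                  / (size(x,1) * prod(h));

x = rand(500,2) * 100;         % data in some unit, e.g. metres
p = [50 50];                   % evaluation point, same unit as the data
h = [2 10];                    % one bandwidth per column, same unit as the data

f_m = kde1(x, p, h);

% Express everything in a unit 100 times smaller (e.g. centimetres).
c = 100;
f_cm = kde1(x*c, p*c, h*c);

% The kernel arguments (p - x)./h are unchanged, so only the 1/prod(h)
% factor differs: the density shrinks by c^d with d = 2.
ratio = f_m / f_cm;
```

Here ratio should come out close to c^2, which is what you would expect if the bandwidth carries the units of the data and the density carries the inverse units per dimension.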
The ‘Bandwidth’ parameter accepts inputs of 2 forms:
  1. Scalar: A scalar bandwidth value applies to all dimensions. This means that the same value is used as ‘h’ in every column of the sample data.
  2. d-element vector: Here, d is the number of columns of the sample data. Each value in the vector corresponds to a particular column and is used as the value of ‘h’ for all data points in that column.
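As a quick illustration of the two forms, the sketch below calls mvksdensity (Statistics and Machine Learning Toolbox) with a scalar bandwidth, with the equivalent two-element vector, and with a different bandwidth per column. The data, grid and variable names are assumptions chosen for illustration.

```matlab
x = rand(1000,2) * 100;                 % 2-column sample data
[a,b] = meshgrid(10:10:90, 10:10:90);   % coarse grid of evaluation points
pts = [a(:) b(:)];

f_scalar = mvksdensity(x, pts, 'Bandwidth', 5);       % h = 5 in both columns
f_vector = mvksdensity(x, pts, 'Bandwidth', [5 5]);   % same bandwidths, written per column
f_mixed  = mvksdensity(x, pts, 'Bandwidth', [2 10]);  % h = 2 for column 1, h = 10 for column 2

% Largest difference between the scalar and the repeated-vector call;
% expected to be zero or at rounding level if the scalar is applied per column.
max_diff = max(abs(f_scalar - f_vector));
```

Because each element of the vector is tied to one column, it should be given in the units of that column.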
Bandwidth controls the smoothness of the kernel density estimate. The larger the bandwidth, the more the estimate resembles a single unimodal distribution, hiding any multimodal structure in the data. The value of the ‘Bandwidth’ parameter has no effect on which kernel function ‘mvksdensity’ uses. Kernel functions are specified using the ‘Kernel’ parameter, and each kernel function has a fixed definition that is independent of the bandwidth specified. As seen in the equation above, ‘h’ only rescales the input value passed to the kernel function.
The increase in computation time can be attributed to the change in the input values passed to the kernel function as ‘h’ increases with a larger ‘Bandwidth’ value. In my own tests, the computation time stopped increasing and levelled off for very large values of ‘Bandwidth’.
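One way to check this behaviour on your own machine is to time the same call over a range of bandwidths. This is only a sketch: the bandwidth values and the grid are arbitrary choices, and absolute timings will vary with hardware and MATLAB release.

```matlab
x = rand(10000,2) * 100;                 % same kind of random data as in the question
[a,b] = meshgrid(5:5:100, 5:5:100);      % coarser grid to keep the sweep short
pts = [a(:) b(:)];

bw = [1 2 5 10 20 50 100 1000];          % arbitrary bandwidths to compare
t  = zeros(size(bw));

for k = 1:numel(bw)
    tic
    f = mvksdensity(x, pts, 'Bandwidth', bw(k));
    t(k) = toc;
end

% Print each bandwidth next to its elapsed time in seconds.
fprintf('%8g  %8.4f s\n', [bw; t]);
```

A single tic/toc per bandwidth is noisy, so repeating the sweep a few times gives a more reliable picture of where the timings flatten out.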
For further understanding, you can refer to the following MATLAB Documentation:
  1. https://in.mathworks.com/help/stats/mvksdensity.html - Refer to the ‘Input Arguments’ section
I hope this helps!
Best Regards
Garmit
