Automatically fitting distribution to histogram

Hello
I have a histogram plot of one feature (machine learning). That means on the x-axis I have the values the feature can take and on the y-axis I have the number of occurrences.
Is it possible in Matlab to automatically fit a probability distribution to this histogram if I don't know which type of distribution it is (normal distribution or geometric distribution etc.)? That means Matlab should figure out which distribution it is and give me the optimal parameters.
The problem is that I have a lot of features and manually inspecting the features takes too much time.

 Accepted Answer

You can try all the distributions that fitdist() offers and see which one has the lowest MSE (mean squared error) or MAD (median absolute deviation) between the fitted curve and your histogram.
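As a rough sketch of that idea, the loop below fits a hand-picked list of candidate names with fitdist and scores each fit against the normalized histogram at the bin centers. The candidate list, the synthetic gamrnd data, and all variable names are illustrative assumptions; extend the list with whatever distribution names your release of fitdist accepts.

```matlab
% Sketch only: rank a few candidate distributions by how closely the
% fitted pdf follows the normalized histogram of one feature.
rng(0);                                   % reproducible synthetic data
x = gamrnd(2, 3, 1000, 1);                % placeholder for one feature (column vector)

% Hand-picked candidates (assumption: not an exhaustive list)
candidates = {'Normal', 'Lognormal', 'Gamma', 'Weibull', 'Exponential'};

% Empirical density: bar areas sum to 1, so heights are comparable to a pdf
[empPdf, edges] = histcounts(x, 'Normalization', 'pdf');
centers = (edges(1:end-1) + edges(2:end)) / 2;

mseErr = nan(size(candidates));
madErr = nan(size(candidates));
for k = 1:numel(candidates)
    try
        pd  = fitdist(x, candidates{k});  % maximum likelihood fit
        err = pdf(pd, centers) - empPdf;  % error at each bin center
        mseErr(k) = mean(err .^ 2);       % mean squared error
        madErr(k) = median(abs(err));     % median absolute error
    catch
        % Some distributions reject the data (e.g. nonpositive values);
        % leave the score as NaN and move on.
    end
end

[~, best] = min(madErr);                  % min skips NaN scores
bestName  = candidates{best};
bestFit   = fitdist(x, bestName);         % refit to get the parameters
disp(bestName);
disp(bestFit);
```

To run this over many features, the same loop could be wrapped in a function and called once per column. Note that the score depends on the binning, so it is worth checking that the winner stays the same with a different number of bins.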

7 Comments

Thanks for the input. Would you recommend MSE or MAD (I think there are also others)?
Second, how do you calculate MSE or MAD for a probability distribution?
Third, is there an option to automatically grab all distributions or do I have to manually specify them in fitdist?
I have found the following Matlab tool which does the job: http://blogs.mathworks.com/pick/2012/02/10/finding-the-best/
Looks awesome but unfortunately only supports parametric models. :(
Sepp on 29 Feb 2016
Edited: Sepp on 29 Feb 2016
I'm now a bit confused. I have read that I have to normalize my histogram so that I see empirical probabilities instead of the raw counts. Is this true if I want to try all possible parametric distributions and pick the best one? How can this normalization be done in Matlab?
If you're using HISTCOUNTS or HISTOGRAM, see the Normalization option.
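For illustration, here is a minimal sketch of that option with histcounts, next to the same normalization done by hand. The data and variable names are made up for the example.

```matlab
% Sketch: raw counts vs. the 'probability' and 'pdf' normalizations
rng(0);
x = randn(500, 1);                                  % placeholder data

[counts, edges] = histcounts(x);                    % raw counts per bin
prob = histcounts(x, edges, 'Normalization', 'probability');  % heights sum to 1
dens = histcounts(x, edges, 'Normalization', 'pdf');          % bar areas sum to 1

% The same quantities computed by hand from the raw counts
probManual = counts / sum(counts);
densManual = counts ./ (sum(counts) * diff(edges));

fprintf('sum of probabilities: %.4f\n', sum(prob));
fprintf('area under pdf bars:  %.4f\n', sum(dens .* diff(edges)));
```

histogram(x, 'Normalization', 'pdf') takes the same option if you want the plot rather than the numbers. The 'pdf' variant is the one whose heights can be compared directly with a fitted density.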
As far as MSE or MAD goes, my statistician and I prefer Median Absolute Deviation rather than Mean Squared Error, or RMSE, or Mean or Average Absolute Deviation. It seems to be more like what people would expect and is less affected by how large any single deviation is. With RMSE or AAD or especially MSE, a single really big outlier can throw the metric way, way off from what it would be if just that one point were ignored. The Median Absolute Deviation is rather well behaved and tolerant of outliers without making the metric go haywire.
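A small made-up example shows the effect. The residual values below are invented purely for illustration.

```matlab
% Sketch: how one outlier moves MSE versus the median absolute error
err    = [0.1 -0.2 0.15 -0.1 0.05];   % made-up residuals
errOut = [err 10];                    % same residuals plus one big outlier

mseClean = mean(err .^ 2);            % 0.017
mseOut   = mean(errOut .^ 2);         % about 16.68
madClean = median(abs(err));          % 0.1
madOut   = median(abs(errOut));       % 0.125

fprintf('MSE: %.4f -> %.4f\n', mseClean, mseOut);
fprintf('MAD: %.4f -> %.4f\n', madClean, madOut);
```

One extra point multiplies the MSE by roughly a thousand, while the median absolute error only moves from 0.1 to 0.125.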
Sepp on 29 Feb 2016
Edited: Sepp on 29 Feb 2016
Thanks a lot. Is normalization required before searching for the best-fitting distribution?
And a last small question: If I would like to create a histogram of the values (that is, the empirical distribution), how should I choose the bin width?
No, only if you want the area under the curve to be 1, like it would be for a regular probability density function.
The bin width depends on what kind of resolution you want in the x direction. You might not want to do so many bins that each bin has only 1 or 0 counts in it, but other than that, it's up to you.
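If you want a starting point rather than picking the width by eye, histcounts can choose the bins with a built-in rule or take a fixed width. A short sketch, with made-up data and an arbitrary width of 0.25:

```matlab
% Sketch: three ways to set the binning, plus a check for sparse bins
rng(1);
x = randn(1000, 1);                                   % placeholder data

[nAuto, edgesAuto] = histcounts(x);                   % default automatic rule
[nFD,   edgesFD]   = histcounts(x, 'BinMethod', 'fd');  % Freedman-Diaconis rule
[nW,    edgesW]    = histcounts(x, 'BinWidth', 0.25); % fixed width (arbitrary choice)

% Share of bins holding 0 or 1 counts; a large share suggests bins are too narrow
sparseShare = mean(nW <= 1);

fprintf('auto: %d bins, fd: %d bins, width 0.25: %d bins\n', ...
        numel(nAuto), numel(nFD), numel(nW));
fprintf('share of near-empty bins at width 0.25: %.2f\n', sparseShare);
```

The automatic rules are only heuristics, so it is still worth looking at a few features by eye before trusting one setting for all of them.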



Asked on 29 Feb 2016

Edited on 29 Feb 2016
