A distribution binning problem

Hi,
I have a problem in which I need to count the number of (probabilistic) occurrences falling in each non-uniform interval of a distribution.
While the final goal itself can apparently be achieved with histcounts, my problem is upstream: the population to be binned is not given as a sample, but is described by some known parameters of the (entire) population. Here is an example, using packages and their weights to illustrate the problem, which is more general. N.B. I do have the Statistics and Machine Learning Toolbox, but I'm not an expert statistician myself.
I have a set of N = 100 packages, and their total weight, W = 1000 kg. Let's say that we know how the weight of packages is distributed (about the mean), and that the variance is also a known, exogenous parameter. The minimum and maximum weights of the packages in the lot are also known. To recap:
Number of packages N = 100;
Total weight W = 1000 kg;
minimum weight of package wmin = 2 kg;
maximum weight of package wmax = 20 kg;
mean weight = mu = W/N = 1000 kg/100 = 10 kg
variance = sigmacap = 4 (exogenously determined)
distribution of weights about the mean = N(mu,sigmacap) in case of normal distribution
With the above input, how should I proceed to get a (probabilistic) count of how many packages will fall in unequally spaced weight intervals of the type 2-5, 5-10, 10-12, 12-16 and 16-20 kilograms?
Thank you very much for any help or lead you can offer.
Daniele

5 Comments

This is not a question about MATLAB, but a question about statistics. You don't need to be an expert, but you do need to understand what the CDF of a normal distribution tells you, and how to use it. (I said a normal distribution, because you explicitly stated normality. If you did not know the distribution, then nothing can be done along these lines anyway.)
So, what does the Normal CDF tell you? And how does it help you? It is your homework, not mine. I'll even add a bit. If you take the difference between the normal CDF at two points, what would that tell you?
Dear John,
Thanks for your answer. I've been a Matlab user for a long time (too long!), and I know your name as one of the authors of some of the most useful user-contributed snippets ever written. I must still have some of them in my archive.
As for the case in point, I don't have a specific background in statistics: only shreds of memories from 20+ years ago. Rather than going back to the books, for which I have little time at present, I was wondering if the statistics toolbox could offer some quick-and-dirty way to get the task done, and if someone fresher on stats than I am, who would know how to go about this with a three-liner of code, would be kind enough to share.
If you can help, that would be great.
Thanks
Daniele
Let me be more clear, without totally doing your homework. What would this
normcdf(2,mu,sqrt(sigmacap)) % normcdf expects the standard deviation, so take the square root of the variance
tell you? Then ask: what does this mean?
normcdf(5,mu,sqrt(sigmacap))
What are those calls doing? (Hint: each of them can be interpreted as a probability, although more explicitly, they are areas under a Normal PDF. But what would they mean in your problem?)
Now, what would the difference between those results imply? (Hint: it could also be interpreted as a probability.)
Now, you have N such packages. If you multiplied the above difference by N, what would that mean?
You are asking what fraction of events in different categories happen out of a total of N events. If you know the probability of that event arising, and you know the sample size, then what is the expected number of such events?
I'm sorry if I am not giving you explicit code to compute what you are asking (really, I almost did that if you look at what I wrote in this comment), but these are very basic questions about probability. If you are unable to answer basic questions about probability, then you really do need to crack those notes on probability, or maybe a simple book. You won't be able to handle the harder stuff when it comes up, if the most basic stuff is tossing you a curve.
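For reference, the hints in this comment can be collected into a single formula. This is only a sketch in the question's symbols: $\Phi$ denotes the standard normal CDF, $\sigma = \sqrt{\texttt{sigmacap}}$ is the standard deviation, $[a, b]$ is one weight bin, and the cut-offs at wmin and wmax are ignored here.

```latex
\[
  P(a \le w \le b) = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right),
  \qquad
  \mathbb{E}\big[\text{packages in } [a,b]\big] = N \, P(a \le w \le b)
\]
```

If the distribution is restricted to $[w_{\min}, w_{\max}]$, each bin probability is additionally divided by $\Phi\!\left(\frac{w_{\max}-\mu}{\sigma}\right) - \Phi\!\left(\frac{w_{\min}-\mu}{\sigma}\right)$, so that the bin probabilities add up to 1.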
Now, this is a lot more cut down to a size I think I can handle, rather than going through half the probability theory just to have one isolated problem solved. I'll take it up as a challenge, and I'll put my head into it starting tonight, after work. Whatever knowledge of stats I had, it has been left to rust for far too many years.
Thanks for your time and good leads; I will report back once I'm through.
Kind regards
Daniele
John,
thanks to your clues I managed to put down the code I needed to answer my question. It was, after all, a good idea to ask. I'm putting it in a separate answer below for anyone interested.
If you have any further observation, it is of course welcome.
Thanks again.
All the best
Daniele


 Accepted Answer

Inspired and encouraged by John D'Errico's advice above, I post below the code that solves the distribution problem I submitted.
N = 100; % <- number of packages
W = 1000; % <- total weight in kg
mu = W/N; % <- average weight
sigmacap = 4; % <- variance (exogenously determined)
wmin = 2; % <- minimum weight of package in kg
wmax = 20; % <- maximum weight of package in kg
interValues = [wmin 5 10 12 16 wmax]; % edges of the weight bins
pd = makedist('Normal',mu,sqrt(sigmacap)); % <- create a normal distribution with the given parameters;
pdt = truncate(pd,wmin,wmax); % <- truncate the distribution to exclude packages < 2 kg or > 20 kg;
% each element of packCount is the expected number of packages in the weight range (bin) [interValues(i), interValues(i+1)];
packCount = NaN(1,numel(interValues)-1);
for i = 1:numel(packCount)
    packCount(i) = round(diff([cdf(pdt,interValues(i)), cdf(pdt,interValues(i+1))])*N);
end
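
The same computation can also be written without the loop. What follows is a minimal sketch rather than the poster's own code: it relies on cdf accepting a vector of evaluation points for a probability distribution object, it uses the name-value form of makedist, and the names edges, binProb, expectedCount and roundedCount are introduced here purely for illustration. It keeps the unrounded expected counts next to the rounded ones, because rounding each bin independently is not guaranteed, in general, to add up to N.

```matlab
% Sketch: loop-free version of the expected bin counts (illustrative names)
N        = 100;                 % number of packages
W        = 1000;                % total weight in kg
mu       = W/N;                 % mean weight in kg
sigmacap = 4;                   % variance (exogenously determined)
edges    = [2 5 10 12 16 20];   % bin edges in kg; first and last are wmin and wmax

% Normal distribution parameterised by its standard deviation, not its variance
pd  = makedist('Normal','mu',mu,'sigma',sqrt(sigmacap));
% Restrict the distribution to [wmin, wmax]; truncate renormalises the CDF
pdt = truncate(pd, edges(1), edges(end));

binProb       = diff(cdf(pdt, edges));   % probability mass of each bin
expectedCount = N * binProb;             % expected (fractional) packages per bin
roundedCount  = round(expectedCount);    % integer counts; may not sum to N in general
```

With the parameters of the question, the expected counts come out at roughly 0.6, 49.4, 34.1, 15.7 and 0.1 packages, which round to 1, 49, 34, 16 and 0.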
