Somewhat lengthy question on distribution fitting

My apologies in advance for what I expect to be a lengthy background section leading up to my question.
I'm working on a set of decision analysis methodologies for making choices about alternatives that arrive asynchronously over a time period. One of the methods requires a cumulative distribution function (CDF) for one aspect of the alternatives. In exploring the performance of the methods, I tried several different ways of generating the CDF for some sample data:
  1. Using an empirically-derived CDF generated by MatLab that precisely fits the observed data
  2. Using an approximation via a triangular distribution (since those are easy when data is scarce)
  3. Using an approximation that uses a "standard" distribution to fit the observed data
My question regards that third method. I have some sample data, and in my first shot at this I used the Arena Input Analyzer tool against three data sets. For two of the three it suggested distributions that performed almost as well as the empirically-derived exact fit. These were
  1. 25 + exponential(261)
  2. 127 + exponential(1030)
For the third data set though, it suggested LogNormal(1.96, 3.23), which worked like a dog. Using that CDF literally performed worse than just flipping a coin at each decision point.
So I figured I'd use MatLab to fit distributions and see if I got better results. For the one that Arena missed badly on, given the exact same text file of input data, MatLab suggested LogNormal(0.0185, 1.1458). Note the significantly different parameters. This worked like a champ, again as good as the empirical one. So I figured I'd go on with MatLab to fit the other two data sets. What MatLab suggested was
  1. LogNormal(5.2912, 0.8789)
  2. LogNormal(6.7327, 0.8078)
And these two were dogs! My suspicion is that it has something to do with that "offset" that you see in the Arena suggested distribution. MatLab seems to be trying to only fit to a "straight" distribution with no offset term like that.
So here's my question: is there a way to get MatLab to identify an offset term in examining a data set for distribution fitting?
Ideally my final methodology will just involve running a fit (if you have historical data to fit to) which I think is a fairly low bar. If you first have to examine the data and determine an appropriate offset manually and then adjust all the data to account for it, I think it's of less use.
I hope this makes sense, and that I haven't bored you to sleep yet. Any help would be greatly appreciated.

4 Comments

It might help if you would post a histogram of the data you’re interested in, especially histograms of the various samples you want to distinguish.
Also, what do you mean by ‘offset’?
Second part first: when I say offset, I mean the constant added to the random variable. For example, one of the distributions Arena identified was 25 + exponential(261), so the "offset" would be 25. This ensures no random variable generated ever has a value less than 25.
As for histograms, I can take some screenshots and post them, but I'm looking for the "general" answer, not something specific to any one dataset.
Nonetheless, I agree with Star - screenshots would help us visualize, even if it's just for one example set of data.
Okay, here is a screenshot of the output I get from the Arena Input Analyzer
And using the MatLab Distribution Fitting Tool on the same data file
They choose significantly different binning. Arena estimates the distribution as 25 + expo(261) while MatLab returns LogNormal(5.2912, 0.8789)


Answers (3)

I haven't used those functions. I've never heard of the Arena Input Analyzer. What toolbox are they in? Please list it below your question. Is it the stats toolbox or curve fitting toolbox or something else?
What does the histogram of your actual data look like? Is it more like the bars in the top plot (like an exponential decay) or in the bottom plot (like a log-normal or Poisson)?
You say: "is there a way to get MatLab to identify an offset term in examining a data set for distribution fitting?" Can you subtract the mean and then try the approach described here: http://www.mathworks.com/matlabcentral/answers/94272-how-do-i-constrain-a-fitted-curve-through-specific-points-like-the-origin-in-matlab
By the way, for what it's worth, here's an interesting File Exchange submission that has dozens of distributions: http://www.mathworks.com/matlabcentral/fileexchange/7309-randraw

3 Comments

The Arena Input Analyzer is a utility that ships with Arena, which is a discrete event simulation tool. I only used it because a) I still have a copy of the academic version and b) I'm very familiar with it from my undergrad work. I'd prefer to stick strictly to MatLab, but at the moment the distributions it's suggesting for the data aren't doing much good (see above). Thanks for the tip on the random variate generator. I've also been using a File Exchange submission that fits all the distributions the Distribution Fitting Tool provides and returns a rank-ordered list of best fits.
Cool - what submission is that?


Your confusion arises from the fact that the parameters Matlab uses for a lognormal distribution are the mean (mu) and standard deviation (sigma) of the underlying normal distribution, i.e. of log(X). Arena instead reports the mean and standard deviation of the lognormal variable itself. If you want to use the Matlab values in Rockwell Arena, you'll first need to transform them into the mean and standard deviation of the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments). Then you'll see that the parameters found with the Input Analyzer tool in Arena closely resemble the parameter estimates you get from Matlab.
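As a rough check of that conversion, here is a minimal sketch using only base MATLAB. It applies the arithmetic-moment formulas from the Wikipedia page linked above to the parameters quoted in the question; the variable names are just for illustration.

```matlab
% Matlab-style lognormal parameters from the question:
% mean and standard deviation of log(X)
mu = 0.0185;
sigma = 1.1458;

% Arithmetic mean and standard deviation of X itself
m = exp(mu + sigma^2/2);
s = sqrt((exp(sigma^2) - 1) * exp(2*mu + sigma^2));

fprintf('lognormal mean = %.2f, std = %.2f\n', m, s);
```

This comes out at roughly 1.96 and 3.24, which lines up with the LogNormal(1.96, 3.23) that Arena reported for the same data file.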
You might be able to do a lot of what you want with the routines here: Cupid
To create a standard distribution with an offset, you would write something like this:
% Create exponential distribution with an offset of 100.
mydist=AddTrans(Exponential(.01),100);
Assuming your to-be-fitted data is in an array x, you could then get maximum likelihood estimates of the exponential rate and additive offset with:
mydist.EstML(x)
Actually, in the case of an exponential plus a constant, the MLE of the constant will always be the minimum value in the data set (perhaps minus a few eps to avoid numerical problems).
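If you only need the exponential-plus-constant case, that last observation can be sketched in base MATLAB without any toolbox. The data below are simulated, with the offset 25 and mean 261 borrowed from the Arena fit in the question purely as an assumption; note that this sketch parameterizes the exponential by its mean, whereas the Exponential(.01) call above appears to take a rate.

```matlab
% Simulated sample: offset 25 plus an exponential with mean 261
% (illustrative values only, not the asker's real data)
rng(0);
n = 1000;
x = 25 - 261 * log(rand(n, 1));   % inverse-transform sampling

offsetHat = min(x);               % MLE of the additive constant
meanHat = mean(x) - offsetHat;    % MLE of the exponential mean after shifting

fprintf('offset = %.1f, exponential mean = %.1f\n', offsetHat, meanHat);
```

Because the sample minimum can never fall below the true offset, offsetHat is slightly too large on average; that is one reason to back it off by a small amount before evaluating the fitted CDF at the smallest data points.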

Asked: 20 Apr 2015

Edited: 21 May 2018
