
How to use chi2gof within CUPID

Sim on 22 Jun 2023
Commented: Sim on 26 Jun 2023
[The same question on the CUPID GitHub]
Here are two examples of using MATLAB's chi-square goodness-of-fit test function (chi2gof):
First (comparing two frequency distributions):
Population = [996, 749, 370, 53, 9, 3, 1, 0];
Sample = [647, 486, 100, 22, 0, 0, 0, 0];
Population2 = [996, 749, 370, sum(Population(4:8))];
Sample2 = [647, 486, 100, sum(Sample(4:8))];
x = [];
for i = 1:length(Sample2)
x = [x,i*ones(1,Sample2(i))];
end
edges = .5+(0:length(Sample2));
[h,p,k] = chi2gof(x,'Expected',Population2,'Edges',edges)
Second (fitting a distribution to data):
bins = 0:5;
obsCounts = [6 16 10 12 4 2];
n = sum(obsCounts);
pd = fitdist(bins','Poisson','Frequency',obsCounts');
expCounts = n * pdf(pd,bins);
[h,p,st] = chi2gof(bins,'Ctrs',bins,...
'Frequency',obsCounts, ...
'Expected',expCounts,...
'NParams',1)
But how can I use the chi2gof function within CUPID?
Below is an example where I would like to apply MATLAB's chi2gof function to a distribution fitted with CUPID:
addpath('.../Cupid-master')
% (1) create a "truncated dataset"
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution (in this case the "Weibull2") to the "truncated dataset"
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% (3) estimate the Weibull parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (through the histogram) and the "fitting distribution"
% (in this case the "Weibull2" with Weibull's parameters estimated by maximum likelihood)
figure
xgrid = linspace(0,100,1000)';
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])

Accepted Answer

Jeff Miller on 23 Jun 2023
Yes, that is correct. The successive bin probabilities are the differences of the successive CDF values, and the expected number is the total N times the bin probability--just as you have computed it.
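The rule stated in this answer can be sketched with Statistics and Machine Learning Toolbox functions alone. This is an illustration under assumptions, not code from the thread: it uses the known truncated exponential (via truncate and cdf) in place of a CUPID fit, the seed and variable names are arbitrary, and it passes the bin edges to chi2gof explicitly so that the expected counts line up with the bins being tested.

```matlab
% Sketch: expected bin counts from successive CDF differences.
% Assumption: the known truncated distribution stands in for a CUPID fit.
rng(1);                                    % arbitrary seed, for reproducibility
pd   = makedist('Exponential','mu',1);
t    = truncate(pd,2,inf);                 % lower truncation at 2
data = random(t,10000,1);

num_bins  = 50;
bin_edges = linspace(min(data),max(data),num_bins+1);

% P(bin k) = F(edge_{k+1}) - F(edge_k); expected count = N * P(bin k)
bin_probs       = diff(cdf(t,bin_edges));
expected_values = numel(data) * bin_probs;

% Give chi2gof the same edges that were used to build the expected counts
[h,p,st] = chi2gof(data,'Edges',bin_edges,'Expected',expected_values);
```

With a CUPID object, the only change should be to replace cdf(t,bin_edges) with fittedDist.CDF(bin_edges), as in the examples elsewhere in this thread.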
  2 Comments
Sim on 23 Jun 2023
Thanks a lot @Jeff Miller, very kind!! :-)
Sim on 26 Jun 2023
To future readers:
I accepted @Jeff Miller's answer
"Yes, that is correct. The successive bin probabilities are the differences of the successive CDF values, and the expected number is the total N times the bin probability--just as you have computed it."
since it confirms what I showed in my Answer (please see my two examples called "Test 1" and "Test 2"):
"I might have found a solution that makes sense to me and gives me what I would expect, even though I am not 100% sure it is correct... maybe, experts of CUPID and chi2gof might tell me if this is correct.... Test 1.... Test 2....."


More Answers (1)

Sim on 22 Jun 2023
Edited: Sim on 22 Jun 2023
I might have found a solution that makes sense to me and gives the results I would expect, although I am not 100% sure it is correct. Perhaps experts in CUPID and chi2gof can tell me whether it is:
Test 1: I generate an artificial dataset following a distribution (A) and fit those data with the same distribution (A)
% (1) create a "truncated dataset"
pd = makedist('Exponential','mu',1); % <-- dataset following a distribution (A)
whereToTruncate = 2;
t = truncate(pd,whereToTruncate,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution to the "truncated dataset"
fittedDist = TruncatedXlow(Exponential(1),whereToTruncate); % <-- fitting distribution (A)
% (3) estimate the distribution parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (through the histogram) and the "fitting distribution"
figure
xgrid = linspace(0,10,1000)';
num_bins = 50;
hold on
histogram(data_trunc,num_bins,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 7])
% (5) calculate the Chi-square goodness-of-fit test (chi2gof)
bin_edges = linspace(min(data_trunc), max(data_trunc), num_bins+1);
expected_values = numel(data_trunc) * diff(fittedDist.CDF(bin_edges));
[h,p,st] = chi2gof(data_trunc, 'Expected', expected_values)
% Output Test 1
h =
0
p =
0.55248
st =
struct with fields:
chi2stat: 21.469
df: 23
edges: [2.0001 2.2661 2.5321 2.7982 3.0642 3.3302 3.5963 3.8623 4.1283 4.3944 4.6604 4.9264 5.1925 5.4585 5.7245 5.9906 ]
O: [2368 1798 1344 1107 810 594 442 333 294 212 165 116 113 68 53 37 33 28 15 15 18 11 5 21]
E: [2348.7 1797.1 1375 1052 804.95 615.89 471.24 360.56 275.87 211.08 161.5 123.57 94.548 72.341 55.351 42.35 32.404 ]
Test 2: I generate an artificial dataset following a distribution (A) and fit those data with a different distribution (B)
% (1) create a "truncated dataset"
pd = makedist('Exponential','mu',1); % <-- dataset following a distribution (A)
whereToTruncate = 2;
t = truncate(pd,whereToTruncate,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution to the "truncated dataset"
fittedDist = TruncatedXlow(Normal(0,1),whereToTruncate); % <-- fitting distribution (B)
% (3) estimate the distribution parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (through the histogram) and the "fitting distribution"
figure
xgrid = linspace(0,10,1000)';
num_bins = 50;
hold on
histogram(data_trunc,num_bins,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 7])
% (5) calculate the Chi-square goodness-of-fit test (chi2gof)
bin_edges = linspace(min(data_trunc), max(data_trunc), num_bins+1);
expected_values = numel(data_trunc) * diff(fittedDist.CDF(bin_edges));
[h,p,st] = chi2gof(data_trunc, 'Expected', expected_values)
% Output Test 2
h =
1
p =
6.4417e-116
st =
struct with fields:
chi2stat: 628.59
df: 26
edges: [2.0001 2.1895 2.3789 2.5682 2.7576 2.947 3.1364 3.3258 3.5152 3.7046 3.8939 4.0833 4.2727 4.4621 4.6515 4.8409 ]
O: [1742 1409 1198 959 798 699 561 463 391 295 266 205 162 135 114 102 86 73 56 51 39 30 22 18 16 20 90]
E: [1386.2 1248.4 1114.2 985.49 863.77 750.27 645.8 550.88 465.67 390.1 323.84 266.42 217.2 175.48 140.5 111.47 87.65 ]
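A note on the degrees of freedom in the two outputs above. The O vectors have 24 entries (Test 1) and 27 entries (Test 2), fewer than the 50 requested bins, which is consistent with chi2gof pooling bins whose expected count is below its 'EMin' threshold (5 by default). The reported df values (23 and 26) equal the pooled bin count minus 1, so no correction was applied for the parameters estimated by EstML. As far as I know, chi2gof computes df as bins - 1 - NParams and uses NParams = 0 when 'Expected' is supplied, so passing 'NParams' (1 for the exponential, 2 for the normal) would lower df accordingly. The sketch below only redoes the p-value arithmetic for Test 1 from the reported statistic; the variable names are illustrative.

```matlab
% Sketch: effect of the NParams correction on the Test 1 p-value.
% chi2stat and the bin count are copied from the Test 1 output above.
chi2stat = 21.469;
n_pooled = 24;     % number of entries in st.O after pooling
n_params = 1;      % one estimated parameter (the exponential mean)

df_reported = n_pooled - 1;              % 23, as in st.df
df_adjusted = n_pooled - 1 - n_params;   % 22

% Upper-tail chi-square probabilities
p_reported = chi2cdf(chi2stat,df_reported,'upper');  % compare with reported p = 0.55248
p_adjusted = chi2cdf(chi2stat,df_adjusted,'upper');

fprintf('df = %d: p = %.5f\n',df_reported,p_reported);
fprintf('df = %d: p = %.5f\n',df_adjusted,p_adjusted);
```

In the tests themselves, the corresponding call would be chi2gof(data_trunc,'Edges',bin_edges,'Expected',expected_values,'NParams',1). For Test 2 the conclusion is unlikely to change, since the statistic is far in the tail for either df.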
