Regression with tall array (Using datastore, CSV) - Error

Hi

5 Comments

Do you mean?
result = fitglm(x, y, 'Distribution', 'binomial', 'Link', 'logit');
because you have an extra ) there (though I'm sure the error nags about something else).
Can you confirm you have tall arrays (for x and y)?
istall(x)
ans =
logical
1
Also, are you trying to set the fromula? because error says so, but your call to fitglm doesn't show this.
Yes, your fitglm-line is the one I have, the ) was a copy-paste error.
And yes, x and y are both tall arrays.
No, I am not calling a special formula.
can you share the output of your dependent/independent variables?
x
y
x is a 1000x500 (tall) table. This are the first entries:
7 6 12 12 15 13 12 30 71 6
3 4 4 0 0 1 10 2 6 1
1 0 0 0 0 0 2 0 0 0
1 0 4 0 0 0 0 0 4 0
6 3 5 2 0 0 10 0 3 0
3 26 10 3 0 2 15 7 24 1
17 85 5 4 0 0 29 0 6 0
1 0 1 0 0 2 1 0 0 0
2 0 3 0 0 0 9 0 4 0
5 18 11 2 0 1 6 0 3 0
3 1 0 0 0 2 4 0 0 0
2 0 0 0 0 0 0 0 0 0
2 0 10 0 0 0 0 0 0 0
2 0 1 1 0 3 0 0 3 0
2 16 3 0 0 0 3 2 36 1
y is a 1000x1 (tall) table and the first entries are:
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
I just tried to see if it was tall arrays and fitglm
>> X=[1:1000].'; X=tall(X);
>> Y=randn(size(X)); % this is interesting sidelight on the way...
Error using randn
Size inputs must be numeric.
>> size(X)
ans =
1×2 tall double row vector
1000 1
>> Y=randn(1000,1); Y=tall(Y); % OK, have to brute-force it
>> fitglm(X,Y,'Distribution',"normal")
Iteration [1]: 0% completed
Iteration [1]: 50% completed
Iteration [1]: 100% completed
Iteration [2]: 0% completed
Iteration [2]: 50% completed
Iteration [2]: 100% completed
Iteration [3]: 0% completed
Iteration [3]: 100% completed
ans =
Compact generalized linear regression model:
y ~ 1 + x1
Distribution = Normal
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ________ _______
(Intercept) 0.0015036 0.064429 0.023338 0.98139
x1 1.6177e-05 0.00011151 0.14507 0.88468
1000 observations, 998 error degrees of freedom
Estimated Dispersion: 1.04
F-statistic vs. constant model: 0.021, p-value = 0.885
>>
So, fitglm will accept tall arrays; the syntax must be else where it would seem...

Sign in to comment.

 Accepted Answer

Ive J
Ive J on 13 Jul 2021
Edited: Ive J on 13 Jul 2021
Well, your data is tall table, and that's what MATLAB complains about: since your first argument is a table, MATLAB thinks y is modelspec. You have two options:
% 1-feed fitglm with matrix
mdl = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');
% 2-OR: merge x and y as a table
data = [x, y]; % last column is the dependent variable by default
mdl = fitglm(data, 'Link', 'logit', 'Distribution', 'binomial');
Btw, your data is fairly small and (I assume) fits within memory, tall arrays should be avoided for such small datasets.

2 Comments

Hi Ive,
I merged the x and y tables and converted the new table before building the tall array with:
ds = transform(ds,@table2array);
Now it works, Thanks for your help!
PS: the file here was was only a smaller sample. The "real" one is 320000x30000.
If I were you I would also test with arrays. Processing tables is almost always (based on my experience) slower than arrays.
Good luck!

Sign in to comment.

More Answers (0)

Categories

Asked:

on 12 Jul 2021

Edited:

on 1 Aug 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!