How to improve regression models for a dataset with too many variables?
Hi. I'm quite new to Machine Learning, and my problem is fitting a regression model (either linear or nonlinear).
My X data is spectrophotometric data with 117 observations and 15956 variables. My Y data is 117-by-1.
I have tried most of the models I could think of, including one-way PLS, N-way PLS, neural networks, regression trees, and bagging ensembles. However, while the PLS models underfit, the others overfit, and my RPD has never exceeded 2.0. I suspect this is because I have too many variables relative to the number of observations. Is there a way to improve the models without reducing the dimensionality (e.g., regressing on only the PCA coefficients)?
Thank you.
5 Comments
dpb
on 10 Aug 2022
Well, you came here asking for help -- can't help unless have some idea about what it is we're trying to help with...
With that many variables and so few observations, there's bound to be correlation just by random chance.
Accepted Answer
the cyclist
on 10 Aug 2022
@NC_, as I expect you realize, your question is not really a MATLAB question, but a generic machine learning question. You are asking how to handle the ML problem of "p >> n", where p is the number of features and n is the number of observations. It's common in certain domains (e.g., analyses that use gene expression as features).
There are lots of ways to approach the problem, and it doesn't make sense to try to bring all of them into this forum. I'm not really an ML expert myself, so I may not be able to give you the best pointers, but this page gives a good overview of the issue and has references for more info.
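Since you're already using PLS, one concrete thing worth checking is whether the number of components was chosen by cross-validation rather than fixed by hand; with p >> n, too many components overfits and too few underfits. A minimal sketch with `plsregress` from the Statistics and Machine Learning Toolbox, using small synthetic data as a stand-in for your 117-by-15956 matrix:

```matlab
% Pick the number of PLS components by 10-fold cross-validation.
% Synthetic stand-in data -- replace X and y with your own.
rng(0);                                        % reproducibility
n = 117;  p = 500;
X = randn(n, p);
y = X(:, 1:5) * ones(5, 1) + 0.5 * randn(n, 1);

maxComp = 20;
[~, ~, ~, ~, ~, ~, MSE] = plsregress(X, y, maxComp, 'CV', 10);
cvMSE = MSE(2, 2:end);                         % row 2 = response MSE; col 1 is the 0-component model
[~, bestNcomp] = min(cvMSE);                   % components minimizing CV error

% Refit with the selected number of components; BETA includes the intercept.
[~, ~, ~, ~, beta] = plsregress(X, y, bestNcomp);
yhat = [ones(n, 1), X] * beta;
```

The CV curve (plot `cvMSE` against component count) also makes the underfit/overfit trade-off visible directly.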
2 Comments
the cyclist
on 10 Aug 2022
I don't want to leave you with the impression that MATLAB doesn't have the tools to handle this type of problem. For example, you can read more at this documentation page about feature selection. But you've stated that you don't want to reduce the dimensionality.
However, given that the biggest risk in p >> n problems is overfitting, you almost always have to do something akin to feature reduction (or at least regularization). So I continue to think that you don't quite have a MATLAB question yet. Especially since you are working on a research problem, you need to get your arms wrapped around the theory of what you are doing first.
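To make the regularization suggestion concrete: lasso (L1-penalized) regression shrinks coefficients and performs implicit feature selection at the same time, so the model is regularized without you having to pre-select variables. A hedged sketch with MATLAB's `lasso` function, again on synthetic stand-in data:

```matlab
% Cross-validated lasso: regularization plus implicit feature selection.
% Synthetic stand-in data -- substitute your spectra for X and response for y.
rng(0);
n = 117;  p = 500;
X = randn(n, p);
y = X(:, 1:5) * ones(5, 1) + 0.5 * randn(n, 1);

[B, FitInfo] = lasso(X, y, 'CV', 10);          % 10-fold CV over the lambda path
idx = FitInfo.Index1SE;                        % sparsest model within 1 SE of the minimum CV MSE
coef = B(:, idx);
intercept = FitInfo.Intercept(idx);
nSelected = nnz(coef);                         % how many variables survived the penalty
yhat = X * coef + intercept;
```

For a compromise between ridge and lasso (often better when predictors are highly correlated, as spectral bands are), the same call accepts an `'Alpha'` value between 0 and 1 for elastic net.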