Vectorizing a repeated regression

3 views (last 30 days)
Jelle
Jelle on 29 Jan 2018
Answered: Jelle on 31 Jan 2018
Hi All,
My apologies. Yet another question on how to vectorize a calculation. What I am trying to do, is recreate a script I wrote 15 years (!) ago in IDL, in Matlab, full description of the why and how here https://doi.org/10.1080/01431160500497119.
In short: I am bootstrapping a stepwise regression model, selecting the best predictor on statistics within the bootstrap. The current script results in the sort of output I am looking for. However, running it takes over an hour for just one response factor & one dataset. Taking that I have 10 response factors, and several datasets in this one study alone, I need this to become a lot quicker so it can become part of my standard data processing.
I know vectorizing is the way to go, but being a newbe, I cannot get my head around how to handle this. Someone willing to lend me a hand? I think that the actual regression is the one to be optimized, vectorizing for the "band" loop (Is that how you would refer to it?):
model = fitglm(predictors(train, [selectedbands,band]), response(train));
stats(thispredictor,band) = model.Rsquared.Ordinary;
I am not sure whether it is possible though: There are about 2000 wavebands in there. In other words: for each bootstrap iteration, 2k regressions are run. And I am running 10k iterations for each predictor.
The process it goes through (Full function in the attached file):
% Create the regression model one predictor at a time,
% as I want to bootstrap the sensitivity of each predictor
for thispredictor = 1:maxpredictors
% Bootstrap loop
for iteration = 1:iterations
% create the statistics for each iteration
stats = zeros(maxpredictors,numbands);
outdata = zeros(numbands);
% create a subset for training and testing
train = datasample(datarange, numsamples);
% Correlation at each waveband
for band = 1:numbands
% Set the dataset
predictorset = predictors(train, [selectedbands,band]);
responseset = response(train);
model = fitglm(predictors(train, [selectedbands,band]), response(train));
stats(thispredictor,band) = model.Rsquared.Ordinary;
end %band loop
% Select the best band for the model
[maxrsq,maxband] = max(stats(thispredictor,:));
bandcount(thispredictor,maxband) = bandcount(thispredictor,maxband)+1;
end %iterations loop
[frequency, peakband] = max(bandcount(thispredictor,:));
selectedbands = [selectedbands,peakband];
end %predictors loop

Accepted Answer

Jelle
Jelle on 31 Jan 2018
Continued to crunch on this and I have found a way to make this work using the / operator. To avoid having someone help me even though I solved it, the below code needs some cleaning but this is already 200 times as fast as the original, and by the looks of things giving me the correct output (Still testing):
% Set the dataset
predictorset = [predictors(train, [selectedbands,band]) ones(numsamples,1)];
responseset = response(train);
modelsolution = responseset'/predictorset';
predictedy = modelsolution * predictorset';
RSQ = 1 - sum ( (responseset - predictedy').^2)/sum((responseset - mean (responseset)).^2);
stats(thispredictor,band) = RSQ;

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!