# Regression Using Dataset Arrays

This example shows how to perform linear and stepwise regression analyses using dataset arrays.

`load imports-85`

### Store predictor and response variables in dataset array.

```
ds = dataset(X(:,7),X(:,8),X(:,9),X(:,15),'Varnames',...
    {'curb_weight','engine_size','bore','price'});
```

### Fit linear regression model.

Fit a linear regression model that explains the price of a car in terms of its curb weight, engine size, and bore.

`fitlm(ds,'price~curb_weight+engine_size+bore')`
```
ans = 

Linear regression model:
    price ~ 1 + curb_weight + engine_size + bore

Estimated Coefficients:
                    Estimate        SE        tStat       pValue  
                   __________    _________    _______    __________
    (Intercept)        64.095        3.703     17.309    2.0481e-41
    curb_weight    -0.0086681    0.0011025    -7.8623      2.42e-13
    engine_size     -0.015806     0.013255    -1.1925       0.23452
    bore              -2.6998       1.3489    -2.0015      0.046711

Number of observations: 201, Error degrees of freedom: 197
Root Mean Squared Error: 3.95
R-squared: 0.674,  Adjusted R-Squared: 0.669
F-statistic vs. constant model: 136, p-value = 1.14e-47
```

The command `fitlm(ds)` returns the same result because `fitlm`, by default, assumes that the response variable is in the last column of the dataset array `ds` and treats all other columns as predictors.
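One way to check this default is to fit the model both ways and compare the results. The following is a minimal sketch, assuming `imports-85` loads the matrix `X` as in the rest of this example; the names `mdlFormula`, `mdlDefault`, and `coefGap` are illustrative and not part of the original example.

```matlab
% Sketch: compare an explicit formula with the default response choice.
load imports-85   % provides the numeric matrix X used in this example
ds = dataset(X(:,7),X(:,8),X(:,9),X(:,15),'Varnames',...
    {'curb_weight','engine_size','bore','price'});

mdlFormula = fitlm(ds,'price~curb_weight+engine_size+bore');
mdlDefault = fitlm(ds);   % no formula: the last variable is the response

% Name of the response variable chosen by default
disp(mdlDefault.ResponseName)

% Largest absolute difference between the two sets of coefficient estimates
coefGap = max(abs(mdlFormula.Coefficients.Estimate - ...
                  mdlDefault.Coefficients.Estimate));
disp(coefGap)
```

If the two calls specify the same model, the response name should be `price` and the coefficient gap should be zero up to rounding.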

### Recreate dataset array and repeat analysis.

This time, put the response variable in the first column of the dataset array.

```
ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
    {'price','curb_weight','engine_size','bore'});
```

When the response variable is not in the last column of `ds`, you must specify its location. Otherwise `fitlm`, which by default treats the last column as the response, assumes that `bore` is the response variable. You can identify the response variable either by name or by a logical index:

`fitlm(ds,'ResponseVar','price');`

or

`fitlm(ds,'ResponseVar',logical([1 0 0 0]));`
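To see the effect of the `'ResponseVar'` argument, you can inspect the `ResponseName` property of each fitted model. This is a hedged sketch rather than part of the original example; the variable names `mdlByName`, `mdlByMask`, and `mdlNoHint` are illustrative.

```matlab
% Sketch: confirm which variable each call treats as the response.
load imports-85   % provides the numeric matrix X used in this example
ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
    {'price','curb_weight','engine_size','bore'});

mdlByName = fitlm(ds,'ResponseVar','price');               % response by name
mdlByMask = fitlm(ds,'ResponseVar',logical([1 0 0 0]));    % response by logical index
mdlNoHint = fitlm(ds);                                     % default: last column

disp(mdlByName.ResponseName)
disp(mdlByMask.ResponseName)
disp(mdlNoHint.ResponseName)
```

The first two models should report `price` as the response, while the model fit without `'ResponseVar'` should report `bore`.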

### Perform stepwise regression.

```
stepwiselm(ds,'quadratic','lower','price~1',...
    'ResponseVar','price')
```
```
1. Removing bore^2, FStat = 0.01282, pValue = 0.90997
2. Removing engine_size^2, FStat = 0.078043, pValue = 0.78027
3. Removing curb_weight:bore, FStat = 0.70558, pValue = 0.40195
```
```
ans = 

Linear regression model:
    price ~ 1 + curb_weight*engine_size + engine_size*bore + curb_weight^2

Estimated Coefficients:
                                Estimate         SE        tStat       pValue  
                               ___________    __________    _______    __________
    (Intercept)                     131.13        14.273     9.1873    6.2319e-17
    curb_weight                  -0.043315     0.0085114    -5.0891    8.4682e-07
    engine_size                   -0.17102       0.13844    -1.2354       0.21819
    bore                           -12.244         4.999    -2.4493      0.015202
    curb_weight:engine_size    -6.3411e-05    2.6577e-05     -2.386      0.017996
    engine_size:bore              0.092554      0.037263     2.4838      0.013847
    curb_weight^2               8.0836e-06    1.9983e-06     4.0451    7.5432e-05

Number of observations: 201, Error degrees of freedom: 194
Root Mean Squared Error: 3.59
R-squared: 0.735,  Adjusted R-Squared: 0.726
F-statistic vs. constant model: 89.5, p-value = 3.58e-53
```

The initial model is the full quadratic model (all linear, interaction, and squared terms), and the smallest model considered is the constant model, `price~1`. Because the search starts from the largest model, `stepwiselm` here works by backward elimination, removing one term at a time as listed in the display. The final model is `price ~ 1 + curb_weight*engine_size + engine_size*bore + curb_weight^2`, which corresponds to

$$P = \beta_{0} + \beta_{C}C + \beta_{E}E + \beta_{B}B + \beta_{CE}CE + \beta_{EB}EB + \beta_{C^{2}}C^{2} + \epsilon$$

where $P$ is price, $C$ is curb weight, $E$ is engine size, $B$ is bore, $\beta_{i}$ is the coefficient for the corresponding term in the model, and $\epsilon$ is the error term. The final model includes all three main effects, two interaction effects (curb weight with engine size, and engine size with bore), and the second-order term for curb weight.
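To connect the equation to the fitted object, you can evaluate the right-hand side by hand for one observation and compare it with the value returned by `predict`. The sketch below assumes that the stepwise search ends at the same final model as shown above and that the first row of `ds` has no missing values; the helper `coef` and the variables `pManual` and `pModel` are illustrative names, not part of the original example.

```matlab
% Sketch: evaluate the final model equation manually for the first observation.
load imports-85   % provides the numeric matrix X used in this example
ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
    {'price','curb_weight','engine_size','bore'});

% 'Verbose',0 suppresses the step-by-step display
mdl = stepwiselm(ds,'quadratic','lower','price~1',...
    'ResponseVar','price','Verbose',0);

% Look up each coefficient by its term name instead of relying on row order
names = mdl.CoefficientNames;
b     = mdl.Coefficients.Estimate;
coef  = @(name) b(strcmp(names,name));

% Predictor values for the first observation
C = ds.curb_weight(1);
E = ds.engine_size(1);
B = ds.bore(1);

% Right-hand side of the model equation without the error term
pManual = coef('(Intercept)') ...
    + coef('curb_weight')*C + coef('engine_size')*E + coef('bore')*B ...
    + coef('curb_weight:engine_size')*C*E ...
    + coef('engine_size:bore')*E*B ...
    + coef('curb_weight^2')*C^2;

% Fitted value computed by the model object for the same row
pModel = predict(mdl,ds(1,:));
disp([pManual pModel])
```

The two displayed values should agree up to floating-point rounding. Looking up coefficients by name keeps the calculation valid even if the terms appear in a different order.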