Code
library(tidyverse)
= "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data.csv"
urlfile <-read_csv(url(urlfile)) mf
Stepwise regression analysis is a statistical method used to select a subset of predictor variables that best explain the variation in a response variable. It involves adding or removing variables from a multiple regression model based on their contribution to the model’s overall predictive power.
The two main types of stepwise regression are forward stepwise regression and backward stepwise regression. In forward stepwise regression, variables are added to the model one at a time, starting with the variable that has the strongest correlation with the response variable. In backward stepwise regression, variables are removed from the model one at a time, starting with the variable that has the weakest correlation with the response variable. Bidirectional Stepwise Regression is a combination of forward selection (for adding significant terms) and backward selection (for removing nonsignificant terms). As in forward selection, we start with only the intercept and add the most significant term to the model. We continue to add the most significant variables, one at a time. We use a p-value threshold to determine when to stop adding terms to the model.
In this exercise we will use following data set.
First, we create a dataframe with five predictors (‘DEM’, ‘Slope’, ‘MAT’, ‘MAP’,‘NDVI’) and create a training and test data set.
There are many functions and R packages for computing stepwise regression. These include:
The MASS package in R provides a function called stepAIC() that can be used to perform stepwise regression analysis with either forward selection, backward elimination, or both.
install.packages(“MASS”)
1st you have to fit a MLR model with all predictors:
Analysis of Variance Table
Response: SOC
Df Sum Sq Mean Sq F value Pr(>F)
DEM 1 329.6 329.61 22.686 2.559e-06 ***
Slope 1 1941.4 1941.39 133.621 < 2.2e-16 ***
MAT 1 1180.8 1180.77 81.270 < 2.2e-16 ***
MAP 1 1458.0 1458.03 100.352 < 2.2e-16 ***
NDVI 1 254.9 254.88 17.543 3.367e-05 ***
Residuals 461 6697.9 14.53
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stepAIC() function in MASS package, which choose the best model by Akaike Information Criterion (AIC). It has an option named direction, which can take the following values:
both - for stepwise regression, both forward and backward selection
backward- for backward selection
forward - for forward selection)
Call:
lm(formula = SOC ~ Slope + MAT + MAP + NDVI, data = df)
Residuals:
Min 1Q Median 3Q Max
-13.6203 -2.1059 -0.4424 1.3312 16.2115
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.088559 0.845323 1.288 0.19848
Slope 0.139063 0.050332 2.763 0.00596 **
MAT -0.284372 0.059988 -4.741 2.84e-06 ***
MAP 0.006439 0.001565 4.115 4.59e-05 ***
NDVI 8.903651 2.041594 4.361 1.60e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.81 on 462 degrees of freedom
Multiple R-squared: 0.4348, Adjusted R-squared: 0.4299
F-statistic: 88.84 on 4 and 462 DF, p-value: < 2.2e-16
we can explain the output with report package:
We fitted a linear model (estimated using OLS) to predict SOC with Slope, MAT,
MAP and NDVI (formula: SOC ~ Slope + MAT + MAP + NDVI). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.43,
F(4, 462) = 88.84, p < .001, adj. R2 = 0.43). The model's intercept,
corresponding to Slope = 0, MAT = 0, MAP = 0 and NDVI = 0, is at 1.09 (95% CI
[-0.57, 2.75], t(462) = 1.29, p = 0.198). Within this model:
- The effect of Slope is statistically significant and positive (beta = 0.14,
95% CI [0.04, 0.24], t(462) = 2.76, p = 0.006; Std. beta = 0.13, 95% CI [0.04,
0.22])
- The effect of MAT is statistically significant and negative (beta = -0.28,
95% CI [-0.40, -0.17], t(462) = -4.74, p < .001; Std. beta = -0.23, 95% CI
[-0.33, -0.14])
- The effect of MAP is statistically significant and positive (beta = 6.44e-03,
95% CI [3.36e-03, 9.51e-03], t(462) = 4.11, p < .001; Std. beta = 0.26, 95% CI
[0.14, 0.39])
- The effect of NDVI is statistically significant and positive (beta = 8.90,
95% CI [4.89, 12.92], t(462) = 4.36, p < .001; Std. beta = 0.29, 95% CI [0.16,
0.41])
Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
The leap package in R provides functions for performing stepwise regression analysis using various criteria for model selection, including AIC, BIC, and adjusted R-squared
regsubsets() from leaps package, which has the tuning parameter nvmax specifying the maximal number of predictors to incorporate in the model. It returns multiple models with different size up to nvmax. We need to compare the performance of the different models for choosing the best one. regsubsets() has the option method, which can take the values backward, forward and seqrep (seqrep = sequential replacement, combination of forward and backward selections).
install.packages(“leaps”)
Subset selection object
Call: regsubsets.formula(SOC ~ ., data = df, nvmax = 5, method = "seqrep")
5 Variables (and intercept)
Forced in Forced out
DEM FALSE FALSE
Slope FALSE FALSE
MAT FALSE FALSE
MAP FALSE FALSE
NDVI FALSE FALSE
1 subsets of each size up to 5
Selection Algorithm: 'sequential replacement'
DEM Slope MAT MAP NDVI
1 ( 1 ) " " " " " " " " "*"
2 ( 1 ) " " " " "*" " " "*"
3 ( 1 ) " " " " "*" "*" "*"
4 ( 1 ) " " "*" "*" "*" "*"
5 ( 1 ) "*" "*" "*" "*" "*"
[[1]]
(Intercept) NDVI
-1.643239 18.299763
[[2]]
(Intercept) MAT NDVI
1.7404457 -0.3019544 16.6916581
[[3]]
(Intercept) MAT MAP NDVI
2.312513577 -0.384202169 0.006789594 9.269312903
[[4]]
(Intercept) Slope MAT MAP NDVI
1.08855877 0.13906306 -0.28437213 0.00643859 8.90365110
[[5]]
(Intercept) DEM Slope MAT MAP
2.2088309462 -0.0003852617 0.1660844917 -0.3239458581 0.0061025986
NDVI
8.6687409551
The variance-covariance matrix (also known as the covariance matrix or the dispersion matrix) is a matrix that summarizes the variances and covariances of a set of variables. In statistical analysis, it is used to describe the relationships between two or more variables.
(Intercept) DEM Slope MAT
(Intercept) 3.2407929391 -8.685001e-04 3.859343e-02 -1.294394e-01
DEM -0.0008685001 2.986772e-07 -2.094859e-05 3.067986e-05
Slope 0.0385934317 -2.094859e-05 4.005389e-03 -3.312149e-04
MAT -0.1294394368 3.067986e-05 -3.312149e-04 6.753831e-03
MAP -0.0004959253 2.604798e-07 -2.467079e-05 -7.335917e-06
NDVI -1.3307235274 1.821160e-04 -1.944182e-02 5.648719e-02
MAP NDVI
(Intercept) -4.959253e-04 -1.330723527
DEM 2.604798e-07 0.000182116
Slope -2.467079e-05 -0.019441820
MAT -7.335917e-06 0.056487186
MAP 2.678305e-06 -0.002486248
NDVI -2.486248e-03 4.283693218
The R package caret (Classification And REgression Training) is a widely used package in machine learning for building predictive models. It provides a unified interface for performing data preparation, feature selection, model tuning, and model evaluation.
The train()* function caret** package provides an easy workflow to perform stepwise selections using the leaps and the MASS packages. It has an option named method, which can take the following values:
leapBackward, to fit linear regression with backward selection
leapForward, to fit linear regression with forward selection
leapSeq, to fit linear regression with stepwise selection
install.package(“caret”)
library(caret)
# Set seed for reproducibility
set.seed(123)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model.caret.leaps <- train(SOC ~., data = df,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train.control
)
model.caret.leaps$results
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 4.318065 0.2878492 3.054081 0.3522772 0.07295631 0.2556071
2 2 4.018065 0.3868724 2.864126 0.3538864 0.11208172 0.2817436
3 3 3.887540 0.4245143 2.768027 0.3838066 0.11380657 0.2566427
4 4 3.830983 0.4399608 2.716350 0.3703366 0.12380366 0.2705464
5 5 3.831526 0.4399150 2.717833 0.3792621 0.12552049 0.2801233
The function summary() reports the best set of variables for each model size, up to the best 4-variables model.
Subset selection object
5 Variables (and intercept)
Forced in Forced out
DEM FALSE FALSE
Slope FALSE FALSE
MAT FALSE FALSE
MAP FALSE FALSE
NDVI FALSE FALSE
1 subsets of each size up to 4
Selection Algorithm: backward
DEM Slope MAT MAP NDVI
1 ( 1 ) " " " " " " " " "*"
2 ( 1 ) " " " " "*" " " "*"
3 ( 1 ) " " " " "*" "*" "*"
4 ( 1 ) " " "*" "*" "*" "*"
An asterisk specifies that a given variable is included in the corresponding model. For example, it can be seen that the best 5-variables model contains Slope, MAT, MAP, NDVI.
Additionally, the caret package has method to compute stepwise regression using the MASS package (method = “lmStepAIC”):
parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 none 3.791739 0.4420481 2.721031 0.6487727 0.09320663 0.4010154
Call:
lm(formula = .outcome ~ Slope + MAT + MAP + NDVI, data = dat)
Coefficients:
(Intercept) Slope MAT MAP NDVI
1.088559 0.139063 -0.284372 0.006439 8.903651
Call:
lm(formula = .outcome ~ Slope + MAT + MAP + NDVI, data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.6203 -2.1059 -0.4424 1.3312 16.2115
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.088559 0.845323 1.288 0.19848
Slope 0.139063 0.050332 2.763 0.00596 **
MAT -0.284372 0.059988 -4.741 2.84e-06 ***
MAP 0.006439 0.001565 4.115 4.59e-05 ***
NDVI 8.903651 2.041594 4.361 1.60e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.81 on 462 degrees of freedom
Multiple R-squared: 0.4348, Adjusted R-squared: 0.4299
F-statistic: 88.84 on 4 and 462 DF, p-value: < 2.2e-16
We can compare the performance the MLR models with sub-set of predictors compare_performance() function.
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------------------------------------------------
..1 | lm | 2583.0 (0.321) | 2583.3 (0.315) | 2612.0 (0.056) | 0.435 | 0.429 | 3.787 | 3.812
..3 | lm | 2581.5 (0.679) | 2581.7 (0.685) | 2606.4 (0.944) | 0.435 | 0.430 | 3.789 | 3.810
Create a R-Markdown documents (name homework_07.rmd) in this project and do all Tasks using the data shown below.
Submit all codes and output as a HTML document (homework_07.html) before class of next week.
tidyverse, broom, report, performance, ggmisc, jtools, MASS, leap, caret
Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:
df<-read_csv(“bd_arsenic_data.csv”))
WAs: Groundwater As
WP: Groundwater P
WFe: Groundwater Fe
WEc: Groundwater Ec
WpH: Groundwater pH
SAoFe: Oxalate Extractable
SpH: Soil pH
SEc: Soil Ec
SOC: Soil Organic Carbon
SAND: Sand
SILT: Silt
CLAY: Clay
SP: Soil pH
ELEV: Elevation
Year_Irrig: Year of Irrigation
DIS_STW: Distance from STW
Use MASS and leap packages to select subset of optimum number of predictors and fit the models.
Compare the performance the MLR models with sub-set of predictors by both MASS and leap packages . Use full model as a base-line mode.
Use Stepwise regression with caret package, try both leapBackward and lmStepAIC
# Step-wise Regression Analysis {.unnumbered}
Stepwise regression analysis is a statistical method used to select a subset of predictor variables that best explain the variation in a response variable. It involves adding or removing variables from a multiple regression model based on their contribution to the model's overall predictive power.
The two main types of stepwise regression are forward stepwise regression and backward stepwise regression. In forward stepwise regression, variables are added to the model one at a time, starting with the variable that has the strongest correlation with the response variable. In backward stepwise regression, variables are removed from the model one at a time, starting with the variable that has the weakest correlation with the response variable. Bidirectional Stepwise Regression is a combination of forward selection (for adding significant terms) and backward selection (for removing nonsignificant terms). As in forward selection, we start with only the intercept and add the most significant term to the model. We continue to add the most significant variables, one at a time. We use a p-value threshold to determine when to stop adding terms to the model.
### Data
In this exercise we will use following data set.
[gp_soil_data.csv](https://www.dropbox.com/s/9ikm5yct36oflei/gp_soil_data.csv?dl=0)
```{r}
#| warning: false
#| error: false
library(tidyverse)
urlfile = "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data.csv"
mf<-read_csv(url(urlfile))
```
First, we create a dataframe with five predictors ('DEM', 'Slope', 'MAT', 'MAP','NDVI') and create a training and test data set.
```{r}
#| warning: false
#| error: false
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, MAT, MAP,NDVI)
```
There are many functions and R packages for computing stepwise regression. These include:
### Stepwise Regression with MASS Package
The MASS package in R provides a function called stepAIC() that can be used to perform stepwise regression analysis with either forward selection, backward elimination, or both.
> install.packages("MASS")
#### Full Model
1st you have to fit a MLR model with all predictors:
```{r }
#| warning: false
#| error: false
model.full <- lm(SOC ~., data = df)
anova(model.full)
```
#### Choose a model by AIC in a Stepwise Algorithm
**stepAIC()** function in **MASS package**, which choose the best model by **Akaike Information Criterion (AIC)**. It has an option named **direction**, which can take the following values:
**both** - for stepwise regression, both forward and backward selection
**backward**- for backward selection
**forward** - for forward selection)
```{r }
#| warning: false
#| error: false
# Stepwise regression model both forward and backward selection
model.MASS <- MASS::stepAIC(model.full, direction = "both",
trace = FALSE)
summary(model.MASS)
```
we can explain the output with report package:
```{r}
#| warning: false
#| error: false
library(report)
report::report(model.MASS)
```
### Stepwise Regression with leap Package
The leap package in R provides functions for performing stepwise regression analysis using various criteria for model selection, including AIC, BIC, and adjusted R-squared
#### Model selection by exhaustive search, forward or backward stepwise, or sequential replacement
**regsubsets()** from **leaps** package, which has the tuning parameter **nvmax** specifying the maximal number of predictors to incorporate in the model. It returns multiple models with different size up to **nvmax**. We need to compare the performance of the different models for choosing the best one. **regsubsets()** has the option method, which can take the values **backward**, **forward** and **seqrep** (seqrep = sequential replacement, combination of forward and backward selections).
> install.packages("leaps")
```{r }
#| warning: false
#| error: false
library(leaps)
model.leaps <- leaps::regsubsets(SOC~., data =df, nvmax = 5,
method = "seqrep")
summary(model.leaps)
```
#### Extract Model Coefficients
```{r }
#| warning: false
#| error: false
coef(model.leaps, 1:5)
```
#### Variance-Covariance Matrix
The variance-covariance matrix (also known as the covariance matrix or the dispersion matrix) is a matrix that summarizes the variances and covariances of a set of variables. In statistical analysis, it is used to describe the relationships between two or more variables.
```{r}
#| warning: false
#| error: false
vcov(model.leaps, 5)
```
### Stepwise regression with caret package
The R package [caret (Classification And REgression Training)](https://cran.r-project.org/web/packages/caret/vignettes/caret.html) is a widely used package in machine learning for building predictive models. It provides a unified interface for performing data preparation, feature selection, model tuning, and model evaluation.
The **train()\* function** caret\*\* package provides an easy workflow to perform stepwise selections using the **leaps** and the **MASS** packages. It has an option named method, which can take the following values:
**leapBackward**, to fit linear regression with backward selection
**leapForward**, to fit linear regression with forward selection
**leapSeq**, to fit linear regression with stepwise selection
> install.package("caret")
#### leapBackward
```{r}
#| warning: false
#| error: false
library(caret)
# Set seed for reproducibility
set.seed(123)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model.caret.leaps <- train(SOC ~., data = df,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train.control
)
model.caret.leaps$results
```
The function **summary()** reports the best set of variables for each model size, up to the best 4-variables model.
```{r }
#| warning: false
#| error: false
summary(model.caret.leaps$finalModel)
```
An asterisk specifies that a given variable is included in the corresponding model. For example, it can be seen that the best 5-variables model contains Slope, MAT, MAP, NDVI.
```{r}
#| warning: false
#| error: false
coef(model.caret.leaps$finalModel, 4)
```
#### lmStepAIC
Additionally, the caret package has method to compute stepwise regression using the MASS package **(method = "lmStepAIC")**:
```{r }
#| warning: false
#| error: false
# Train the model
model.caret.mass <- train(SOC ~., data = df,
method = "lmStepAIC",
trControl = train.control,
trace = FALSE
)
```
```{r}
#| warning: false
#| error: false
# Model accuracy
model.caret.mass$results
```
```{r}
#| warning: false
#| error: false
# Final model coefficients
model.caret.mass$finalModel
```
```{r}
#| warning: false
#| error: false
# Summary of the model
summary(model.caret.mass$finalModel)
```
### Compare the Models
We can compare the performance the MLR models with sub-set of predictors compare_performance() function.
```{r }
#| warning: false
#| error: false
performance::compare_performance(model.full, model.caret.leaps$finalModel, model.caret.mass$finalModel)
```
### Exercise
1. Create a R-Markdown documents (name homework_07.rmd) in this project and do all Tasks using the data shown below.
2. Submit all codes and output as a HTML document (homework_07.html) before class of next week.
#### Required R-Package
tidyverse, broom, report, performance, ggmisc, jtools, MASS, leap, caret
#### Data
1. [bd_arsenic_data.csv](https://www.dropbox.com/s/em5jqa78n3c6skf/bd_arsenic_data.csv?dl=0)
Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:
> df\<-read_csv("bd_arsenic_data.csv"))
#### Tasks
1. Create a full model MLR for predicting Soil As (SAs) with following variables:
- WAs: Groundwater As
- WP: Groundwater P
- WFe: Groundwater Fe
- WEc: Groundwater Ec
- WpH: Groundwater pH
- SAoFe: Oxalate Extractable
- SpH: Soil pH
- SEc: Soil Ec
- SOC: Soil Organic Carbon
- SAND: Sand
- SILT: Silt
- CLAY: Clay
- SP: Soil pH
- ELEV: Elevation
- Year_Irrig: Year of Irrigation
- DIS_STW: Distance from STW
2. Use MASS and leap packages to select subset of optimum number of predictors and fit the models.
3. Compare the performance the MLR models with sub-set of predictors by both MASS and leap packages . Use full model as a base-line mode.
4. Use Stepwise regression with caret package, try both leapBackward and lmStepAIC
### Further Reading
1. [Model Selection Essentials in R](http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/)
2. [Variable Selection in Multiple Regression](https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/variable-selection.html)
3. [Akaike Information Criterion](https://www.scribbr.com/statistics/akaike-information-criterion/)
4. [caret](https://cran.r-project.org/web/packages/caret/vignettes/caret.html)