Step-wise Regression Analysis

Stepwise regression analysis is a statistical method used to select a subset of predictor variables that best explain the variation in a response variable. It involves adding or removing variables from a multiple regression model based on their contribution to the model’s overall predictive power.

The two main types of stepwise regression are forward stepwise regression and backward stepwise regression. In forward stepwise regression, variables are added to the model one at a time, starting with the variable that has the strongest correlation with the response variable. In backward stepwise regression, variables are removed from the model one at a time, starting with the variable that has the weakest correlation with the response variable. Bidirectional Stepwise Regression is a combination of forward selection (for adding significant terms) and backward selection (for removing nonsignificant terms). As in forward selection, we start with only the intercept and add the most significant term to the model. We continue to add the most significant variables, one at a time. We use a p-value threshold to determine when to stop adding terms to the model.

Data

In this exercise we will use following data set.

gp_soil_data.csv

Code
library(tidyverse)
urlfile = "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data.csv"
mf<-read_csv(url(urlfile))

First, we create a dataframe with five predictors (‘DEM’, ‘Slope’, ‘MAT’, ‘MAP’,‘NDVI’) and create a training and test data set.

Code
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, MAT, MAP,NDVI)

There are many functions and R packages for computing stepwise regression. These include:

Stepwise Regression with MASS Package

The MASS package in R provides a function called stepAIC() that can be used to perform stepwise regression analysis with either forward selection, backward elimination, or both.

install.packages(“MASS”)

Full Model

1st you have to fit a MLR model with all predictors:

Code
model.full <- lm(SOC ~., data = df)
anova(model.full)
Analysis of Variance Table

Response: SOC
           Df Sum Sq Mean Sq F value    Pr(>F)    
DEM         1  329.6  329.61  22.686 2.559e-06 ***
Slope       1 1941.4 1941.39 133.621 < 2.2e-16 ***
MAT         1 1180.8 1180.77  81.270 < 2.2e-16 ***
MAP         1 1458.0 1458.03 100.352 < 2.2e-16 ***
NDVI        1  254.9  254.88  17.543 3.367e-05 ***
Residuals 461 6697.9   14.53                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choose a model by AIC in a Stepwise Algorithm

stepAIC() function in MASS package, which choose the best model by Akaike Information Criterion (AIC). It has an option named direction, which can take the following values:

both - for stepwise regression, both forward and backward selection

backward- for backward selection

forward - for forward selection)

Code
# Stepwise regression model both forward and backward selection 
model.MASS <- MASS::stepAIC(model.full, direction = "both", 
                      trace = FALSE)
summary(model.MASS)

Call:
lm(formula = SOC ~ Slope + MAT + MAP + NDVI, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.6203  -2.1059  -0.4424   1.3312  16.2115 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.088559   0.845323   1.288  0.19848    
Slope        0.139063   0.050332   2.763  0.00596 ** 
MAT         -0.284372   0.059988  -4.741 2.84e-06 ***
MAP          0.006439   0.001565   4.115 4.59e-05 ***
NDVI         8.903651   2.041594   4.361 1.60e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.81 on 462 degrees of freedom
Multiple R-squared:  0.4348,    Adjusted R-squared:  0.4299 
F-statistic: 88.84 on 4 and 462 DF,  p-value: < 2.2e-16

we can explain the output with report package:

Code
library(report)
report::report(model.MASS)
We fitted a linear model (estimated using OLS) to predict SOC with Slope, MAT,
MAP and NDVI (formula: SOC ~ Slope + MAT + MAP + NDVI). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.43,
F(4, 462) = 88.84, p < .001, adj. R2 = 0.43). The model's intercept,
corresponding to Slope = 0, MAT = 0, MAP = 0 and NDVI = 0, is at 1.09 (95% CI
[-0.57, 2.75], t(462) = 1.29, p = 0.198). Within this model:

  - The effect of Slope is statistically significant and positive (beta = 0.14,
95% CI [0.04, 0.24], t(462) = 2.76, p = 0.006; Std. beta = 0.13, 95% CI [0.04,
0.22])
  - The effect of MAT is statistically significant and negative (beta = -0.28,
95% CI [-0.40, -0.17], t(462) = -4.74, p < .001; Std. beta = -0.23, 95% CI
[-0.33, -0.14])
  - The effect of MAP is statistically significant and positive (beta = 6.44e-03,
95% CI [3.36e-03, 9.51e-03], t(462) = 4.11, p < .001; Std. beta = 0.26, 95% CI
[0.14, 0.39])
  - The effect of NDVI is statistically significant and positive (beta = 8.90,
95% CI [4.89, 12.92], t(462) = 4.36, p < .001; Std. beta = 0.29, 95% CI [0.16,
0.41])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

Stepwise Regression with leap Package

The leap package in R provides functions for performing stepwise regression analysis using various criteria for model selection, including AIC, BIC, and adjusted R-squared

Model selection by exhaustive search, forward or backward stepwise, or sequential replacement

regsubsets() from leaps package, which has the tuning parameter nvmax specifying the maximal number of predictors to incorporate in the model. It returns multiple models with different size up to nvmax. We need to compare the performance of the different models for choosing the best one. regsubsets() has the option method, which can take the values backward, forward and seqrep (seqrep = sequential replacement, combination of forward and backward selections).

install.packages(“leaps”)

Code
library(leaps)
model.leaps <- leaps::regsubsets(SOC~., data =df, nvmax = 5,
                     method = "seqrep")
summary(model.leaps)
Subset selection object
Call: regsubsets.formula(SOC ~ ., data = df, nvmax = 5, method = "seqrep")
5 Variables  (and intercept)
      Forced in Forced out
DEM       FALSE      FALSE
Slope     FALSE      FALSE
MAT       FALSE      FALSE
MAP       FALSE      FALSE
NDVI      FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: 'sequential replacement'
         DEM Slope MAT MAP NDVI
1  ( 1 ) " " " "   " " " " "*" 
2  ( 1 ) " " " "   "*" " " "*" 
3  ( 1 ) " " " "   "*" "*" "*" 
4  ( 1 ) " " "*"   "*" "*" "*" 
5  ( 1 ) "*" "*"   "*" "*" "*" 

Extract Model Coefficients

Code
coef(model.leaps, 1:5)
[[1]]
(Intercept)        NDVI 
  -1.643239   18.299763 

[[2]]
(Intercept)         MAT        NDVI 
  1.7404457  -0.3019544  16.6916581 

[[3]]
 (Intercept)          MAT          MAP         NDVI 
 2.312513577 -0.384202169  0.006789594  9.269312903 

[[4]]
(Intercept)       Slope         MAT         MAP        NDVI 
 1.08855877  0.13906306 -0.28437213  0.00643859  8.90365110 

[[5]]
  (Intercept)           DEM         Slope           MAT           MAP 
 2.2088309462 -0.0003852617  0.1660844917 -0.3239458581  0.0061025986 
         NDVI 
 8.6687409551 

Variance-Covariance Matrix

The variance-covariance matrix (also known as the covariance matrix or the dispersion matrix) is a matrix that summarizes the variances and covariances of a set of variables. In statistical analysis, it is used to describe the relationships between two or more variables.

Code
vcov(model.leaps, 5)
              (Intercept)           DEM         Slope           MAT
(Intercept)  3.2407929391 -8.685001e-04  3.859343e-02 -1.294394e-01
DEM         -0.0008685001  2.986772e-07 -2.094859e-05  3.067986e-05
Slope        0.0385934317 -2.094859e-05  4.005389e-03 -3.312149e-04
MAT         -0.1294394368  3.067986e-05 -3.312149e-04  6.753831e-03
MAP         -0.0004959253  2.604798e-07 -2.467079e-05 -7.335917e-06
NDVI        -1.3307235274  1.821160e-04 -1.944182e-02  5.648719e-02
                      MAP         NDVI
(Intercept) -4.959253e-04 -1.330723527
DEM          2.604798e-07  0.000182116
Slope       -2.467079e-05 -0.019441820
MAT         -7.335917e-06  0.056487186
MAP          2.678305e-06 -0.002486248
NDVI        -2.486248e-03  4.283693218

Stepwise regression with caret package

The R package caret (Classification And REgression Training) is a widely used package in machine learning for building predictive models. It provides a unified interface for performing data preparation, feature selection, model tuning, and model evaluation.

The train()* function caret** package provides an easy workflow to perform stepwise selections using the leaps and the MASS packages. It has an option named method, which can take the following values:

leapBackward, to fit linear regression with backward selection

leapForward, to fit linear regression with forward selection

leapSeq, to fit linear regression with stepwise selection

install.package(“caret”)

leapBackward

Code
library(caret)
# Set seed for reproducibility
set.seed(123)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model.caret.leaps <- train(SOC ~., data = df,
                    method = "leapBackward", 
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control
                    )
model.caret.leaps$results
  nvmax     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1     1 4.318065 0.2878492 3.054081 0.3522772 0.07295631 0.2556071
2     2 4.018065 0.3868724 2.864126 0.3538864 0.11208172 0.2817436
3     3 3.887540 0.4245143 2.768027 0.3838066 0.11380657 0.2566427
4     4 3.830983 0.4399608 2.716350 0.3703366 0.12380366 0.2705464
5     5 3.831526 0.4399150 2.717833 0.3792621 0.12552049 0.2801233

The function summary() reports the best set of variables for each model size, up to the best 4-variables model.

Code
summary(model.caret.leaps$finalModel)
Subset selection object
5 Variables  (and intercept)
      Forced in Forced out
DEM       FALSE      FALSE
Slope     FALSE      FALSE
MAT       FALSE      FALSE
MAP       FALSE      FALSE
NDVI      FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: backward
         DEM Slope MAT MAP NDVI
1  ( 1 ) " " " "   " " " " "*" 
2  ( 1 ) " " " "   "*" " " "*" 
3  ( 1 ) " " " "   "*" "*" "*" 
4  ( 1 ) " " "*"   "*" "*" "*" 

An asterisk specifies that a given variable is included in the corresponding model. For example, it can be seen that the best 5-variables model contains Slope, MAT, MAP, NDVI.

Code
coef(model.caret.leaps$finalModel, 4)
(Intercept)       Slope         MAT         MAP        NDVI 
 1.08855877  0.13906306 -0.28437213  0.00643859  8.90365110 

lmStepAIC

Additionally, the caret package has method to compute stepwise regression using the MASS package (method = “lmStepAIC”):

Code
# Train the model
model.caret.mass <- train(SOC ~., data = df,
                    method = "lmStepAIC", 
                    trControl = train.control,
                    trace = FALSE
                    )
Code
# Model accuracy
model.caret.mass$results
  parameter     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      none 3.791739 0.4420481 2.721031 0.6487727 0.09320663 0.4010154
Code
# Final model coefficients
model.caret.mass$finalModel

Call:
lm(formula = .outcome ~ Slope + MAT + MAP + NDVI, data = dat)

Coefficients:
(Intercept)        Slope          MAT          MAP         NDVI  
   1.088559     0.139063    -0.284372     0.006439     8.903651  
Code
# Summary of the model
summary(model.caret.mass$finalModel)

Call:
lm(formula = .outcome ~ Slope + MAT + MAP + NDVI, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.6203  -2.1059  -0.4424   1.3312  16.2115 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.088559   0.845323   1.288  0.19848    
Slope        0.139063   0.050332   2.763  0.00596 ** 
MAT         -0.284372   0.059988  -4.741 2.84e-06 ***
MAP          0.006439   0.001565   4.115 4.59e-05 ***
NDVI         8.903651   2.041594   4.361 1.60e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.81 on 462 degrees of freedom
Multiple R-squared:  0.4348,    Adjusted R-squared:  0.4299 
F-statistic: 88.84 on 4 and 462 DF,  p-value: < 2.2e-16

Compare the Models

We can compare the performance the MLR models with sub-set of predictors compare_performance() function.

Code
performance::compare_performance(model.full, model.caret.leaps$finalModel, model.caret.mass$finalModel)
# Comparison of Model Performance Indices

Name | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------------------------------------------
..1  |    lm | 2583.0 (0.321) | 2583.3 (0.315) | 2612.0 (0.056) | 0.435 |     0.429 | 3.787 | 3.812
..3  |    lm | 2581.5 (0.679) | 2581.7 (0.685) | 2606.4 (0.944) | 0.435 |     0.430 | 3.789 | 3.810

Exercise

  1. Create a R-Markdown documents (name homework_07.rmd) in this project and do all Tasks using the data shown below.

  2. Submit all codes and output as a HTML document (homework_07.html) before class of next week.

Required R-Package

tidyverse, broom, report, performance, ggmisc, jtools, MASS, leap, caret

Data

  1. bd_arsenic_data.csv

Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:

df<-read_csv(“bd_arsenic_data.csv”))

Tasks

  1. Create a full model MLR for predicting Soil As (SAs) with following variables:
  • WAs: Groundwater As

  • WP: Groundwater P

  • WFe: Groundwater Fe

  • WEc: Groundwater Ec

  • WpH: Groundwater pH

  • SAoFe: Oxalate Extractable

  • SpH: Soil pH

  • SEc: Soil Ec

  • SOC: Soil Organic Carbon

  • SAND: Sand

  • SILT: Silt

  • CLAY: Clay

  • SP: Soil pH

  • ELEV: Elevation

  • Year_Irrig: Year of Irrigation

  • DIS_STW: Distance from STW

  1. Use MASS and leap packages to select subset of optimum number of predictors and fit the models.

  2. Compare the performance the MLR models with sub-set of predictors by both MASS and leap packages . Use full model as a base-line mode.

  3. Use Stepwise regression with caret package, try both leapBackward and lmStepAIC

Further Reading

  1. Model Selection Essentials in R

  2. Variable Selection in Multiple Regression

  3. Akaike Information Criterion

  4. caret