Regularized Generalized Linear Models

Regularized Generalized Linear Models (GLMs) are a family of statistical models that combine generalized linear models with regularization techniques to improve generalization performance and reduce overfitting. Overfitting occurs when a model becomes too complex and starts fitting the noise in the data, resulting in poor performance on new, unseen data. Regularization addresses this issue by adding a penalty term to the loss function that the model minimizes.

Common regularization techniques include L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net regularization. Like classical linear regression, Ridge and Lasso fit a linear model; what distinguishes them is regularization. These methods modify the loss function so that it depends not only on the sum of squared residuals but also on the magnitude of the regression coefficients.

L1 regularization adds a penalty term proportional to the absolute value of the coefficients, which results in sparse solutions where some coefficients are precisely zero. This can be used for feature selection, as it tends to set coefficients of less important features to zero.

L2 regularization adds a penalty term proportional to the square of the coefficients, which tends to shrink the coefficients towards zero, but does not set them precisely to zero. This can be useful when all the features are necessary, but some may have minimal effects.

Elastic Net regularization combines L1 and L2 regularization and balances feature selection and coefficient shrinkage.
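
For a linear model with n observations and p predictors, the three penalized least-squares objectives can be written as follows (a sketch using glmnet's parameterization, in which lambda controls the overall penalty strength and alpha mixes the two norms):

$$\text{Ridge:}\quad \min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2 \;+\; \frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2$$

$$\text{Lasso:}\quad \min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

$$\text{Elastic Net:}\quad \min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2 \;+\; \lambda\Bigl[\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\Bigr]$$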

Regularization can be a powerful tool for improving the performance of machine learning models, mainly when dealing with high-dimensional data or datasets with a limited number of observations. It can help to prevent overfitting, improve the interpretability of the model, and provide insights into the importance of different features in the data.

Note

In machine learning, a loss function is a mathematical function that measures the difference between the predicted output of a model and the actual output given a set of input data. The loss function quantifies how well the model performs and provides a way to optimize the model’s parameters during training.

The goal of machine learning is to minimize the loss function, which represents the error between the predicted output of the model and the actual output. By minimizing the loss function, the model learns to make better predictions on new data that it has not seen before.
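
As a small illustration, a mean squared error loss can be computed directly in R (a toy example with made-up numbers):

Code
# Toy example: mean squared error as a loss function
actual    <- c(3.1, 4.0, 5.2, 2.8)   # observed values (made up)
predicted <- c(2.9, 4.3, 5.0, 3.1)   # model predictions (made up)
mse_loss  <- mean((actual - predicted)^2)
mse_loss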

Ridge, Lasso, and Elastic Net Regressions

Lasso and Ridge regression are linear regression techniques that add an L1 and an L2 regularization term, respectively, to the cost function to prevent overfitting. The regularization term penalizes large regression coefficients, shrinking them toward zero. The regularization parameter, lambda, controls the strength of the regularization. Larger values of lambda result in more shrinkage of the coefficients towards zero and can lead to a simpler model with reduced variance but potentially increased bias. Smaller values of lambda result in less shrinkage and a model that is closer to ordinary linear regression.

Ridge and Lasso regressions can be used when there is multicollinearity among the predictor variables, which can cause instability in the estimates of the regression coefficients. By adding the regularization term, ridge regression can help stabilize the estimates and improve the model’s accuracy.

Elastic Net regularization is a combination of Lasso and Ridge regularization techniques. Elastic Net regularization adds two penalty terms to the cost function of a linear regression model: the L1 penalty and the L2 penalty. The L1 penalty encourages sparsity in the model by setting some of the coefficients to zero, while the L2 penalty shrinks the magnitude of the coefficients toward zero.

The Elastic Net penalty term is a weighted sum of the L1 and L2 penalties, controlled by a hyperparameter called alpha. When alpha is set to zero, Elastic Net is equivalent to Ridge regularization, and when alpha is set to one, Elastic Net is equivalent to Lasso regularization.

The “glmnet” method in caret has an alpha argument that determines what type of model is fit: if alpha = 0 a ridge regression model is fit; if alpha = 1 a lasso model is fit; and a value between 0 and 1 (say 0.3) fits an elastic net model. Here we’ll use caret as a wrapper for the glmnet package.

Data

In this exercise we will use the following data set.

gp_soil_data.csv

Code
# define data folder
dataFolder<-"E:/Dropbox/GitHub/Data/USA/"
# Load Packages
library(tidyverse)
# Load data
mf<-read_csv(paste0(dataFolder, "gp_soil_data.csv"))
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, Aspect, TPI, KFactor, SiltClay, MAT, MAP,NDVI, NLCD, FRG)%>%
    glimpse()
Rows: 467
Columns: 12
$ SOC      <dbl> 15.763, 15.883, 18.142, 10.745, 10.479, 16.987, 24.954, 6.288…
$ DEM      <dbl> 2229.079, 1889.400, 2423.048, 2484.283, 2396.195, 2360.573, 2…
$ Slope    <dbl> 5.6716146, 8.9138117, 4.7748051, 7.1218114, 7.9498644, 9.6632…
$ Aspect   <dbl> 159.1877, 156.8786, 168.6124, 198.3536, 201.3215, 208.9732, 2…
$ TPI      <dbl> -0.08572358, 4.55913162, 2.60588670, 5.14693117, 3.75570583, …
$ KFactor  <dbl> 0.31999999, 0.26121211, 0.21619999, 0.18166667, 0.12551020, 0…
$ SiltClay <dbl> 64.84270, 72.00455, 57.18700, 54.99166, 51.22857, 45.02000, 5…
$ MAT      <dbl> 4.5951686, 3.8599243, 0.8855000, 0.4707811, 0.7588266, 1.3586…
$ MAP      <dbl> 468.3245, 536.3522, 859.5509, 869.4724, 802.9743, 1121.2744, …
$ NDVI     <dbl> 0.4139390, 0.6939532, 0.5466033, 0.6191013, 0.5844722, 0.6028…
$ NLCD     <chr> "Shrubland", "Shrubland", "Forest", "Forest", "Forest", "Fore…
$ FRG      <chr> "Fire Regime Group IV", "Fire Regime Group IV", "Fire Regime …

Ridge, Lasso, and Elastic Net Regressions with caret package

Create A Full Set of Dummy Variables

dummyVars() from the caret package creates a full set of dummy variables (i.e., a less-than-full-rank parameterization).

Code
library(caret)
# create dummy variable
dummies <- dummyVars(SOC ~ ., data = df)
dummies.df<-as.data.frame(predict(dummies, newdata = df))
dummies.df$SOC<-df$SOC

Data Splitting

The function createDataPartition() from the caret package can be used to create balanced splits of the data.

Code
set.seed(3456)
trainIndex <- createDataPartition(dummies.df$SOC, p = .80, 
                                  list = FALSE, 
                                  times = 1)
df_train <- dummies.df[ trainIndex,]
df_test  <- dummies.df[-trainIndex,]
Code
# Density plot all, train and test data 
ggplot()+
  geom_density(data = dummies.df, aes(SOC))+
  geom_density(data = df_train, aes(SOC), color = "green")+
  geom_density(data = df_test, aes(SOC), color = "red") +
      xlab("Soil Organic Carbon (kg/g)") + 
     ylab("Density") 

Set control parameters

Code
set.seed(123)
# Note: pre-processing options such as preProcess = c("center", "scale", "nzv")
# belong in train(), not trainControl(); no pre-processing is applied here
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 5)

Train the Lasso and Ridge Models

Code
# Lasso Regression (alpha = 1); lambda fixed at 1 for illustration
model.lasso <- train(SOC ~., data = df_train, 
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 1, lambda = 1), 
                     trControl = train.control)

# Ridge Regression (alpha = 0); lambda fixed at 1 for illustration
model.ridge <- train(SOC ~., data = df_train, 
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 0, lambda = 1), 
                     trControl = train.control)

Summary of the Lasso and Ridge Models

Code
print(model.lasso)
glmnet 

375 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 338, 336, 337, 338, 338, 337, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.109635  0.3864478  3.004719

Tuning parameter 'alpha' was held constant at a value of 1
Tuning
 parameter 'lambda' was held constant at a value of 1
Code
print(model.ridge)
glmnet 

375 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 337, 338, 339, 335, 338, 339, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  3.872206  0.4199461  2.792984

Tuning parameter 'alpha' was held constant at a value of 0
Tuning
 parameter 'lambda' was held constant at a value of 1

Prediction

Code
df_test$Pred.lasso = predict(model.lasso, df_test)
df_test$Pred.ridge = predict(model.ridge, df_test)

Model Performance

Code
library(Metrics)
RMSE.lasso<- Metrics::rmse(df_test$SOC, df_test$Pred.lasso)
RMSE.ridge<- Metrics::rmse(df_test$SOC, df_test$Pred.ridge)
RMSE.ridge
[1] 3.6876
Code
RMSE.lasso
[1] 4.076164

We can plot the observed and predicted values with a fitted regression line using ggplot2:

Code
library(ggpmisc)
formula<-y~x

# Lasso regression
p1=ggplot(df_test, aes(SOC,Pred.lasso)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Lasso Regression ") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Customize the theme
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

# Ridge regression
p2=ggplot(df_test, aes(SOC,Pred.ridge)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Ridge Regression") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Customize the theme
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

# Plot side by side
library(patchwork)
p1+p2

Train Elastic Net Regression

The elastic net regression can be easily computed using the caret workflow with the glmnet package. We use caret to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values, then selects the best values for lambda and alpha, resulting in a final elastic net model.

Here, we’ll test the combination of 10 different values for alpha and lambda, specified using the tuneLength option. The best alpha and lambda values are those that minimize the cross-validation error.

Code
# Elastic Net Regression
model.elasticNet <- train(SOC ~., data = df_train, 
                     method = "glmnet",
                     trControl = train.control,
                     tuneLength = 10)

The best tuning parameters

Code
model.elasticNet$bestTune
   alpha    lambda
57   0.6 0.2044128

Model coefficients

Code
coef(model.elasticNet$finalModel, model.elasticNet$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
                                      s1
(Intercept)                 5.108541e-01
DEM                         .           
Slope                       8.319210e-02
Aspect                      6.842909e-03
TPI                        -2.200308e-06
KFactor                     .           
SiltClay                    1.664273e-02
MAT                        -2.956952e-01
MAP                         5.761716e-03
NDVI                        7.333893e+00
NLCDForest                  6.274464e-01
NLCDHerbaceous              .           
`NLCDPlanted/Cultivated`    .           
NLCDShrubland               .           
`FRGFire Regime Group I`   -1.640581e+00
`FRGFire Regime Group II`   .           
`FRGFire Regime Group III`  .           
`FRGFire Regime Group IV`  -8.237457e-01
`FRGFire Regime Group V`    .           
`FRGIndeterminate FRG`     -1.944834e-01

Prediction

Code
df_test$Pred.elasticNET = predict(model.elasticNet, df_test)
Code
library(Metrics)
RMSE.elasticNET<- Metrics::rmse(df_test$SOC, df_test$Pred.elasticNET)
RMSE.elasticNET
[1] 3.674035
Code
library(ggpmisc)
formula<-y~x

ggplot(df_test, aes(SOC,Pred.elasticNET)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Elastic Net Regression") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Customize the theme
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

Ridge Regression with tidymodels

The tidymodels framework provides a comprehensive approach for building, tuning, and evaluating regularized GLM models in R while following the principles of the tidyverse.

install.packages("tidymodels")

Split Data

We use the rsample package, installed with tidymodels, to split the data into training (80%) and test (20%) sets with stratified random sampling. initial_split() creates a single binary split of the data into a training set and a testing set.

Note

Stratified random sampling is a technique for selecting a representative sample from a population, where the sample is chosen in a way that ensures that certain subgroups within the population are adequately represented in the sample.

Code
#| fig.width: 4
#| fig.height: 4

library(tidymodels)
set.seed(1245)   # for reproducibility
split <- initial_split(df, prop = 0.8, strata = SOC)
train <- split %>% training()
test <-  split %>% testing()

# Set 10 fold cross-validation data set 
cv_fold <- vfold_cv(train, v = 10)

# Density plot all, train and test data 
ggplot()+
  geom_density(data = df, aes(SOC))+
  geom_density(data = train, aes(SOC), color = "green")+
  geom_density(data = test, aes(SOC), color = "red") +
      xlab("Soil Organic Carbon (kg/g)") + 
     ylab("Density") 

Create Recipe

A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis. Before training the model, we can use a recipe to do some preprocessing required by the model.

Code
# load library
library(tidymodels)

# Create a recipe
model_recipe <-
  recipe(SOC ~ ., data = train) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_numeric_predictors())

Build a Ridge Regression model

set_engine() is used to specify which package or system will be used to fit the model, along with any arguments specific to that software. We will use the “glmnet” engine. The mixture argument specifies the type of regularization: mixture = 0 specifies only ridge regularization and mixture = 1 specifies only lasso regularization. When using the glmnet engine, we also need to set a penalty to be able to fit the model. Setting penalty = tune() tells tune_grid() that the penalty parameter should be tuned.

Code
ridge_mod <- 
   linear_reg(mixture = 0, penalty = tune()) %>%
   set_mode("regression") %>%
   set_engine("glmnet")

Create a Workflow

A workflow is a container object that aggregates information required to fit and predict from a model. This information might be a recipe used in pre-processing, specified through add_recipe(), or the model specification to fit, specified through add_model().

Code
ridge_wflow <- 
  workflow() %>% 
  add_model(ridge_mod) %>% 
  add_recipe(model_recipe)

Set Penalty Grid

This grid can be created using grid_regular(), which creates a grid of evenly spaced parameter values. We use the penalty() function from the dials package to denote the parameter and to set the range of the grid we are searching. Note that this range is log-scaled.

Code
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
penalty_grid
# A tibble: 50 × 1
     penalty
       <dbl>
 1 0.00001  
 2 0.0000160
 3 0.0000256
 4 0.0000409
 5 0.0000655
 6 0.000105 
 7 0.000168 
 8 0.000268 
 9 0.000429 
10 0.000687 
# ℹ 40 more rows

Grid search for the best penalty

Code
tune_res <- tune_grid(
  ridge_wflow,
  resamples = cv_fold, 
  grid = penalty_grid
)

tune_res
# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits           id     .metrics           .notes          
   <list>           <chr>  <list>             <list>          
 1 <split [333/38]> Fold01 <tibble [100 × 5]> <tibble [1 × 3]>
 2 <split [334/37]> Fold02 <tibble [100 × 5]> <tibble [1 × 3]>
 3 <split [334/37]> Fold03 <tibble [100 × 5]> <tibble [1 × 3]>
 4 <split [334/37]> Fold04 <tibble [100 × 5]> <tibble [1 × 3]>
 5 <split [334/37]> Fold05 <tibble [100 × 5]> <tibble [1 × 3]>
 6 <split [334/37]> Fold06 <tibble [100 × 5]> <tibble [1 × 3]>
 7 <split [334/37]> Fold07 <tibble [100 × 5]> <tibble [1 × 3]>
 8 <split [334/37]> Fold08 <tibble [100 × 5]> <tibble [1 × 3]>
 9 <split [334/37]> Fold09 <tibble [100 × 5]> <tibble [1 × 3]>
10 <split [334/37]> Fold10 <tibble [100 × 5]> <tibble [1 × 3]>

There were issues with some computations:

  - Warning(s) x10: A correlation computation is required, but `estimate` is constant...

Run `show_notes(.Last.tune.result)` for more information.

The output of tune_grid() can be visualized with the autoplot() function:

Code
autoplot(tune_res)

We can also see the raw metrics that created this chart by calling collect_metrics():

Code
collect_metrics(tune_res)
# A tibble: 100 × 7
     penalty .metric .estimator  mean     n std_err .config              
       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1 0.00001   rmse    standard   3.90     10  0.300  Preprocessor1_Model01
 2 0.00001   rsq     standard   0.393    10  0.0399 Preprocessor1_Model01
 3 0.0000160 rmse    standard   3.90     10  0.300  Preprocessor1_Model02
 4 0.0000160 rsq     standard   0.393    10  0.0399 Preprocessor1_Model02
 5 0.0000256 rmse    standard   3.90     10  0.300  Preprocessor1_Model03
 6 0.0000256 rsq     standard   0.393    10  0.0399 Preprocessor1_Model03
 7 0.0000409 rmse    standard   3.90     10  0.300  Preprocessor1_Model04
 8 0.0000409 rsq     standard   0.393    10  0.0399 Preprocessor1_Model04
 9 0.0000655 rmse    standard   3.90     10  0.300  Preprocessor1_Model05
10 0.0000655 rsq     standard   0.393    10  0.0399 Preprocessor1_Model05
# ℹ 90 more rows

The “best” value can be selected using select_best(); this function requires us to specify a metric to select against.

Code
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1    1.26 Preprocessor1_Model26

Fit the model with the best penalty value

This value of penalty can then be used with finalize_workflow() to finalize the workflow, replacing tune() with the value of best_penalty. The model should then be fit again, this time using the whole training data set.

Code
# update penalty
ridge_final <- finalize_workflow(ridge_wflow, best_penalty)
# fit the model
ridge_final_fit <- fit(ridge_final, 
                       data = train)

This object has the finalized recipe and fitted model objects inside. You may want to extract the model or recipe objects from the workflow; to do this, you can use the helper function extract_fit_parsnip():

Code
ridge_final_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()
# A tibble: 18 × 3
   term                      estimate penalty
   <chr>                        <dbl>   <dbl>
 1 (Intercept)                 6.32      1.26
 2 DEM                         0.166     1.26
 3 Slope                       0.754     1.26
 4 Aspect                      0.131     1.26
 5 TPI                        -0.0892    1.26
 6 KFactor                     0.0326    1.26
 7 SiltClay                    0.371     1.26
 8 MAT                        -0.860     1.26
 9 MAP                         0.978     1.26
10 NDVI                        1.15      1.26
11 NLCD_Herbaceous            -0.150     1.26
12 NLCD_Planted.Cultivated    -0.146     1.26
13 NLCD_Shrubland             -0.353     1.26
14 FRG_Fire.Regime.Group.II    0.383     1.26
15 FRG_Fire.Regime.Group.III   0.0854    1.26
16 FRG_Fire.Regime.Group.IV    0.167     1.26
17 FRG_Fire.Regime.Group.V    -0.110     1.26
18 FRG_Indeterminate.FRG      -0.0548    1.26

Prediction

Code
# predict() returns a tibble with a .pred column
test$ridge.SOC<-as.data.frame(predict(ridge_final_fit, new_data = test))
Code
RMSE<- Metrics::rmse(test$SOC, test$ridge.SOC$.pred)
MAE<- Metrics::mae(test$SOC, test$ridge.SOC$.pred)
MSE<- Metrics::mse(test$SOC, test$ridge.SOC$.pred)
MDAE<- Metrics::mdae(test$SOC, test$ridge.SOC$.pred)

# Print results
paste0("RMSE: ", round(RMSE,2))
[1] "RMSE: 3.52"
Code
paste0("MAE: ", round(MAE,2))
[1] "MAE: 2.62"
Code
paste0("MSE: ", round(MSE,2))
[1] "MSE: 12.4"
Code
paste0("MDAE: ", round(MDAE,2))
[1] "MDAE: 2.13"
Code
augment(ridge_final_fit, new_data = test) %>%
  rsq(truth = SOC, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.568
Code
library(ggpmisc)
formula<-y~x

ggplot(test, aes(SOC,ridge.SOC$.pred)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Ridge Regressin  with Grid Search ") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Customize the theme
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

GLM with h2o

The h2o package provides an open-source implementation of GLMs that is specifically designed for big data. To use GLM in h2o, you first need to install and load the package in your R environment.

H2O supports elastic net regularization, which is a combination of the L1 and L2 penalties parametrized by the alpha and lambda arguments.

alpha controls the elastic net penalty distribution between the L1 and L2 norms. It can take any value in the [0, 1] range, or a vector of values (which triggers grid search). If alpha = 0, H2O solves the GLM using ridge regression; if alpha = 1, the Lasso penalty is used.

lambda controls the penalty strength. It can take any positive value, or a vector of values (which triggers grid search).

We will use Random Grid Search (RGS) to find the optimal parameters for the GLM models. We employ the K-fold cross validation method to determine the optimal hyper-parameters from a set of all possible hyper-parameter value combinations of alpha and lambda.

We can view the results of the grid search by printing the grid object. We can use this information to select the best model based on its CV performance and then use it to make predictions on new data.

Convert to factor

Code
df$NLCD <- as.factor(df$NLCD)
df$FRG <- as.factor(df$FRG)

Data split

Code
library(tidymodels)
set.seed(1245)   # for reproducibility
split.df <- initial_split(df, prop = 0.8, strata = SOC)
train.df <- split.df %>% training()
test.df <-  split.df %>% testing()
Code
#| warning: false
#| error: false

library(h2o)

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit https://docs.h2o.ai

----------------------------------------------------------------------

Attaching package: 'h2o'
The following objects are masked from 'package:lubridate':

    day, hour, month, week, year
The following objects are masked from 'package:stats':

    cor, sd, var
The following objects are masked from 'package:base':

    %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc
Code
h2o.init(nthreads = -1, max_mem_size = "148g", enable_assertions = FALSE) 

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\zahmed2\AppData\Local\Temp\1\RtmpEdicW7\file34081f6eefe/h2o_zahmed2_started_from_r.out
    C:\Users\zahmed2\AppData\Local\Temp\1\RtmpEdicW7\file340818436689/h2o_zahmed2_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 205 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.40.0.4 
    H2O cluster version age:    3 months and 23 days 
    H2O cluster name:           H2O_started_from_R_zahmed2_qtu738 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   148.00 GB 
    H2O cluster total cores:    40 
    H2O cluster allowed cores:  40 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.3.1 (2023-06-16 ucrt) 
Warning in h2o.clusterInfo(): 
Your H2O cluster version is (3 months and 23 days) old. There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
Code
#disable progress bar for RMarkdown
h2o.no_progress() 
# Optional: remove anything from previous session
h2o.removeAll()   

Import data to h2o cluster

Code
h_df=as.h2o(df)
h_train = as.h2o(train.df)
h_test = as.h2o(test.df)
Code
CV.xy<- as.data.frame(h_train)
test.xy<- as.data.frame(h_test)

Define response and predictors

Code
y <- "SOC"
x <- setdiff(names(h_df), y)

Grid Search for Hyperparameters

We must find the optimal values of the alpha and lambda regularization parameters to get the best possible model. To find them, H2O provides grid search over alpha and lambda, and a particular form of grid search called “lambda search” over lambda. For a detailed explanation, refer to the H2O Regularization documentation. The recommended way to find optimal regularization settings in H2O is to do a grid search over a few alpha values with an automatic lambda search for each alpha.

The alpha parameter controls the distribution between the l1 (Lasso) and l2 (Ridge regression) penalties. A value of 1.0 for alpha represents Lasso, and an alpha value of 0.0 produces ridge regression. The lambda parameter controls the amount of regularization applied. If lambda is 0.0, no regularization is applied and the alpha parameter is ignored. The default value for lambda is calculated by H2O using a heuristic based on the training data. If you let H2O calculate the value for lambda, you can see the chosen value in the model output.
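
As an aside, a single H2O GLM fit with automatic lambda search might look like the hedged sketch below (not run in this exercise; here we instead supply an explicit lambda grid):

Code
# Sketch: one fit per alpha value with automatic lambda search
glm_ls <- h2o.glm(x = x, y = y, training_frame = h_train,
                  alpha = 0.5,            # fixed mixing value for this fit
                  lambda_search = TRUE,   # let H2O search a lambda path
                  nfolds = 10)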

Code
glm_hyper_params <-list(
             alpha = c(0,0.25,0.5,0.75,1),
             lambda = c(1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0))
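
The call that actually launches the grid search is missing from the listing; a sketch consistent with the later output (grid id "glm_grid_ID", 40 models, and 10-fold cross-validation with holdout predictions kept for the CV metrics below) might look like:

Code
# Sketch of the grid launch (the original call is not shown in the listing)
glm_grid <- h2o.grid(
  algorithm = "glm",
  grid_id = "glm_grid_ID",                   # id used by h2o.getGrid() below
  x = x, y = y,
  training_frame = h_train,
  nfolds = 10,                               # 10-fold cross-validation
  keep_cross_validation_predictions = TRUE,  # needed for the CV holdout frame
  hyper_params = glm_hyper_params)
# Note: the default strategy is a full Cartesian grid (5 x 8 = 40 models);
# search_criteria = list(strategy = "RandomDiscrete", ...) would give the
# random grid search mentioned above.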

Get Grid parameters

Code
glm_get_grid <- h2o.getGrid("glm_grid_ID",sort_by="RMSE",decreasing=FALSE)
glm_get_grid@summary_table[1,]
Hyper-Parameter Search Summary: ordered by increasing RMSE
  alpha lambda            model_ids    rmse
1  0.75    0.1 glm_grid_ID_model_14 3.97959
Code
# number of models
length(glm_grid@model_ids)
[1] 40

The Best GLM Model

Code
best_glm <- h2o.getModel(glm_get_grid@model_ids[[1]]) 
summary(best_glm)
Model Details:
==============

H2ORegressionModel: glm
Model Key:  glm_grid_ID_model_14 
GLM Model: summary
    family     link                            regularization
1 gaussian identity Elastic Net (alpha = 0.75, lambda = 0.1 )
  number_of_predictors_total number_of_active_predictors number_of_iterations
1                         19                           8                    2
       training_frame
1 train.df_sid_a665_3

H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  14.79506
RMSE:  3.846435
MAE:  2.684092
RMSLE:  0.4910106
Mean Residual Deviance :  14.79506
R^2 :  0.4097689
Null Deviance :9299.692
Null D.o.F. :370
Residual Deviance :5488.967
Residual D.o.F. :362
AIC :2072.435



H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  15.83711
RMSE:  3.979587
MAE:  2.767983
RMSLE:  0.5027996
Mean Residual Deviance :  15.83711
R^2 :  0.3681975
Null Deviance :9353.5
Null D.o.F. :370
Residual Deviance :5875.569
Residual D.o.F. :362
AIC :2097.687


Cross-Validation Metrics Summary: 
                             mean         sd cv_1_valid  cv_2_valid cv_3_valid
mae                      2.780505   0.556972   2.262140    2.816260   3.523586
mean_residual_deviance  15.804888   5.645949  11.926979   16.053085  22.922935
mse                     15.804888   5.645949  11.926979   16.053085  22.922935
null_deviance          935.350000 347.376740 653.781400 1423.715800 719.756200
r2                       0.340638   0.157166   0.210230    0.382105   0.066818
residual_deviance      587.556900 241.561570 465.152160  850.813540 664.765140
rmse                     3.922452   0.682524   3.453546    4.006630   4.787790
rmsle                    0.499673   0.076804   0.490024    0.532909   0.616595
                       cv_4_valid  cv_5_valid cv_6_valid  cv_7_valid
mae                      2.898708    2.573803   2.252031    3.400846
mean_residual_deviance  16.834517   14.355487   9.801908   15.777113
mse                     16.834517   14.355487   9.801908   15.777113
null_deviance          874.229250 1055.323900 485.732200 1076.116000
r2                       0.340965    0.368675   0.362403    0.544282
residual_deviance      555.539060  645.996950 303.859160  489.090500
rmse                     4.102989    3.788864   3.130800    3.972041
rmsle                    0.534765    0.441743   0.423654    0.591546
                        cv_8_valid cv_9_valid cv_10_valid
mae                       3.563235   2.467792    2.046644
mean_residual_deviance   27.432510  13.614654    9.329691
mse                      27.432510  13.614654    9.329691
null_deviance          1556.020800 629.988040  878.836500
r2                        0.286451   0.235644    0.608812
residual_deviance      1097.300400 476.512880  326.539180
rmse                      5.237606   3.689804    3.054454
rmsle                     0.546760   0.430427    0.388310

Scoring History: 
            timestamp   duration iterations negative_log_likelihood objective
1 2023-08-20 23:34:07  0.000 sec          0              9299.69196  25.06656
2 2023-08-20 23:34:07  0.004 sec          2                      NA        NA
  training_rmse training_deviance training_mae training_r2
1            NA                NA           NA          NA
2       3.84643          14.79506      2.68409     0.40977

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
                    variable relative_importance scaled_importance percentage
1                       NDVI            1.428084          1.000000   0.275518
2                        MAT            1.032379          0.722912   0.199175
3                        MAP            0.953076          0.667381   0.183875
4                      Slope            0.844157          0.591111   0.162862
5   FRG.Fire Regime Group II            0.513842          0.359812   0.099135
6                   SiltClay            0.317600          0.222396   0.061274
7                        TPI            0.068080          0.047672   0.013135
8                     Aspect            0.026052          0.018243   0.005026
9    FRG.Fire Regime Group I            0.000000          0.000000   0.000000
10 FRG.Fire Regime Group III            0.000000          0.000000   0.000000
11  FRG.Fire Regime Group IV            0.000000          0.000000   0.000000
12   FRG.Fire Regime Group V            0.000000          0.000000   0.000000
13     FRG.Indeterminate FRG            0.000000          0.000000   0.000000
14               NLCD.Forest            0.000000          0.000000   0.000000
15           NLCD.Herbaceous            0.000000          0.000000   0.000000
16   NLCD.Planted/Cultivated            0.000000          0.000000   0.000000
17            NLCD.Shrubland            0.000000          0.000000   0.000000
18                       DEM            0.000000          0.000000   0.000000
19                   KFactor            0.000000          0.000000   0.000000

Training performance

Code
 h2o.performance(best_glm, h_train)
H2ORegressionMetrics: glm

MSE:  14.79506
RMSE:  3.846435
MAE:  2.684092
RMSLE:  0.4910106
Mean Residual Deviance :  14.79506
R^2 :  0.4097689
Null Deviance :9299.692
Null D.o.F. :370
Residual Deviance :5488.967
Residual D.o.F. :362
AIC :2072.435

Cross-validation Metrics

Code
# CV - prediction
pred.cv<-as.data.frame(h2o.getFrame(best_glm@model[["cross_validation_holdout_predictions_frame_id"]][["name"]]))
# Get CV predicted values
CV.xy$SOC.pred<-pred.cv$predict
# load Metrics package
library(Metrics)
# Get several Metrics
cv.lm<-lm(SOC ~ SOC.pred, CV.xy)
cv.matrics <- cbind("RMSE"=rmse(CV.xy$SOC, CV.xy$SOC.pred),
              "MAE"=mae(CV.xy$SOC, CV.xy$SOC.pred),
              "Bias"=bias(CV.xy$SOC, CV.xy$SOC.pred),
              "MSE"= mse(CV.xy$SOC, CV.xy$SOC.pred),
              "MDAE"=mdae(CV.xy$SOC, CV.xy$SOC.pred),
               "R2"= summary(cv.lm)$r.squared
                  )
cv.matrics
         RMSE      MAE        Bias      MSE     MDAE        R2
[1,] 3.979587 2.767983 -0.01749009 15.83711 1.852055 0.3682142

Prediction at test locations

Code
h2o.performance(best_glm, h_test)
H2ORegressionMetrics: glm

MSE:  11.82858
RMSE:  3.43927
MAE:  2.577785
RMSLE:  0.4965142
Mean Residual Deviance :  11.82858
R^2 :  0.5567064
Null Deviance :2563.204
Null D.o.F. :95
Residual Deviance :1135.543
Residual D.o.F. :87
AIC :529.606
Code
# test - prediction
test.pred<-as.data.frame(h2o.predict(object = best_glm,newdata = h_test))

test.xy$SOC.pred<-test.pred$predict
test.lm<-lm(SOC ~ SOC.pred, test.xy)
test.matrics <- cbind("RMSE"=rmse(test.xy$SOC, test.xy$SOC.pred),
              "MAE"=mae(test.xy$SOC, test.xy$SOC.pred),
              "Bias"=bias(test.xy$SOC, test.xy$SOC.pred),
              "MSE"= mse(test.xy$SOC, test.xy$SOC.pred),
              "MDAE"=mdae(test.xy$SOC, CV.xy$SOC.pred),
              "R2"= summary(test.lm)$r.squared
              )
test.matrics
        RMSE      MAE       Bias      MSE    MDAE        R2
[1,] 3.43927 2.577785 -0.0159035 11.82858 3.30778 0.5807843
Code
library(ggpmisc)
formula<-y~x

ggplot(test.xy, aes(SOC,SOC.pred)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("H2O: Observed vs Predicted SOC ") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Customize the theme
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

Variable Importance

Variable importance in a GLM regression model represents the coefficient magnitudes. The standardized coefficients are returned if the standardize option is enabled (which is the default). These are the predictor weights of the standardized data and are included for informational purposes only, such as comparing the relative importance of variables.

Code
# Retrieve the variable importance
h2o.varimp(best_glm)
Variable Importances: 
                    variable relative_importance scaled_importance percentage
1                       NDVI            1.428084          1.000000   0.275518
2                        MAT            1.032379          0.722912   0.199175
3                        MAP            0.953076          0.667381   0.183875
4                      Slope            0.844157          0.591111   0.162862
5   FRG.Fire Regime Group II            0.513842          0.359812   0.099135
6                   SiltClay            0.317600          0.222396   0.061274
7                        TPI            0.068080          0.047672   0.013135
8                     Aspect            0.026052          0.018243   0.005026
9    FRG.Fire Regime Group I            0.000000          0.000000   0.000000
10 FRG.Fire Regime Group III            0.000000          0.000000   0.000000
11  FRG.Fire Regime Group IV            0.000000          0.000000   0.000000
12   FRG.Fire Regime Group V            0.000000          0.000000   0.000000
13     FRG.Indeterminate FRG            0.000000          0.000000   0.000000
14               NLCD.Forest            0.000000          0.000000   0.000000
15           NLCD.Herbaceous            0.000000          0.000000   0.000000
16   NLCD.Planted/Cultivated            0.000000          0.000000   0.000000
17            NLCD.Shrubland            0.000000          0.000000   0.000000
18                       DEM            0.000000          0.000000   0.000000
19                   KFactor            0.000000          0.000000   0.000000
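
Because these importances are standardized coefficient magnitudes, the underlying signed values can also be retrieved directly; h2o.coef_norm() returns the coefficients on the standardized scale:

Code
# Coefficients on the standardized scale (these underpin the importances above)
h2o.coef_norm(best_glm)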

To plot variable importance for the model, we use the h2o.varimp_plot() function and pass the trained model as an argument. The function generates a bar plot of the relative importance of each input variable, sorted in descending order by importance.

The h2o.varimp_plot() function also takes an optional num_of_features argument to limit the number of features displayed on the plot.

Code
h2o.varimp_plot(best_glm)
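
For example, the plot can be limited to the most important predictors:

Code
# Show only the ten most important predictors
h2o.varimp_plot(best_glm, num_of_features = 10)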

Exercise

  1. Create an R Markdown document (named homework_12.Rmd) in this project and complete all tasks using the data shown below.

  2. Submit all code and output as an HTML document (homework_12.html) before next week's class.

Required R-Package

tidyverse, caret, Metrics, ggpmisc, report, performance, tidymodels, h2o

Data

  1. bd_soil_update.csv

Download the data and save it in your project directory. Use read_csv() to load the data into your R session. For example:

mf<-read_csv("bd_soil_update.csv")

Tasks

  1. Create a data frame for a GLM model of SOC with the following variables for Rajshahi Division:

    First use filter() to select data from Rajshahi Division, and then use the select() function to create a data frame with the following variables:

    SOM, DEM, NDVI, NDFI, DC, Geo_Description, LT, Land_use

  2. Fit GLM models with grid search for SOC using the tidymodels and h2o packages

  3. Use the best model to predict SOC on the test data set. Show all steps of data processing, grid search, model fitting, and prediction.

  4. Plot h2o variable importance

Further Reading

  1. Linear Model Selection and Regularization

  2. Grid Search H2O

  3. H2O variable importance

YouTube Video

  1. Regularization Part 1: Ridge (L2) Regression

Source: StatQuest with Josh Starmer

  2. Regularization Part 2: Lasso (L1) Regression

Source: StatQuest with Josh Starmer