Stack-Ensemble Model with H2O

H2O's Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking. Like all supervised models in H2O, Stacked Ensemble supports regression, binary classification, and multiclass classification.

To create stacked ensembles with H2O in R, you can follow these general steps (sketched in code after this list):

  1. Set up the ensemble: Specify a list of L base algorithms (each with a specific set of model parameters) and a metalearning algorithm.

  2. Grid search: Find the best model for each of the L base algorithms using a hyperparameter grid search.

  3. Train the L base models: Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.

  4. Prediction: The N cross-validated predicted values from each of the L algorithms are combined to form a new N x L matrix. This matrix, together with the original response vector, is called the "level-one" data (N = number of rows in the training set).

  5. Train the metalearner: Train the metalearning algorithm on the level-one data. The "ensemble model" consists of the L base learning models and the metalearning model, which can then be used to generate predictions on a test set.

  6. Predict with the stacked ensemble: Once the stacked ensemble is trained, you can use it to make predictions on new, unseen data.

  7. Shut down the H2O cluster: After you have finished using H2O, it's good practice to shut down the cluster by running h2o.shutdown().
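These steps map onto H2O's R API as sketched below. Two details matter for stacking: every base model must be trained with keep_cross_validation_predictions = TRUE, and all base models must use the same folds (for example via fold_assignment = "Modulo"), otherwise H2O cannot assemble the level-one data. All names, grids, and parameters in this sketch are placeholders; the actual grids used later in this section are assumed to come from earlier parts of the tutorial.

Code
# A minimal sketch of the stacking workflow (placeholder names and parameters)
library(h2o)
h2o.init()

# 2. Grid search over one base algorithm (a GBM here; hyper_params are illustrative)
gbm_grid <- h2o.grid("gbm", x = x, y = y, training_frame = h_train,
                     hyper_params = list(max_depth = c(3, 5, 7)),
                     nfolds = 5,                   # 3. k-fold CV on each learner
                     fold_assignment = "Modulo",   # identical folds across learners
                     keep_cross_validation_predictions = TRUE)  # needed for level-one data

# 4.-5. Combine the cross-validated predictions (level-one data) and train the metalearner
ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = h_train,
                                base_models = gbm_grid@model_ids)

# 6. Predict on new data, then 7. shut down the cluster
pred <- h2o.predict(ensemble, newdata = h_test)
h2o.shutdown(prompt = FALSE)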

Load Library

Code
library(tidyverse)
library(tidymodels)
library(Metrics)

Data

In this exercise we will use the following synthetic data set, with DEM, Slope, TPI, MAT, MAP, NDVI, NLCD, and FRG as predictors of soil organic carbon (SOC) in a stacked-ensemble regression model. The data were generated synthetically (with AI assistance) from the gp_soil_data data set.

gp_soil_data_syn.csv

Code
# define the data file hosted on GitHub
urlfile <- "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data_syn.csv"
mf <- read_csv(url(urlfile))
# create a data frame with the response (SOC) and the predictors
df <- mf %>%
  dplyr::select(SOC, DEM, Slope, TPI, MAT, MAP, NDVI, NLCD, FRG) %>%
  glimpse()
Rows: 1,408
Columns: 9
$ SOC   <dbl> 1.900, 2.644, 0.800, 0.736, 15.641, 8.818, 3.782, 6.641, 4.803, …
$ DEM   <dbl> 2825.1111, 2535.1086, 1716.3300, 1649.8933, 2675.3113, 2581.4839…
$ Slope <dbl> 18.981682, 14.182393, 1.585145, 9.399726, 12.569353, 6.358553, 1…
$ TPI   <dbl> -0.91606224, -0.15259802, -0.39078590, -2.54008722, 7.40076303, …
$ MAT   <dbl> 4.709227, 4.648000, 6.360833, 10.265385, 2.798550, 6.358550, 7.0…
$ MAP   <dbl> 613.6979, 597.7912, 201.5091, 298.2608, 827.4680, 679.1392, 508.…
$ NDVI  <dbl> 0.6845260, 0.7557631, 0.2215059, 0.2785148, 0.7337426, 0.7017139…
$ NLCD  <chr> "Forest", "Forest", "Shrubland", "Shrubland", "Forest", "Forest"…
$ FRG   <chr> "Fire Regime Group IV", "Fire Regime Group IV", "Fire Regime Gro…

Data Preprocessing

Convert to factor

Code
# H2O requires categorical predictors to be factors
df$NLCD <- as.factor(df$NLCD)
df$FRG <- as.factor(df$FRG)

Data split

Code
set.seed(1245)   # for reproducibility
split_01 <- initial_split(df, prop = 0.8, strata = SOC)
train <- split_01 %>% training()
test <-  split_01 %>% testing()

Data Scaling

Code
# scale the numeric predictors; columns 1 (SOC), 8 (NLCD), and 9 (FRG) are excluded
train[-c(1, 8, 9)] <- scale(train[-c(1, 8, 9)])
test[-c(1, 8, 9)] <- scale(test[-c(1, 8, 9)])
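Note that scale() above standardizes the test set with its own means and standard deviations. A common alternative, sketched below, reuses the training-set statistics for the test set so that new data are transformed exactly as the model expects; columns 1, 8, and 9 (SOC, NLCD, FRG) are again left unscaled.

Code
# Sketch: scale test predictors with training-set statistics to avoid leakage
num_cols <- names(train)[-c(1, 8, 9)]
train_scaled <- scale(train[num_cols])
train[num_cols] <- train_scaled
test[num_cols] <- scale(test[num_cols],
                        center = attr(train_scaled, "scaled:center"),
                        scale  = attr(train_scaled, "scaled:scale"))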

Import h2o

Code
library(h2o)
# use all available cores (nthreads = -1) and allow up to 148 GB of memory
h2o.init(nthreads = -1, max_mem_size = "148g", enable_assertions = FALSE)

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\zahmed2\AppData\Local\Temp\1\Rtmp2JEsAa\file713828706819/h2o_zahmed2_started_from_r.out
    C:\Users\zahmed2\AppData\Local\Temp\1\Rtmp2JEsAa\file7138c0e7b4f/h2o_zahmed2_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 265 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.40.0.4 
    H2O cluster version age:    3 months and 23 days 
    H2O cluster name:           H2O_started_from_R_zahmed2_yqz807 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   148.00 GB 
    H2O cluster total cores:    40 
    H2O cluster allowed cores:  40 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.3.1 (2023-06-16 ucrt) 
Code
#disable progress bar for RMarkdown
h2o.no_progress() 
# Optional: remove anything from previous session
h2o.removeAll()   

Import data to h2o cluster

Code
h_df <- as.h2o(df)
h_train <- as.h2o(train)
h_test <- as.h2o(test)
Code
# keep local data-frame copies for computing metrics and plotting
train.xy <- as.data.frame(h_train)
test.xy <- as.data.frame(h_test)

Define response and predictors

Code
y <- "SOC"
x <- setdiff(names(h_df), y)  # all remaining columns serve as predictors

Stack-Ensemble Model: The Best of Family

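The base learners below (best_GLM, best_RF, best_GBM, best_DNN) are assumed to be the top-ranked models from the hyperparameter grids trained earlier in this tutorial. As a sketch, each could be retrieved from its grid by cross-validated RMSE like this:

Code
# Sketch: pull the best model from each grid, sorted by cross-validated RMSE
# (GLM_grid, RF_grid, GBM_grid, DNN_grid are the grid objects from earlier sections)
best_GLM <- h2o.getModel(h2o.getGrid(GLM_grid@grid_id, sort_by = "rmse",
                                     decreasing = FALSE)@model_ids[[1]])
best_RF  <- h2o.getModel(h2o.getGrid(RF_grid@grid_id,  sort_by = "rmse",
                                     decreasing = FALSE)@model_ids[[1]])
best_GBM <- h2o.getModel(h2o.getGrid(GBM_grid@grid_id, sort_by = "rmse",
                                     decreasing = FALSE)@model_ids[[1]])
best_DNN <- h2o.getModel(h2o.getGrid(DNN_grid@grid_id, sort_by = "rmse",
                                     decreasing = FALSE)@model_ids[[1]])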
Code
# random-forest parameters for the metalearner
rf_params <- list(
  ntrees = 2000,
  max_depth = 25,
  sample_rate = 0.8,
  stopping_tolerance = 0.001,
  stopping_rounds = 3,
  stopping_metric = "RMSE")


stack_best <- h2o.stackedEnsemble(
  model_id = "stack_RF_ID",
  x = x,
  y = y,
  training_frame = h_train,
  #validation_frame = h_valid,
  # the best model from each algorithm family's grid search
  base_models = list(best_GLM,
                     best_RF,
                     best_GBM,
                     best_DNN),
  metalearner_algorithm = "drf",
  metalearner_params = rf_params,
  metalearner_nfolds = 5,
  seed = 42)
stack_best

Prediction

Code
stack.test.best<-as.data.frame(h2o.predict(object = stack_best, newdata = h_test))
test.xy$Stack_SOC_best<-stack.test.best$predict
Code
RMSE.best<- Metrics::rmse(test.xy$SOC, test.xy$Stack_SOC_best)
MAE.best<- Metrics::mae(test.xy$SOC, test.xy$Stack_SOC_best)

# Print results
paste0("RMSE: ", round(RMSE.best,2))
[1] "RMSE: 2.96"
Code
paste0("MAE: ", round(MAE.best,2))
[1] "MAE: 2.13"
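The same test-set metrics can also be computed directly inside H2O, without pulling predictions into a local data frame:

Code
# Test-set performance straight from H2O (values should match the ones above)
perf_best <- h2o.performance(stack_best, newdata = h_test)
h2o.rmse(perf_best)
h2o.mae(perf_best)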
Code
library(ggpmisc)
# formula refers to the plot aesthetics (y ~ x), not the x and y objects defined above
formula <- y ~ x

ggplot(test.xy, aes(SOC, Stack_SOC_best)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Stack-Ensemble: The Best of Family") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  scale_y_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  # theme settings
  theme(
    panel.background = element_rect(fill = "grey95", colour = "gray75",
                                    linewidth = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x = element_text(size = 13, colour = "black"),
    axis.text.y = element_text(size = 13, angle = 90, vjust = 0.5,
                               hjust = 0.5, colour = "black"))

Stack-Ensemble Model - All Models

Code
# collect the model IDs from all four grids built in the earlier grid-search sections
all_01 <- append(GLM_grid@model_ids, RF_grid@model_ids)
all_02 <- append(all_01, DNN_grid@model_ids)
all_03 <- append(all_02, GBM_grid@model_ids)
length(all_03)
[1] 64
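The chained append() calls can also be collapsed into a single c() call, since c() concatenates lists:

Code
# Equivalent one-liner: concatenate the model IDs from all four grids
all_03 <- c(GLM_grid@model_ids, RF_grid@model_ids,
            DNN_grid@model_ids, GBM_grid@model_ids)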
Code
stack_all <- h2o.stackedEnsemble(
  model_id = "stack_Model_ALL_IDs",
  x = x,
  y = y,
  training_frame = h_train,
  base_models = all_03,
  metalearner_algorithm = "drf",
  metalearner_nfolds = 5,
  metalearner_params = rf_params,
  keep_levelone_frame = TRUE,  # retain the level-one data for inspection
  seed = 123)
stack_all
Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID:  stack_Model_ALL_IDs 
Model Summary for Stacked Ensemble: 
                                         key            value
1                          Stacking strategy cross_validation
2       Number of base models (used / total)            64/64
3           # GBM base models (used / total)              9/9
4           # GLM base models (used / total)            40/40
5           # DRF base models (used / total)              7/7
6  # DeepLearning base models (used / total)              8/8
7                      Metalearner algorithm              DRF
8         Metalearner fold assignment scheme           Random
9                         Metalearner nfolds                5
10                   Metalearner fold_column               NA
11        Custom metalearner hyperparameters {"ntrees": [2000], "max_depth": [25], "sample_rate": [0.8], "stopping_tolerance": [0.001], "stopping_rounds": [3], "stopping_metric": ["RMSE"]}


H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE:  0.2737613
RMSE:  0.523222
MAE:  0.3143622
RMSLE:  0.07102723
Mean Residual Deviance :  0.2737613



H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  3.588121
RMSE:  1.894234
MAE:  0.9595951
RMSLE:  0.2882307
Mean Residual Deviance :  3.588121


Cross-Validation Metrics Summary: 
                           mean       sd cv_1_valid cv_2_valid cv_3_valid
mae                    0.960719 0.099602   0.931848   0.835287   0.914153
mean_residual_deviance 3.647374 1.155167   2.811425   3.013221   2.834578
mse                    3.647374 1.155167   2.811425   3.013221   2.834578
r2                     0.854546 0.051894   0.878888   0.890242   0.899937
residual_deviance      3.647374 1.155167   2.811425   3.013221   2.834578
rmse                   1.892248 0.288902   1.676730   1.735863   1.683620
rmsle                  0.288314 0.024586   0.294013   0.300423   0.245133
                       cv_4_valid cv_5_valid
mae                      1.080815   1.041493
mean_residual_deviance   4.092324   5.485324
mse                      4.092324   5.485324
r2                       0.827205   0.776459
residual_deviance        4.092324   5.485324
rmse                     2.022950   2.342077
rmsle                    0.295891   0.306112
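Because keep_levelone_frame = TRUE was set, the N x L matrix of cross-validated base-model predictions (the level-one data from step 4) is retained on the cluster. Assuming the model stores the frame key in the slot shown below (the exact path can vary across H2O versions), it can be inspected like this:

Code
# Sketch: retrieve the level-one frame (one prediction column per base model plus SOC)
lvl1 <- h2o.getFrame(stack_all@model$levelone_frame_id$name)
dim(lvl1)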

Prediction

Code
stack.test.all<-as.data.frame(h2o.predict(object = stack_all, newdata = h_test))
test.xy$Stack_SOC_all<-stack.test.all$predict
Code
RMSE.all<- Metrics::rmse(test.xy$SOC, test.xy$Stack_SOC_all)
MAE.all<- Metrics::mae(test.xy$SOC, test.xy$Stack_SOC_all)

# Print results
paste0("RMSE: ", round(RMSE.all,2))
[1] "RMSE: 3.27"
Code
paste0("MAE: ", round(MAE.all,2))
[1] "MAE: 2.23"
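Collecting the test-set metrics from both ensembles makes the comparison explicit: the best-of-family stack (RMSE 2.96, MAE 2.13) outperforms the all-model stack (RMSE 3.27, MAE 2.23), suggesting the weaker grid members add more noise than signal.

Code
# Compare the two stacked ensembles on the test set
data.frame(Model = c("Best of Family", "All Models"),
           RMSE  = c(RMSE.best, RMSE.all),
           MAE   = c(MAE.best, MAE.all))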
Code
library(ggpmisc)
# formula again refers to the plot aesthetics (y ~ x)
formula <- y ~ x

ggplot(test.xy, aes(SOC, Stack_SOC_all)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Stack-Ensemble: All Models") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  scale_y_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  # theme settings
  theme(
    panel.background = element_rect(fill = "grey95", colour = "gray75",
                                    linewidth = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x = element_text(size = 13, colour = "black"),
    axis.text.y = element_text(size = 13, angle = 90, vjust = 0.5,
                               hjust = 0.5, colour = "black"))

Further Reading

  1. Stacked Ensembles