Gradient Boosting Machine (GBM)

Gradient Boosting Machine (GBM) is a powerful and popular supervised learning algorithm used for both regression and classification tasks. It belongs to the family of ensemble methods, which combine multiple weak learners to form a stronger model. The idea behind gradient boosting is to iteratively improve a model’s predictions by fitting a new model to the residuals or errors of the previous models.

Whereas random forests build an ensemble of deep, independent trees, GBMs build an ensemble of shallow, weak trees in sequence, with each tree learning from and improving on the previous one. Combined, these many weak successive trees produce a powerful “committee” that is often hard to beat with other algorithms.

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.

Here’s how GBM works (a minimal R sketch of this loop follows the list):

  1. Initially, a simple model is fitted to the data. This can be any model such as a decision tree, linear regression, or logistic regression.

  2. The model’s predictions are compared with the actual values of the target variable, and the errors or residuals are calculated.

  3. A new model is then fitted to these residuals, with the aim of reducing the errors made by the first model. The new model is usually a decision tree, but can also be another type of model.

  4. The predictions of this new model are added to the predictions of the previous model, and the errors are recalculated.

  5. Steps 3 and 4 are repeated iteratively, with each new model attempting to reduce the errors of the previous model. The final predictions are the sum of the predictions of all the models.

  6. To avoid overfitting, the model can be regularized by adding constraints on the size or complexity of the trees, or by using early stopping.
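The residual-fitting loop in steps 1–5 can be sketched in a few lines of R. This is a toy illustration only, using the rpart package and a small synthetic data set (both assumptions, not part of the tutorial data); real implementations such as gbm add row subsampling, tree constraints, and many other refinements.

Code
library(rpart)

# Toy data, assumed for illustration only
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + dat$x2 + rnorm(200, sd = 0.1)

n_trees    <- 100   # number of boosting iterations
learn_rate <- 0.1   # shrinkage applied to each new tree

# Step 1: start from a simple model (here, the mean of y)
pred  <- rep(mean(dat$y), nrow(dat))
trees <- vector("list", n_trees)

for (m in seq_len(n_trees)) {
  # Step 2: residuals (errors) of the current ensemble
  dat$resid <- dat$y - pred
  # Step 3: fit a shallow tree to the residuals
  trees[[m]] <- rpart(resid ~ x1 + x2, data = dat,
                      control = rpart.control(maxdepth = 2))
  # Step 4: add the shrunken predictions of the new tree to the ensemble
  pred <- pred + learn_rate * predict(trees[[m]], dat)
}

# Step 5: the final prediction is the sum of all stage-wise contributions
head(pred)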

Gradient boosting has several advantages over other methods, such as its ability to handle different types of data (e.g. numerical, categorical), its robustness to outliers and noise, and its high accuracy on many problems. However, it can also be computationally expensive and sensitive to hyperparameters, such as the learning rate, the number of trees, and the depth of the trees.

Difference between Random Forest and GBM

Random Forest (RF) and Gradient Boosting Machine (GBM) are both popular ensemble methods used for supervised learning tasks. While they share some similarities, there are several key differences between the two algorithms:

  1. Sampling approach: RF builds multiple decision trees on randomly sampled subsets of the data with replacement, while GBM builds trees sequentially on the residuals or errors of the previous trees.

  2. Learning process: RF trees are built independently, while GBM trees are built in a sequential manner, with each new tree attempting to improve upon the errors of the previous trees.

  3. Bias-variance tradeoff: RF tends to have a lower variance and higher bias, while GBM tends to have a lower bias and higher variance.

  4. Model interpretability: RF is generally more interpretable, as the feature importances can be easily calculated based on the frequency of their use in the trees. GBM, on the other hand, can be more challenging to interpret, as the importance of each feature depends on the interaction with other features and the specific stage of the boosting process.

  5. Performance: Both RF and GBM can perform well on a wide range of problems, but their performance may depend on the specific characteristics of the data and the problem at hand.

In general, RF may be a good choice for problems where model interpretability is important, or where there are many noisy or irrelevant features. GBM may be a good choice for problems where high accuracy is desired, and where the data has complex interactions between features. However, the choice between RF and GBM ultimately depends on the specific characteristics of the problem and the goals of the analysis.

Gradient Boosting Machine (GBM) with gbm package

The gbm R package is an implementation of extensions to Freund and Schapire’s AdaBoost algorithm and Friedman’s gradient boosting machine. This is the original R implementation of GBM.

To get started with the gbm package, you can install it from CRAN using the following code:

install.packages("gbm")

Code
library(gbm)
Data

In this exercise we will use the following data set and fit a GBM regression model for soil organic carbon (SOC), using DEM, Slope, Aspect, TPI, KFactor, SiltClay, MAT, MAP, NDVI, NLCD, and FRG as predictors.

gp_soil_data.csv

Code
library(tidyverse)
# define data folder
dataFolder<-"E:/Dropbox/GitHub/Data/USA/"
# Load data
mf<-read_csv(paste0(dataFolder, "gp_soil_data.csv"))
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, Aspect, TPI, KFactor, SiltClay, MAT, MAP,NDVI, NLCD, FRG)%>%
    glimpse()
Rows: 467
Columns: 12
$ SOC      <dbl> 15.763, 15.883, 18.142, 10.745, 10.479, 16.987, 24.954, 6.288…
$ DEM      <dbl> 2229.079, 1889.400, 2423.048, 2484.283, 2396.195, 2360.573, 2…
$ Slope    <dbl> 5.6716146, 8.9138117, 4.7748051, 7.1218114, 7.9498644, 9.6632…
$ Aspect   <dbl> 159.1877, 156.8786, 168.6124, 198.3536, 201.3215, 208.9732, 2…
$ TPI      <dbl> -0.08572358, 4.55913162, 2.60588670, 5.14693117, 3.75570583, …
$ KFactor  <dbl> 0.31999999, 0.26121211, 0.21619999, 0.18166667, 0.12551020, 0…
$ SiltClay <dbl> 64.84270, 72.00455, 57.18700, 54.99166, 51.22857, 45.02000, 5…
$ MAT      <dbl> 4.5951686, 3.8599243, 0.8855000, 0.4707811, 0.7588266, 1.3586…
$ MAP      <dbl> 468.3245, 536.3522, 859.5509, 869.4724, 802.9743, 1121.2744, …
$ NDVI     <dbl> 0.4139390, 0.6939532, 0.5466033, 0.6191013, 0.5844722, 0.6028…
$ NLCD     <chr> "Shrubland", "Shrubland", "Forest", "Forest", "Forest", "Fore…
$ FRG      <chr> "Fire Regime Group IV", "Fire Regime Group IV", "Fire Regime …
Convert to factor
Code
df$NLCD <- as.factor(df$NLCD)
df$FRG <- as.factor(df$FRG)
Split Data

We use the rsample package (installed as part of tidymodels) to split the data into training (80%) and test (20%) sets with stratified random sampling. initial_split() creates a single binary split of the data into a training set and a testing set.

Note

Stratified random sampling is a technique for selecting a representative sample from a population, where the sample is chosen in a way that ensures that certain subgroups within the population are adequately represented in the sample.

Code
#
#| fig.width: 4
#| fig.height: 4

library(tidymodels)
set.seed(1245)   # for reproducibility
split <- initial_split(df, prop = 0.8, strata = SOC)
train <- split %>% training()
test <-  split %>% testing()

# Density plot all, train and test data 
ggplot()+
  geom_density(data = df, aes(SOC))+
  geom_density(data = train, aes(SOC), color = "green")+
  geom_density(data = test, aes(SOC), color = "red") +
      xlab("Soil Organic Carbon (kg/g)") + 
     ylab("Density") 

Feature Scaling
Code
train[-c(1, 11,12)] = scale(train[-c(1,11,12)])
test[-c(1, 11,12)] = scale(test[-c(1,11,12)])
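Note that the code above scales the training and test sets independently. An alternative worth knowing (not used for the results below) is to compute the centering and scaling parameters on the training set and apply them to the test set. A sketch, assuming the same column indices as above:

Code
# Scale predictors with training-set means and standard deviations only
num_cols <- setdiff(seq_along(train), c(1, 11, 12))   # numeric predictor columns
train_scaled <- scale(train[num_cols])
train[num_cols] <- as.data.frame(train_scaled)
test[num_cols]  <- as.data.frame(scale(test[num_cols],
                                       center = attr(train_scaled, "scaled:center"),
                                       scale  = attr(train_scaled, "scaled:scale")))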
Fit a GBM model

We will fit a tree-based GBM model with a few parameters using the gbm() function:

Code
set.seed(123)
# train GBM model
gbm.fit <- gbm(
          formula = SOC ~ ., 
          data = train,             # dataframe
          distribution = "gaussian",# a character string specifying  data distribution
          n.trees = 5000,           #  the total number of trees to fit, 
                                    # default 100
          interaction.depth = 1,    # maximum depth of each tree
                                    # A value of 1 implies an additive model, 
                                    # a value of 2 implies a model with up to 2-way interactions, etc. 
                                    # Default is 1
          shrinkage = 0.001,        # a shrinkage parameter applied to each tree in the expansion. 
                                    # Also known as the learning rate or step-size reduction; 
                                    # 0.001 to 0.1 usually work,
                                    # but a smaller learning rate typically requires more trees. 
                                    # Default is 0.1.
          bag.fraction = 0.7,       # the fraction of the training set observations randomly selected 
                                    # to propose the next tree in the expansion. 
                                    # This introduces randomness into the model fit. 
                                    # If bag.fraction < 1 then running the same model twice 
                                    # will result in similar but different fits.
                                    # gbm uses the R random number generator so set.seed 
                                    # can ensure that the model can be reconstructed.
                                    # Preferably, the user can save the returned gbm.object using save. 
                                    # Default is 0.5
          cv.folds = 5,             # cross-validation folds
          n.cores = NULL,           # will use all cores by default
          verbose = F               # logical; whether to print out progress and performance indicators
         )  
Code
gbm.fit
gbm(formula = SOC ~ ., distribution = "gaussian", data = train, 
    n.trees = 5000, interaction.depth = 1, shrinkage = 0.001, 
    bag.fraction = 0.7, cv.folds = 5, verbose = F, n.cores = NULL)
A gradient boosted model with gaussian loss function.
5000 iterations were performed.
The best cross-validation iteration was 4999.
There were 11 predictors of which 11 had non-zero influence.
Get MSE and compute RMSE
Code
# get MSE and compute RMSE
sqrt(min(gbm.fit$cv.error))
[1] 4.048704
Plot loss function as a result of n trees added to the ensemble

gbm.perf() estimates the optimal number of boosting iterations for a gbm object and optionally plots various performance measures.

Code
gbm.perf(gbm.fit, method = "cv")

[1] 4999
Prediction
Code
test$SOC.pred = predict(gbm.fit, newdata = test)
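When n.trees is not supplied, predict() for a gbm object chooses a number of trees on its own (typically the gbm.perf() estimate). To make that choice explicit, the cross-validation iteration can be passed directly; a short sketch:

Code
# Pass the CV-selected number of trees explicitly rather than relying on the default
best_iter <- gbm.perf(gbm.fit, method = "cv", plot.it = FALSE)
test$SOC.pred <- predict(gbm.fit, newdata = test, n.trees = best_iter)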
Code
library(Metrics)
RMSE<- Metrics::rmse(test$SOC, test$SOC.pred)
RMSE
[1] 3.55045
Code
library(ggpmisc)
formula<-y~x

ggplot(test, aes(SOC,SOC.pred)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Gradient Boosting Machine (GBM)") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Theme settings
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

Variable importance

method = relative.influence: At each split in each tree, gbm computes the improvement in the split criterion (MSE for regression). gbm then averages the improvement made by each variable across all the trees in which that variable is used. The variables with the largest average decrease in MSE are considered most important.

method = permutation.test.gbm: For each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values for each variable (one at a time) are randomly permuted and the accuracy is computed again. The decrease in accuracy resulting from this random “shaking up” of variable values is averaged over all the trees for each variable. The variables with the largest average decrease in accuracy are considered most important.

Code
summary(
  gbm.fit, 
  cBars = 10,
  method = relative.influence, # also can use permutation.test.gbm
  las = 2
  )

              var     rel.inf
NDVI         NDVI 51.63209061
Slope       Slope 16.82052947
KFactor   KFactor  9.07920252
MAP           MAP  8.68191448
MAT           MAT  6.55400113
TPI           TPI  4.24848710
SiltClay SiltClay  1.33348869
DEM           DEM  1.30516352
Aspect     Aspect  0.16689032
NLCD         NLCD  0.11403561
FRG           FRG  0.06419656

An alternative approach is to use the vip package, which produces ggplot2-based plots. vip also provides an additional measure of variable importance based on partial dependence and is a common variable-importance plotting framework for many machine learning models.

Code
library(vip)
vip(gbm.fit)

Code
#  remove all object
rm(list = ls())

GBM with h2o

h2o includes an implementation of the gradient boosting machine (GBM) that can be used for both regression and classification problems. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way: each tree is built in parallel.

Fit the Best GBM model with hyperparameter tuning

Data
Code
library(tidyverse)

# define data folder
dataFolder<-"E:/Dropbox/GitHub/Data/USA/"
# Load data
mf<-read_csv(paste0(dataFolder, "gp_soil_data.csv"))
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, Aspect, TPI, KFactor, SiltClay, MAT, MAP,NDVI, NLCD, FRG)%>%
    glimpse()
Rows: 467
Columns: 12
$ SOC      <dbl> 15.763, 15.883, 18.142, 10.745, 10.479, 16.987, 24.954, 6.288…
$ DEM      <dbl> 2229.079, 1889.400, 2423.048, 2484.283, 2396.195, 2360.573, 2…
$ Slope    <dbl> 5.6716146, 8.9138117, 4.7748051, 7.1218114, 7.9498644, 9.6632…
$ Aspect   <dbl> 159.1877, 156.8786, 168.6124, 198.3536, 201.3215, 208.9732, 2…
$ TPI      <dbl> -0.08572358, 4.55913162, 2.60588670, 5.14693117, 3.75570583, …
$ KFactor  <dbl> 0.31999999, 0.26121211, 0.21619999, 0.18166667, 0.12551020, 0…
$ SiltClay <dbl> 64.84270, 72.00455, 57.18700, 54.99166, 51.22857, 45.02000, 5…
$ MAT      <dbl> 4.5951686, 3.8599243, 0.8855000, 0.4707811, 0.7588266, 1.3586…
$ MAP      <dbl> 468.3245, 536.3522, 859.5509, 869.4724, 802.9743, 1121.2744, …
$ NDVI     <dbl> 0.4139390, 0.6939532, 0.5466033, 0.6191013, 0.5844722, 0.6028…
$ NLCD     <chr> "Shrubland", "Shrubland", "Forest", "Forest", "Forest", "Fore…
$ FRG      <chr> "Fire Regime Group IV", "Fire Regime Group IV", "Fire Regime …
Convert to factor
Code
df$NLCD <- as.factor(df$NLCD)
df$FRG <- as.factor(df$FRG)
Data split
Code
library(tidymodels)
set.seed(1245)   # for reproducibility
split.df <- initial_split(df, prop = 0.8, strata = SOC)
train <- split.df %>% training()
test <-  split.df %>% testing()
Feature Scaling
Code
train[-c(1, 11,12)] = scale(train[-c(1,11,12)])
test[-c(1, 11,12)] = scale(test[-c(1,11,12)])
Import h2o
Code
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "198g", enable_assertions = FALSE) 
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         10 seconds 859 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.40.0.4 
    H2O cluster version age:    3 months and 23 days 
    H2O cluster name:           H2O_started_from_R_zahmed2_qtu738 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   198.00 GB 
    H2O cluster total cores:    40 
    H2O cluster allowed cores:  40 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.3.1 (2023-06-16 ucrt) 
Code
#disable progress bar for RMarkdown
h2o.no_progress() 
# Optional: remove anything from previous session
h2o.removeAll()   
Import data to h2o cluster
Code
h_df=as.h2o(df)
h_train = as.h2o(train)
h_test = as.h2o(test)
Code
CV.xy<- as.data.frame(h_train)
test.xy<- as.data.frame(h_test)
Define response and predictors
Code
y <- "SOC"
x <- setdiff(names(h_df), y)
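As a baseline before the grid search, a single GBM could be fit with h2o.gbm(). The sketch below uses illustrative (untuned) parameter values, and the baseline_GBM name is only an example:

Code
# A single, untuned H2O GBM as a baseline; parameter values are illustrative only
baseline_GBM <- h2o.gbm(x = x, y = y,
                        training_frame = h_train,
                        ntrees = 500,
                        max_depth = 5,
                        learn_rate = 0.05,
                        sample_rate = 0.7,
                        nfolds = 5,
                        seed = 42)
# Test-set performance of the baseline model
h2o.performance(baseline_GBM, h_test)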
Define GBM Hyper-parameters
Code
GBM_hyper_params = list( ntrees = c(100,500, 1000, 5000),
                     max_depth = c(2,5,8),
                     min_rows = c(8,10, 12),
                     learn_rate = c(0.001,0.05,0.06),
                     sample_rate = c(0.4,0.5,0.6, 0.7),
                     col_sample_rate = c(0.4,0.5,0.6),
                     col_sample_rate_per_tree = c(0.4,0.5,0.6))
GBM_hyper_params
$ntrees
[1]  100  500 1000 5000

$max_depth
[1] 2 5 8

$min_rows
[1]  8 10 12

$learn_rate
[1] 0.001 0.050 0.060

$sample_rate
[1] 0.4 0.5 0.6 0.7

$col_sample_rate
[1] 0.4 0.5 0.6

$col_sample_rate_per_tree
[1] 0.4 0.5 0.6
GBM Search Criteria
Code
search_criteria <- list(strategy = "RandomDiscrete", 
                        max_models = 40,
                        max_runtime_secs = 300,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 2,
                        seed = 42)
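The grid search itself is launched with h2o.grid(). The exact call used for the results below is not shown; the sketch that follows is consistent with the objects referenced afterwards (the grid id "GBM_grid_ID", the hyper-parameter list, the search criteria, and 5-fold cross-validation), but treat it as an assumption rather than the original code:

Code
# Launch the random grid search (a sketch; the original call is not shown)
GBM_grid <- h2o.grid(algorithm = "gbm",
                     grid_id = "GBM_grid_ID",
                     x = x,
                     y = y,
                     training_frame = h_train,
                     nfolds = 5,
                     hyper_params = GBM_hyper_params,
                     search_criteria = search_criteria,
                     seed = 42)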
Best GBM Model
Code
# Number of models built by the grid search
length(GBM_grid@model_ids)
[1] 31
Code
# Get the grid results, sorted by RMSE
GBM_get_grid <- h2o.getGrid("GBM_grid_ID",
                           sort_by="RMSE",
                           decreasing=F)
# Get the best GBM model
best_GBM <- h2o.getModel(GBM_get_grid@model_ids[[1]])
Model performance
Code
# training performance
h2o.performance(best_GBM, h_train)
H2ORegressionMetrics: gbm

MSE:  11.50257
RMSE:  3.391544
MAE:  2.436229
RMSLE:  0.4559464
Mean Residual Deviance :  11.50257
Code
# CV-performance
h2o.performance(best_GBM, xval=TRUE)
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  15.76155
RMSE:  3.970082
MAE:  2.864454
RMSLE:  0.5191092
Mean Residual Deviance :  15.76155
Code
# test performance
h2o.performance(best_GBM, h_test)
H2ORegressionMetrics: gbm

MSE:  11.42624
RMSE:  3.380273
MAE:  2.472756
RMSLE:  0.5001688
Mean Residual Deviance :  11.42624
Prediction
Code
test.pred.GBM<-as.data.frame(h2o.predict(object = best_GBM, newdata = h_test))
test.xy$GBM_SOC<-test.pred.GBM$predict

We can plot the observed and predicted values with a fitted regression line using ggplot2:

Code
library(ggpmisc)
formula<-y~x

ggplot(test.xy, aes(SOC,GBM_SOC)) +
  geom_point() +
  geom_smooth(method = "lm")+
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("H2O: The Best GBM: Observed vs Predicted SOC ") +
  xlab("Observed") + ylab("Predicted") +
  scale_x_continuous(limits=c(0,25), breaks=seq(0, 25, 5))+ 
  scale_y_continuous(limits=c(0,25), breaks=seq(0, 25, 5)) +
  # Theme settings
  theme(
    panel.background = element_rect(fill = "grey95",colour = "gray75",size = 0.5, linetype = "solid"),
    axis.line = element_line(colour = "grey"),
    plot.title = element_text(size = 14, hjust = 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x=element_text(size=13, colour="black"),
    axis.text.y=element_text(size=13,angle = 90,vjust = 0.5, hjust=0.5, colour='black'))

Model Explainability

Model explainability refers to the ability to understand and interpret the decisions made by a machine learning model. In other words, it is the ability to explain how a model arrives at its predictions or classifications.

Explainability is particularly important in applications where decisions made by the model have significant real-world consequences, such as in healthcare, finance, and legal fields. It is also important for regulatory compliance, where models must be auditable and transparent.

The h2o.explain() function generates a list of explanations – individual units of explanation such as a Partial Dependence plot or a Variable Importance plot. Most of the explanations are visual – these plots can also be created by individual utility functions outside the h2o.explain() function.
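For example, the full set of default explanations for the best model on the test frame could be generated with one call (a sketch; it can take a while to render all plots):

Code
# Generate all default explanation plots for the best model on the test data
h2o.explain(best_GBM, h_test)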

Retrieve the variable importance.
Code
h2o.varimp(best_GBM)
Variable Importances: 
   variable relative_importance scaled_importance percentage
1      NDVI       554005.062500          1.000000   0.247437
2       MAP       408371.750000          0.737126   0.182392
3     Slope       273328.375000          0.493368   0.122078
4       DEM       201504.187500          0.363723   0.089998
5   KFactor       201060.859375          0.362922   0.089800
6       MAT       181140.406250          0.326965   0.080903
7      NLCD       117436.132812          0.211977   0.052451
8  SiltClay       117232.757812          0.211610   0.052360
9       TPI        89372.101562          0.161320   0.039917
10   Aspect        57945.554688          0.104594   0.025880
11      FRG        37576.152344          0.067826   0.016783
Variable Importance Plot
Code
h2o.varimp_plot(best_GBM)

SHAP Local Explanation

SHAP (SHapley Additive exPlanations) values are a method for explaining the output of any machine learning model. SHAP values provide a way to attribute the prediction of an individual feature to its contribution to the final prediction, taking into account the interaction with other features in the model.

SHAP values are based on the Shapley value from cooperative game theory, which attributes a value to each player in a game based on their contribution to the game’s outcome. In the context of machine learning, the “players” are the input features of the model, and the “game” is the prediction made by the model.

The SHAP value for a particular feature is calculated by comparing the model’s prediction for a specific data point with and without that feature’s value included. This comparison is done for all possible subsets of features, and the contributions of each feature are averaged using the Shapley value formula.
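For reference, this averaging is the Shapley value formula (standard notation, not specific to H2O): for a model $f$, feature set $F$, and feature $i$,

$$\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\,\Big[f\big(S \cup \{i\}\big) - f(S)\Big]$$

where $f(S)$ denotes the model’s prediction using only the features in subset $S$.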

The resulting SHAP values represent the contribution of each feature to the final prediction, with positive values indicating a positive impact on the prediction and negative values indicating a negative impact.

SHAP values can be used to provide insights into how a model is making its predictions and to identify which features are most important for a particular prediction. They can also be used to identify bias in a model and to ensure that the model is making predictions fairly and transparently.

H2O implements TreeSHAP which, when features are correlated, can increase the contribution of a feature that had no influence on the prediction.

Code
h2o.shap_summary_plot(best_GBM, h_train)
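The heading above refers to local explanations; the summary plot aggregates SHAP contributions over many rows. A single observation can also be explained (a sketch, using the first row of the test frame as an arbitrary example):

Code
# SHAP contributions for a single (local) prediction: row 1 of the test frame
h2o.shap_explain_row_plot(best_GBM, h_test, row_index = 1)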

Partial Dependence (PD) Plots

A partial dependence plot (PDP) is a graphical tool for understanding the relationship between a particular input feature and the output of a machine learning model.

A PDP shows the marginal effect of a single feature on the predicted outcome while holding all other features at a fixed value or their average value. The PDP can help to visualize the shape and direction of the relationship between the feature and the output, and can also help to identify any non-linearities or interactions between the feature and other features in the model.

To create a PDP, the value of the feature of interest is varied over its range, and the model’s predicted output is recorded for each value. The resulting data is then plotted on a graph, with the feature’s value on the x-axis and the predicted output on the y-axis.

PDPs can be used to gain insights into how a model is making its predictions and to identify which features are most important for the model’s output. They can also be used to identify potential biases in the model or to detect interactions between features that may be difficult to detect using other methods.

Code
h2o.pd_multi_plot(best_GBM, h_train, "NDVI")

Residual Analysis

Residual analysis plots the fitted values against the residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see “striped” lines of residuals, that is an artifact of having an integer-valued (rather than real-valued) response variable.

Code
h2o.residual_analysis_plot(best_GBM, h_train)

Code
#  remove all object
rm(list = ls())

Exercise

  1. Create an R Markdown document (named homework_14.Rmd) in this project and complete all tasks using the data shown below.

  2. Submit all code and output as an HTML document (homework_14.html) before next week’s class.

Required R-Package

tidyverse, caret, Metrics, tidymodels, vip, gbm, h2o

Data

  1. bd_soil_update.csv

Download the data and save it in your project directory. Use read_csv() to load the data into your R session. For example:

mf<-read_csv("bd_soil_update.csv")

Tasks

  1. Create a data frame with the following variables for Rajshahi Division:

    First use filter() to select the data from Rajshahi Division and then use select() to create a data frame with the following variables:

    SOM, DEM, NDVI, NDFI

  2. Fit a GBM model using the gbm package and show its model performance.

  3. Find the best GBM model with a grid search using h2o, showing all steps as in the tutorial.

Further Reading

  1. Gradient Boosting Machines

YouTube Video

  1. GBM Regression Main Ideas

Source: StatQuest with Josh Starmer