[![](Image/github.png){fig-align="left" width="301"}](https://github.com/zia207/r-colab/blob/main/tidy_model_stacking.ipynb)
# Tidymodel Stacking {.unnumbered}
![](Image/stack-tidy.png){width="220"}
[**stacks**](https://stacks.tidymodels.org/index.html) is an R package for model stacking that aligns with the tidymodels framework. Model stacking is an ensembling method that combines the predictions of many models to generate a new model---in this package, an ensemble that generates predictions informed by each of its members.
The process goes something like this (a compact sketch in code follows the list):
1. Define candidate ensemble members using functionality from [rsample](https://rsample.tidymodels.org/), [parsnip](https://parsnip.tidymodels.org/), [workflows](https://workflows.tidymodels.org/), [recipes](https://recipes.tidymodels.org/), and [tune](https://tune.tidymodels.org/).
2. Initialize a `data_stack` object with [`stacks()`](https://stacks.tidymodels.org/reference/stacks.html)
3. Iteratively add candidate ensemble members (base models) to the `data_stack` with [`add_candidates()`](https://stacks.tidymodels.org/reference/add_candidates.html)
4. Evaluate how to combine their predictions with [`blend_predictions()`](https://stacks.tidymodels.org/reference/blend_predictions.html)
5. Fit candidate ensemble members with non-zero stacking coefficients with [`fit_members()`](https://stacks.tidymodels.org/reference/fit_members.html)
6. Predict on new data with [`predict()`](https://rdrr.io/r/stats/predict.html)
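To make these six steps concrete, here is a compact, non-evaluated sketch of the whole pipeline; `my_wflow`, `folds`, and `new_data` are placeholders for a workflow, a set of resamples, and new data.

```{r}
#| eval: false
# 1. tune a candidate workflow, keeping what the data stack needs
candidates <- tune_grid(
  my_wflow,
  resamples = folds,
  control = control_stack_grid()
)

# 2-5. build, blend, and fit the stacked ensemble
model_stack <-
  stacks() %>%
  add_candidates(candidates) %>%
  blend_predictions() %>%
  fit_members()

# 6. predict on new data
predict(model_stack, new_data)
```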
> install.packages("stacks")
#### Load Packages
```{r}
#| warning: false
#| error: false
library(tidymodels)
library(tidyverse)
library(stacks)
```
#### Data
```{r}
#| warning: false
#| error: false
# define file from my github
urlfile = "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data_syn.csv"
mf<-read_csv(url(urlfile))
# Create a data-frame
df<-mf %>% dplyr::select(SOC, DEM, Slope, TPI,MAT, MAP,NDVI, NLCD, FRG)%>%
glimpse()
```
#### Split data
```{r}
#| warning: false
#| error: false
set.seed(1245) # for reproducibility
split <- initial_split(df, prop = 0.8, strata = SOC)
train <- split %>% training()
test <- split %>% testing()
# Set up 5-fold cross-validation
set.seed(136)
cv_folds <- vfold_cv(train, v = 5)
```
### The basic recipe used in all models
```{r}
#| warning: false
#| error: false
soc_rec <-
recipe(SOC ~ ., data = train) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal()) %>%
step_normalize(all_numeric_predictors())
```
### Basic workflow
```{r}
#| warning: false
#| error: false
# Basic workflow
soc_wf <-
workflow() %>%
add_recipe(soc_rec)
```
### Control Wrapper Functions
Supply these light wrappers as the `control` argument in a `tune::tune_grid()`, `tune::tune_bayes()`, or `tune::fit_resamples()` call to return the elements needed for a data stack. These functions return the appropriate control object to ensure that assessment-set predictions, along with information on model specifications and preprocessors, are kept in the resampling results object.

To integrate stack settings with your existing control settings, note that these functions call the appropriate `tune::control_*()` function with the arguments `save_pred = TRUE` and `save_workflow = TRUE`.
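In other words, `control_stack_grid()` is, in effect, a thin wrapper; a minimal non-evaluated sketch of the equivalent call:

```{r}
#| eval: false
# roughly equivalent to control_stack_grid()
manual_ctrl <- tune::control_grid(
  save_pred = TRUE,       # keep assessment-set predictions for the data stack
  save_workflow = TRUE    # keep the workflow (model spec + preprocessor)
)
```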
```{r}
#| warning: false
#| error: false
# convenience functions that set save_pred = TRUE and save_workflow = TRUE
ctrl_grid <- control_stack_grid()        # for tune_grid()
ctrl_res <- control_stack_resamples()    # for fit_resamples()
```
### Base Models
We will use six regression models to create a stacked ensemble model:
#### GLM - Regularized Regression
Generalized Linear Models (GLMs) with regularization, often referred to as regularized regression, extend traditional linear regression by adding regularization terms to the cost function. Regularization helps prevent overfitting and can improve the model's generalization performance by discouraging overly complex models.

In traditional linear regression, the objective is to minimize the sum of squared errors (SSE) between the predicted values and the actual target values. In regularized regression, an additional penalty term is added to the cost function. Two common types of regularization are used: the L1 (lasso) penalty, which shrinks some coefficients exactly to zero, and the L2 (ridge) penalty, which shrinks coefficients toward zero without eliminating them; the elastic net mixes the two.
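In `linear_reg()` with the glmnet engine, `penalty` sets the total amount of regularization and `mixture` sets the blend of the two penalties. A minimal, non-evaluated illustration with arbitrary fixed values in place of `tune()`:

```{r}
#| eval: false
# mixture = 1 -> pure lasso (L1); mixture = 0 -> pure ridge (L2);
# values in between give an elastic net
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

ridge_spec <- linear_reg(penalty = 0.01, mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```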
```{r}
#| warning: false
#| error: false
# Model Specification
glm_spec <-
linear_reg(
penalty = tune(),
    mixture = tune()
) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Workflow
glm_wflow <-
soc_wf %>%
add_model(glm_spec)
# Tune across a grid of 20 candidate parameter sets
set.seed(1)
glm_res <-
tune_grid(
object = glm_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
#### Random Forest
We will use the **ranger** package to build random forest models. ranger is a fast, memory-efficient implementation of the random forest algorithm, known for its speed and scalability. tidymodels provides a comprehensive framework for building, tuning, and evaluating random forest models while following the principles of the tidyverse.
> install.packages("ranger")
```{r}
#| warning: false
#| error: false
# Model Specification
rf_spec <-
rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500
) %>%
set_mode("regression") %>%
set_engine("ranger")
# Workflow
rf_wflow <-
soc_wf %>%
add_model(rf_spec)
# Tune across a grid of 20 candidate parameter sets
set.seed(1)
rf_res <-
tune_grid(
object = rf_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
#### LightGBM
LightGBM is a fast gradient-boosting framework that uses tree-based learning. tidymodels provides a comprehensive framework for building, tuning, and evaluating LightGBM models while following the principles of the tidyverse.
> install.packages("lightgbm")
parsnip does not natively include an implementation for `boost_tree` regression model specifications with the `lightgbm` engine. The parsnip extension package **bonsai** implements support for this specification, so we need to install and load it before continuing.
> install.packages("bonsai")
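A quick way to confirm the engine is available after loading **bonsai** is to list the registered engines for `boost_tree()` (not evaluated here):

```{r}
#| eval: false
library(bonsai)
# "lightgbm" should now appear among the boost_tree engines
parsnip::show_engines("boost_tree")
```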
```{r}
#| warning: false
#| error: false
# Load bonsai so parsnip can use the lightgbm engine
library(bonsai)

# Model Specification
lightgbm_spec <-
boost_tree(
mtry = tune(),
trees = 500,
min_n = tune(),
tree_depth = tune(),
loss_reduction = tune(),
learn_rate = tune(),
sample_size = 0.75
) %>%
set_mode("regression") %>%
set_engine("lightgbm")
# Workflow
lightgbm_wflow <-
soc_wf %>%
add_model(lightgbm_spec)
# Tune across a grid of 20 candidate parameter sets
set.seed(1)
lightgbm_res <-
tune_grid(
object = lightgbm_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
#### Cubist
Cubist is a rule-based regression and classification model that combines decision trees with linear models to capture complex relationships in the data.
**rules** is a **parsnip** extension package with model definitions for rule-based models, including:

-   **cubist** models, which use discrete rule sets containing linear models, combined with an ensemble method similar to boosting

-   **classification rules**, where a ruleset is derived from an initial tree fit

-   **rule-fit** models, which begin with rules extracted from a tree ensemble that are then added to a regularized linear or logistic regression

A short specification sketch for these models follows the engine table below.
> install.packages("rules")

> install.packages("Cubist")
The rules package provides engines for the models in the following table.
![](Image/rule_base%20model.png)
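As a non-evaluated sketch, the three model types listed above correspond to the specification functions `cubist_rules()`, `C5_rules()`, and `rule_fit()`; the argument values below are arbitrary examples.

```{r}
#| eval: false
library(rules)

cubist_rules(committees = 5, neighbors = 3)   # Cubist regression rules
C5_rules(trees = 10)                          # C5.0-based classification rules
rule_fit(penalty = 0.01) %>%                  # RuleFit: tree-derived rules in a
  set_engine("xrf")                           # regularized (logistic) regression
```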
```{r}
#| warning: false
#| error: false
# Load rules so the Cubist engine is available
library(rules)

# Model Specification
cubist_spec <-
cubist_rules(
committees = tune(),
neighbors = tune(),
max_rules = tune()
)
# Workflow
cubist_wflow <-
soc_wf %>%
add_model(cubist_spec)
# Tune across a grid of 20 candidate parameter sets
set.seed(1)
cubist_res <-
tune_grid(
object = cubist_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
#### Support Vector Machine Regression
A Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression tasks. SVMs are particularly effective when working with high-dimensional data and can handle both linear and non-linear relationships between variables.
The main idea behind SVM is to find an optimal hyperplane that separates data points of different classes with the largest possible margin; the data points that lie closest to this decision boundary are called support vectors. For regression, the same machinery (support vector regression) fits a function that stays within a margin of the observed values, and a kernel such as the radial basis function (RBF) used below allows it to capture non-linear relationships.
> install.packages("kernlab")
```{r}
# Model Specification
svm_spec <-
svm_rbf(
cost = tune(),
rbf_sigma = tune()
) %>%
set_engine("kernlab") %>%
set_mode("regression")
# add it to a workflow
svm_wflow <-
soc_wf %>%
add_model(svm_spec)
# tune cost and rbf_sigma and fit to the 5-fold cv
set.seed(1)
svm_res <-
tune_grid(
svm_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
#### Neural Network
A Neural Network (NN) is a type of artificial intelligence model inspired by the structure and functioning of the human brain. It is a powerful machine learning algorithm that can learn complex patterns and relationships in data to perform tasks such as classification, regression, pattern recognition, and more. Here we fit a single-hidden-layer network (a multilayer perceptron) with the nnet engine, tuning the number of hidden units, the weight-decay penalty, and the number of training epochs.
```{r}
#| warning: false
#| error: false
# Model Specification
nnet_spec <-
mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) %>%
set_mode("regression") %>%
set_engine("nnet")
# Workflow
nnet_wflow <-
  soc_wf %>%
  add_model(nnet_spec)

# Tune across a grid of 20 candidate parameter sets
set.seed(1)
nnet_res <-
tune_grid(
object = nnet_wflow,
resamples = cv_folds,
grid = 20,
control = ctrl_grid
)
```
### Fitting all base models
We will use a simple helper function called **finalizer()** from [here](https://www.hfshr.xyz/posts/2020-11-30-model-stacking/) that selects the best hyper-parameters, updates the workflow, fits the final model on the training data with those parameters, and pulls out the test-set metrics.
```{r}
#| warning: false
#| error: false
finalizer <- function(tuned, wkflow, split, model) {
  # pick the hyper-parameters with the lowest cross-validated RMSE
  best_mod <- tuned %>%
    select_best("rmse")
  # plug the winning parameters into the workflow
  final_wf <- wkflow %>%
    finalize_workflow(best_mod)
  # fit on the training portion of the split and evaluate on the test portion
  final_fit <-
    final_wf %>%
    last_fit(split)
  # return the test-set metrics, labelled with the model name
  final_fit %>%
    collect_metrics() %>%
    mutate(model = model)
}
```
We have fitted 120 models, 20 for each type (glm, random forest, lightgbm, cubist, svm, and neural net). We can do a quick check of how well each of these models performed using the `finalizer()` function:
```{r}
bind_rows(
finalizer(glm_res, glm_wflow, split, "glm"),
finalizer(rf_res, rf_wflow, split, "rf"),
finalizer(lightgbm_res, lightgbm_wflow, split, "lightgbm"),
finalizer(cubist_res, cubist_wflow, split, "cubist"),
finalizer(svm_res, svm_wflow, split, "svm"),
  finalizer(nnet_res, nnet_wflow, split, "nnet")
) %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
arrange(rmse)
```
We can see that the Cubist model has the best performance, followed by the random forest.
### Stack Ensemble Model
First, we initialize the stack with `stacks()` and then add each candidate model with `add_candidates()`. Next, we evaluate how to combine the candidates' predictions with `blend_predictions()`, before finally training the members with non-zero stacking coefficients on the training data with `fit_members()`.
```{r}
soc_stack <-
# initialize the stack
stacks() %>%
# add each of the models
add_candidates(glm_res) %>%
add_candidates(rf_res) %>%
add_candidates(lightgbm_res) %>%
add_candidates(cubist_res) %>%
add_candidates(svm_res) %>%
add_candidates(nnet_res) %>%
  blend_predictions(
    penalty = seq(0.5, 5, length.out = 50),  # candidate penalty values for the meta-learner
    mixture = 0,                             # ridge-type blending of member predictions
    metric = metric_set(rmse)
  ) %>%                                      # evaluate how to combine candidate members
  fit_members()                              # fit the members with non-zero stacking coefficients
```
```{r}
soc_stack
```
```{r}
#| warning: false
#| error: false
#| fig.width: 4.5
#| fig.height: 4.5
library(see) # for nice theme
autoplot(soc_stack, type = "weights") +
theme_lucid()
```
### Prediction
#### Stack Ensemble
```{r}
#| warning: false
#| error: false
#| fig.width: 4.5
#| fig.height: 4.5
reg_metrics <- metric_set(rmse, rsq)
# get predictions with stack
soc_pred <- predict(soc_stack, test) %>%
bind_cols(test)
```
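With `reg_metrics` defined above, we can score the ensemble's test-set predictions directly (assuming `soc_pred` from the previous chunk):

```{r}
#| warning: false
#| error: false
# RMSE and R-squared of the stacked ensemble on the test set
soc_pred %>%
  reg_metrics(truth = SOC, estimate = .pred)
```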
```{r}
ggplot(soc_pred, aes(x = SOC, y = .pred)) +
geom_point(alpha = 0.4) +
coord_obs_pred() +
labs(x = "Observed", y = "Predicted") +
geom_abline(linetype = "dashed") +
theme_lucid()
```
#### All candidate member models
```{r}
member_preds <-
test %>%
select(SOC) %>%
bind_cols(
predict(
soc_stack,
test,
members = TRUE
)
)
```
### Model Performances
#### RMSE
```{r}
colnames(member_preds) %>%
map_dfr(
.f = rmse,
truth = SOC,
data = member_preds
) %>%
mutate(member = colnames(member_preds))
```
#### R2
```{r}
colnames(member_preds) %>%
map_dfr(
.f = rsq,
truth = SOC,
data = member_preds
) %>%
mutate(member = colnames(member_preds))
```
### Further Reading
1. [A Gentle Introduction to Tidy Model Stacking](https://blog--simonpcouch.netlify.app/blog/gentle-intro-stacks/)
2. [Ensembles of Models](https://www.tmwr.org/ensembles.html)
3. [When one model is not enough: Stacking models with stacks](https://www.hfshr.xyz/posts/2020-11-30-model-stacking/)