A regression tree is a decision tree used for predicting continuous numerical values rather than categorical ones. Like a classification tree, a regression tree consists of internal nodes that represent tests on input features, branches that represent the possible outcomes of those tests, and leaf nodes that represent the final predictions.
The tree is constructed by recursively splitting the data into subsets based on the most informative feature or attribute until a stopping criterion is met. At each split, the goal is to minimize the variance of the target variable within each subset. This is typically done by finding the split that results in the most significant reduction in variance. Once the tree is constructed, it can be used to make predictions by following the path from the root node to a leaf node that corresponds to the input features. The prediction is then the mean or median of the target values of the training examples that belong to the leaf node.
Here are the steps involved in building a regression tree:
1. Select a split point: At each node, the algorithm selects a feature and a split point for that feature that divides the data into two subsets, such that the variance of the target variable within each subset is minimized (a minimal R sketch of this search follows the list).
2. Calculate the split criterion: A split criterion is calculated to measure the quality of the split. The most commonly used criteria are the mean squared error (MSE) and the sum of squared errors (SSE).
3. Repeat steps 1 and 2 for each subset: The algorithm recursively repeats this process on each subset until a stopping criterion is met, such as a maximum tree depth or a minimum number of data points in a node.
4. Prediction: Once the tree is built, a prediction is made by traversing the tree from the root node to a leaf node. The prediction is the mean (or median) value of the target variable within that leaf node.
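The following minimal R sketch illustrates the split search for a single numeric predictor, scoring each candidate split by the reduction in the sum of squared errors (SSE). The toy data and variable names are hypothetical; rpart performs an equivalent search internally across all predictors.

Code
# Toy data: a single numeric predictor x and a continuous target y
set.seed(1)
x <- runif(50, 0, 10)
y <- ifelse(x < 5, 2, 8) + rnorm(50)

# Sum of squared errors around the mean of a vector
sse <- function(v) sum((v - mean(v))^2)

# Candidate split points: midpoints between consecutive sorted x values
xs   <- sort(unique(x))
cuts <- (head(xs, -1) + tail(xs, -1)) / 2

# Total SSE of the two child nodes for every candidate split
split_sse <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))

best_cut <- cuts[which.min(split_sse)]
best_cut                     # chosen split point
sse(y) - min(split_sse)      # reduction in SSE achieved by the best split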
Regression trees have the advantage of being easy to interpret and visualize, and they can handle both categorical and continuous input features. However, they can be prone to overfitting the training data, especially if the tree becomes too deep. Techniques such as pruning and setting a minimum number of examples per leaf node can be used to prevent overfitting.
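Data

In this exercise, we will use the following data set:

[gp_soil_data.csv](https://www.dropbox.com/s/9ikm5yct36oflei/gp_soil_data.csv?dl=0)

Code
library(tidyverse)
# define file from my github
urlfile = "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data.csv"
mf <- read_csv(url(urlfile))
# Create a data-frame
df <- mf %>%
  dplyr::select(SOC, DEM, Slope, TPI, MAT, MAP, NDVI, NLCD, FRG) %>%
  glimpse()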
The rpart package is an R package for building classification and regression trees using the Recursive Partitioning And Regression Trees (RPART) algorithm. This algorithm uses a top-down, greedy approach to recursively partition the data into smaller subsets, where each subset is homogeneous with respect to the target variable.
The rpart package provides functions for fitting and predicting with regression trees, as well as for visualizing the resulting tree. Some of the main functions in the rpart package include:
rpart(): This function is used to fit a regression or classification tree to the data. It takes a formula, a data frame, and optional arguments for controlling the tree-building process.
predict.rpart(): This function is used to predict the target variable for new data using a fitted regression or classification tree.
printcp(): This function is used to print the complexity parameter table for a fitted regression or classification tree. This table shows the cross-validated error rate for different values of the complexity parameter, which controls the size of the tree.
plot(): This function is used to visualize a fitted regression or classification tree. It produces a graphical representation of the tree with the nodes and branches labeled.
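As a quick, self-contained illustration of how these functions work together, the following sketch uses the built-in mtcars data set rather than the soil data analysed below; the settings are only for demonstration.

Code
library(rpart)
# Fit a regression tree; method = "anova" requests a regression (not classification) tree
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
printcp(fit)                           # complexity parameter table with cross-validated error
head(predict(fit, newdata = mtcars))   # predictions for new (here, the training) data
plot(fit); text(fit)                   # basic plot of the fitted tree with labelled nodes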
install.packages("rpart")
install.packages("rpart.plot")
Code
library(rpart)      # for fitting decision trees
library(rpart.plot) # for plotting decision trees
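Split the data

Code
library(tidymodels)
set.seed(1245)  # for reproducibility
split.df <- initial_split(df, prop = 0.8, strata = SOC)
train.df <- split.df %>% training()
test.df  <- split.df %>% testing()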
The rpart() function takes as input a formula specifying the target variable to be predicted and the predictors to be used in the model, as well as a data set containing the observations. It then recursively partitions the data based on the predictors, creating a tree structure in which each internal node represents a decision based on a predictor and each leaf node represents a predicted value for the target variable.
First, we’ll build a large initial regression tree. We can ensure that the tree is large by using a small value for cp, which stands for “complexity parameter.”
We’ll then use the printcp() function to print the results of the model:
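Code
in.fit <- rpart(SOC ~ ., data = train.df,
                control = rpart.control(cp = 0.0001),
                method = "anova")

# view results
printcp(in.fit)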
We can use the plotcp() function to visualize the cross-validated error for each level of tree complexity:
Code
plotcp(in.fit)
Prune the tree
Pruning is a technique used to reduce the size of a regression tree by removing some of the branches that do not contribute much to the prediction accuracy. Pruning can help to avoid overfitting and improve the generalization performance of the model.
Next, we'll prune the regression tree to find the optimal value of cp (the complexity parameter), that is, the value that leads to the lowest test error. The prune() function prunes the tree to the desired complexity level; here, we prune it to the level with the lowest cross-validated error.
Note that the optimal value for cp is the one that leads to the lowest xerror in the previous output, which represents the error on the observations from the cross-validation data.
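First, identify the best cp value to use:

Code
best_cp <- in.fit$cptable[which.min(in.fit$cptable[, "xerror"]), "CP"]
best_cp

We then use this cp value to prune the tree:

Code
pruned.fit <- rpart::prune(in.fit, cp = best_cp)
summary(pruned.fit)

We can use the rpart.plot() function to plot the pruned regression tree:

Code
rpart.plot(pruned.fit)

Prediction

Code
test.df$SOC.tree <- predict(pruned.fit, test.df)
RMSE <- Metrics::rmse(test.df$SOC, test.df$SOC.tree)
RMSE

Code
library(ggpmisc)
formula <- y ~ x
ggplot(test.df, aes(SOC, SOC.tree)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label(c("eq", "adj.R2")), formula = formula) +
  ggtitle("Regression Tree") +
  xlab("Observed") +
  ylab("Predicted") +
  scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  scale_y_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
  theme(panel.background = element_rect(fill = "grey95", colour = "gray75",
                                        size = 0.5, linetype = "solid"),
        axis.line = element_line(colour = "grey"),
        plot.title = element_text(size = 14, hjust = 0.5),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 13, colour = "black"),
        axis.text.y = element_text(size = 13, angle = 90, vjust = 0.5,
                                   hjust = 0.5, colour = "black"))

Regression Tree with Caret Package

Set control parameters

Code
library(caret)
set.seed(123)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                              preProc = c("center", "scale", "nzv"))

Train the model

Code
set.seed(2)
train.rpart <- caret::train(SOC ~ ., data = train.df,
                            method = "rpart",
                            tuneLength = 50,
                            trControl = train.control,
                            tuneGrid = expand.grid(cp = seq(0, 0.4, 0.01)))
train.rpart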
CART
371 samples
8 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 334, 333, 334, 334, 334, 334, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.00 4.509316 2.769299e-01 3.231612
0.01 4.323208 3.072242e-01 3.032395
0.02 4.257813 3.158449e-01 2.998006
0.03 4.314825 2.956942e-01 3.049286
0.04 4.294635 2.931849e-01 3.022898
0.05 4.253645 3.052001e-01 3.017513
0.06 4.201570 3.150752e-01 3.011249
0.07 4.173444 3.175855e-01 3.004081
0.08 4.222658 3.033084e-01 3.043568
0.09 4.388515 2.488704e-01 3.191270
0.10 4.405598 2.423928e-01 3.243067
0.11 4.405598 2.423928e-01 3.243067
0.12 4.405598 2.423928e-01 3.243067
0.13 4.405598 2.423928e-01 3.243067
0.14 4.405598 2.423928e-01 3.243067
0.15 4.405598 2.423928e-01 3.243067
0.16 4.405598 2.423928e-01 3.243067
0.17 4.405598 2.423928e-01 3.243067
0.18 4.405598 2.423928e-01 3.243067
0.19 4.405598 2.423928e-01 3.243067
0.20 4.405598 2.423928e-01 3.243067
0.21 4.405598 2.423928e-01 3.243067
0.22 4.405598 2.423928e-01 3.243067
0.23 4.459716 2.308527e-01 3.287461
0.24 4.480471 2.278248e-01 3.300437
0.25 4.646920 1.945108e-01 3.441228
0.26 4.796143 1.580023e-01 3.587896
0.27 4.921510 1.206834e-01 3.721628
0.28 4.960463 7.138244e-02 3.760534
0.29 4.969457 1.861139e-05 3.789281
0.30 4.960423 NaN 3.787247
0.31 4.960423 NaN 3.787247
0.32 4.960423 NaN 3.787247
0.33 4.960423 NaN 3.787247
0.34 4.960423 NaN 3.787247
0.35 4.960423 NaN 3.787247
0.36 4.960423 NaN 3.787247
0.37 4.960423 NaN 3.787247
0.38 4.960423 NaN 3.787247
0.39 4.960423 NaN 3.787247
0.40 4.960423 NaN 3.787247
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.07.
Code
rpart.plot(train.rpart$finalModel)
Regression Tree with tidymodels
The tidymodels framework provides a comprehensive workflow for building, tuning, and fitting regression tree models while following the principles of the tidyverse.
Split data
Code
library(tidymodels)
set.seed(1245)  # for reproducibility
split <- initial_split(df, prop = 0.8, strata = SOC)
train <- split %>% training()
test  <- split %>% testing()
# Set up a 10-fold cross-validation data set
cv_folds <- vfold_cv(train, v = 10)
Create Recipe
A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis. Before training the model, we can use a recipe to do some preprocessing required by the model.
Code
# load library
library(tidymodels)
# Create a recipe
tree_recipe <- recipe(SOC ~ ., data = train) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_numeric_predictors())
Specify tunable decision tree model
decision_tree() from the parsnip package (installed with tidymodels) defines a model as a set of if/then statements that create a tree-based structure. This function can fit classification, regression, and censored regression models. We will use the rpart engine to create a regression tree model.
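Code
tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth      = tune(),
                            min_n           = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")
tree_model

Define workflow

Code
tree_wf <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(tree_model)

Define possible grid parameters

We use the grid_regular() function of the dials package (installed with tidymodels) to create a grid of tuning parameters:

Code
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
head(tree_grid)

Model tuning via grid search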
Now we will fit models with all the possible parameter values on all our resampled (cv_folds) data sets. tune_grid() from the tune package (installed with tidymodels) computes a set of performance metrics (e.g., accuracy or RMSE) for a pre-defined set of tuning parameters that correspond to a model or recipe across one or more resamples of the data.
We will use the registerDoParallel() function to register the parallel backend with the foreach package.
install.packages("doParallel")
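Code
# register the parallel backend
doParallel::registerDoParallel()

set.seed(345)
tree_tune <- tune_grid(
  tree_wf,
  resamples = cv_folds,
  grid      = tree_grid,
  metrics   = metric_set(rmse, rsq, mae, mape))
tree_tune

Evaluate the model

collect_metrics() of the tune package (installed with tidymodels) obtains and formats the results produced by the tuning function:

Code
collect_metrics(tree_tune)

We can also visualize the tuning parameters:

Code
autoplot(tree_tune) + theme_light(base_family = "IBMPlexSans")

The best tuning parameters

show_best() of the tune package displays the top sub-models and their performance estimates, and select_best() finds the tuning parameter combination with the best performance values:

Code
show_best(tree_tune, "rmse")
select_best(tree_tune, "rmse")

Final tree model

Code
tree_final <- finalize_model(tree_model, select_best(tree_tune, "rmse"))
tree_final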
Decision Tree Model Specification (regression)
Main Arguments:
cost_complexity = 1e-10
tree_depth = 1
min_n = 2
Computational engine: rpart
We can either fit tree_final to the training data using fit(), or fit it to the training/testing split using last_fit(), which returns performance metrics along with the fitted output.
Code
final_fit <- fit(tree_final, SOC ~ ., train)
final_reg <- last_fit(tree_final, SOC ~ ., split)

Code
collect_metrics(final_reg)

Prediction

Code
predict(final_fit, test)

Variable importance plot

The vip() function can plot variable importance scores for the predictors in a model.

install.packages("vip")

Code
library(vip)
final_fit %>%
  vip(geom = "col", aesthetics = list(fill = "midnightblue", alpha = 0.8)) +
  scale_y_continuous(expand = c(0, 0))

Exercise

1. Create an R Markdown document (named homework_13.rmd) in this project and complete all Tasks using the data shown below.
2. Submit all code and output as an HTML document (homework_13.html) before next week's class.

Required R packages

tidyverse, caret, Metrics, tidymodels, vip

Data

1. [bd_soil_update.csv](https://www.dropbox.com/s/jtzycm4kg3lngu3/bd_soil_update.csv?dl=0)

Download the data and save it in your project directory. Use read_csv() to load the data into your R session. For example:

mf <- read_csv("bd_soil_update.csv")

Tasks

1. Create a data frame for a regression tree model of SOC with the following variables for Rajshahi Division: first use filter() to select data from Rajshahi Division, then use select() to create a data frame with the following variables: SOM, DEM, NDVI, NDFI.
2. Fit a regression tree model with grid search using the caret, tidymodels, and h2o packages.
3. Show all steps of data processing, grid search, model fitting, prediction, and VIP.

Further Reading

1. [Tune and interpret decision trees for #TidyTuesday wind turbines](https://juliasilge.com/blog/wind-turbine/)
2. [ISLR8 - Decision Trees in R (Regression)](https://rstudio-pubs-static.s3.amazonaws.com/446972_323b4475ff0749228fe4057c4d7685f5.html)
3. https://www.r-bloggers.com/2021/04/decision-trees-in-r/

YouTube Video

1. Visual Guide to Regression Trees: https://www.youtube.com/watch?v=g9c66TUylZ4
   Source: [StatQuest with Josh Starmer](https://www.youtube.com/@statquest)