2. Regularized Logistic Regression

Logistic regression is a fundamental machine learning technique for modeling binary outcomes, such as predicting whether an email is spam. While traditional logistic regression is effective, it can overfit on high-dimensional datasets or highly correlated predictors. Regularization addresses this by adding a penalty term to the model, shrinking the coefficients to prevent overfitting and improve generalization. In this tutorial, we explore regularized logistic regression in R using two approaches: (1) fitting a regularized logistic model from scratch, working through the underlying mathematics and writing custom R code based on optimization techniques; and (2) using the {glmnet} package to efficiently implement L1 (Lasso), L2 (Ridge), and Elastic Net regularization. This tutorial balances theory and practice, helping you grasp the core concepts of regularized logistic regression while equipping you with the skills to apply it effectively in R.

Overview

Regularized logistic regression, also known as penalized logistic regression, is an improved version of traditional logistic regression that employs regularization techniques to prevent overfitting and enhance model generalization. In logistic regression, the aim is to model the probability of a binary outcome based on one or more predictor variables. However, traditional logistic regression models may encounter high variance and poor generalization performance when dealing with datasets with many predictors or multicollinearity among predictors.

Regularization addresses these issues by adding a penalty term to the logistic regression objective function that penalizes the magnitudes of the regression coefficients. This penalty encourages the model to shrink the coefficients towards zero or enforce sparsity, effectively reducing the complexity of the model and preventing overfitting. As a result, regularized logistic regression models can achieve better generalization performance and are less prone to overfitting than traditional logistic regression models. Two types of regularization are commonly used in logistic regression: L1 regularization (Lasso) and L2 regularization (Ridge). Elastic Net regularization combines both L1 and L2 penalties and balances feature selection against coefficient shrinkage.

In ridge regression, the penalty term shrinks the coefficients that contribute most to the error. Reducing the magnitude of these coefficients decreases the impact of the features that inflate the error.

LASSO regression takes a different approach: rather than merely shrinking the coefficients, it can set some of them exactly to zero. This property makes LASSO an effective feature selector, retaining the most important predictors while eliminating features that contribute little to the model's predictive power.

Penalized logistic regression extends traditional logistic regression by adding a penalty term to the log-likelihood function. This penalty discourages large coefficients and helps prevent overfitting.

The logit link function in glm() models the log-odds:

\[ \text{logit}(P) = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \],

where:

  • \(P = P(Y = 1 \mid \mathbf{X}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\) is the probability of success.

  • \(\beta_0\): Intercept term.

  • \(\beta_j\): Coefficients for predictors \(X_j\).
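
As a quick illustration, an unpenalized logistic model with this logit link can be fit with glm(). A minimal sketch, where df, y, x1, and x2 are hypothetical placeholders for your own data:

Code
# Unpenalized logistic regression with the logit link
# (df, y, x1, x2 are hypothetical placeholders for your own data)
fit <- glm(y ~ x1 + x2, data = df, family = binomial(link = "logit"))
coef(fit)  # estimated coefficients on the log-odds scale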

The log-likelihood function for logistic regression is:

\[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ Y_i \ln(\hat{P}_i) + (1 - Y_i) \ln(1 - \hat{P}_i) \right] \],

where \(\hat{P}_i = P(Y_i = 1 \mid \mathbf{X}_i)\).

To perform penalized logistic regression, a penalty term is added to the negative log-likelihood to shrink coefficients \(\beta_j\).

The penalized negative log-likelihood is:

\[ \text{Penalized Loss} = -\ell(\boldsymbol{\beta}) + \lambda \cdot P(\boldsymbol{\beta}) \]

where:

  • \(P(\boldsymbol{\beta})\): Penalty function based on the coefficients.

  • \(\lambda\): Regularization strength, controlling the trade-off between goodness of fit and penalty.

L1 Regularization (Lasso):

\[ P(\boldsymbol{\beta}) = \sum_{j=1}^p |\beta_j|\]

Encourages sparsity (some coefficients shrink to zero). Suitable for feature selection.

L2 Regularization (Ridge):

\[ P(\boldsymbol{\beta}) = \sum_{j=1}^p \beta_j^2\]

Shrinks coefficients but keeps all predictors. Helps handle multicollinearity.

Elastic Net (Combination of L1 and L2):

\[ P(\boldsymbol{\beta}) = \alpha \sum_{j=1}^p |\beta_j| + (1 - \alpha) \sum_{j=1}^p \beta_j^2 \]

where \(0 \leq \alpha \leq 1\) balances L1 and L2 penalties.

Penalized logistic regression maximizes the penalized log-likelihood:

\[ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \left( \ell(\boldsymbol{\beta}) - \lambda \cdot P(\boldsymbol{\beta}) \right) \]

This optimization is performed numerically. In R, packages like glmnet handle this process efficiently.
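
As a preview of the second half of this tutorial, this entire penalized fit reduces to a couple of calls in {glmnet}. A minimal sketch, assuming a numeric predictor matrix x and a binary response vector y:

Code
library(glmnet)

# alpha = 1 gives Lasso, alpha = 0 gives Ridge, 0 < alpha < 1 gives Elastic Net
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # cross-validate lambda
coef(cvfit, s = "lambda.min")  # coefficients at the lambda minimizing CV error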

Regularized Logistic Model from Scratch

Building a regularized logistic regression model (ridge, lasso, and elastic net) from scratch in R without any external packages involves implementing four pieces: simulating data, defining the log-likelihood and penalty functions, writing a coordinate descent solver, and cross-validating the hyperparameters.

Here’s how this can be done:

Simulated Data

Simulate a dataset with a binary response variable and multiple predictor variables. The response variable is generated using the logistic function with the true coefficients. The predictor matrix X is generated from a standard normal distribution.

Code
# Set seed for reproducibility
set.seed(123)

# Step 1: Generate synthetic data for binary logistic regression
n <- 200  # Number of observations
p <- 10   # Number of features
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # Design matrix (n x p)
beta_true <- c(3, -2, 1, rep(0, p-3))  # True coefficients (sparse)
y_prob <- 1 / (1 + exp(-X %*% beta_true))  # Logistic function for probabilities
y <- rbinom(n, 1, y_prob)  # Binary outcome variable

Define log-likelihood and penalty functions

Code
# Step 2: Define log-likelihood and penalty functions
# Logistic regression log-likelihood
log_likelihood <- function(X, y, beta) {
  p <- 1 / (1 + exp(-X %*% beta))  # Predicted probabilities
  ll <- sum(y * log(p) + (1 - y) * log(1 - p))
  return(ll)
}

# L1 (Lasso) penalty
l1_penalty <- function(beta) {
  return(sum(abs(beta)))
}

# L2 (Ridge) penalty
l2_penalty <- function(beta) {
  return(sum(beta^2))
}

# Elastic Net penalty
elastic_net_penalty <- function(beta, alpha) {
  return(alpha * l1_penalty(beta) + (1 - alpha) * l2_penalty(beta))
}

Coordinate Descent Algorithm
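
Within each sweep, the update for coordinate \(j\) is a soft-thresholding step. Writing \(\rho_j = \sum_i x_{ij} r_{ij}\) for the correlation of feature \(j\) with its partial residual \(r_j\), and \(z_j = \sum_i x_{ij}^2\), the Lasso update is:

\[ \beta_j \leftarrow \frac{S(\rho_j, \lambda)}{z_j}, \qquad S(\rho, \lambda) = \operatorname{sign}(\rho)\,\max(|\rho| - \lambda, 0) \]

The ridge update instead divides \(\rho_j\) by \(z_j + \lambda\), and the elastic net update combines the two. Note that, for simplicity, the implementation below applies coordinate descent to the squared-error loss of the centered response (a linear approximation to the logistic loss) rather than using iteratively reweighted least squares.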

Code
#  Coordinate Descent Algorithm
coordinate_descent <- function(X, y, lambda, alpha, tol = 1e-6, max_iter = 1000) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)
  X <- scale(X)  # Standardize features
  y <- y - mean(y)  # Center target
  
  for (iter in 1:max_iter) {
    beta_old <- beta
    for (j in 1:p) {
      r_j <- y - X %*% beta + beta[j] * X[, j]
      z_j <- sum(X[, j]^2)
      rho_j <- sum(X[, j] * r_j)
      
      # Update beta based on the type of model (Ridge, Lasso, ElasticNet)
      beta[j] <- ifelse(alpha == 1, # Lasso
                        sign(rho_j) * max(abs(rho_j) - lambda, 0) / z_j,
                        ifelse(alpha == 0, # Ridge
                               rho_j / (z_j + lambda),
                               # Elastic Net
                               sign(rho_j) * max(abs(rho_j) - lambda * alpha, 0) / (z_j + lambda * (1 - alpha))))
    }
    if (sum(abs(beta - beta_old)) < tol) break
  }
  return(beta)
}
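
As a quick sanity check, the solver can be run once with an arbitrary, untuned penalty. A sketch using the simulated X and y from above:

Code
# Single Lasso fit at an arbitrary lambda (illustration only; not tuned)
beta_demo <- coordinate_descent(X, y, lambda = 0.1, alpha = 1)
round(beta_demo, 3)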

Cross-validation for hyperparameter tuning

Code
# Step 4: Cross-validation for hyperparameter tuning
cross_validate <- function(X, y, lambdas, alphas, k_folds = 5) {
  n <- nrow(X)
  fold_size <- floor(n / k_folds)
  
  best_lambda <- NULL
  best_alpha <- NULL
  best_accuracy <- -Inf
  
  # Loop over all combinations of lambda and alpha
  for (lambda in lambdas) {
    for (alpha in alphas) {
      accuracy <- 0
      for (i in 1:k_folds) {
        # Split data into training and validation
        test_idx <- ((i - 1) * fold_size + 1):(i * fold_size)
        X_train <- X[-test_idx, ]
        y_train <- y[-test_idx]
        X_test <- X[test_idx, ]
        y_test <- y[test_idx]
        
        # Fit model with current lambda and alpha
        beta <- coordinate_descent(X_train, y_train, lambda, alpha)
        
        # Make predictions
        p_test <- 1 / (1 + exp(-X_test %*% beta))
        y_pred <- ifelse(p_test > 0.5, 1, 0)
        
        # Calculate accuracy
        accuracy <- accuracy + mean(y_pred == y_test)
      }
      
      accuracy <- accuracy / k_folds  # Average accuracy across folds
      
      # Track best parameters
      if (accuracy > best_accuracy) {
        best_accuracy <- accuracy
        best_lambda <- lambda
        best_alpha <- alpha
      }
    }
  }
  
  return(list(lambda = best_lambda, alpha = best_alpha))
}

Run Regularized Logistic Regression with Cross-Validation

Run the regularized logistic regression model with cross-validation to select the best hyperparameters (alpha and lambda). The model is trained on the standardized predictor matrix X_scaled and the response variable y. The grid of hyperparameters includes three values for alpha (0, 0.5, and 1) and a sequence of lambda values from 0.001 to 1. The best alpha and lambda values are selected based on the highest average cross-validation accuracy.

Code
X_scaled <- scale(X)

# Step 5: Fit the models (Ridge, Lasso, Elastic Net) and tune hyperparameters
lambdas <- seq(0.001, 1, by = 0.01)  # Regularization strength
alphas <- c(0, 0.5, 1)  # Lasso (1), Ridge (0), Elastic Net (0.5)

# Cross-validation to find the best lambda and alpha
cv_results <- cross_validate(X_scaled, y, lambdas, alphas)
best_lambda <- cv_results$lambda
best_alpha <- cv_results$alpha

cat("Best lambda:", best_lambda, "Best alpha:", best_alpha, "\n")
Best lambda: 0.111 Best alpha: 1 

Final Model Fit with best alpha and lambda

The final model is fitted with the best alpha and lambda values selected by cross-validation. The model coefficients are displayed to show the impact of each predictor on the outcome.

Code
# Fit final model with best alpha and lambda
final_beta <- coordinate_descent(X_scaled, y, best_lambda, best_alpha)
cat("Final Coefficients:\n", final_beta, "\n")
Final Coefficients:
 0.2961323 -0.1916574 0.05642973 0 -0.04203506 0.007707405 0.01378486 0.007593126 -0.02713731 0.01367677 
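
Because the data are simulated, we can compare the estimated coefficients with the true ones. A sketch (the estimates are on the standardized scale, so the sparsity pattern and signs, rather than the magnitudes, are directly comparable):

Code
# Side-by-side comparison of true vs. estimated coefficients
round(cbind(true = beta_true, estimated = final_beta), 3)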

Fit Ridge, Lasso, and Elastic Net Models

Fit Ridge, Lasso, and Elastic Net models using the best lambda selected from cross-validation. Note that because cross-validation chose best_alpha = 1 here, the elastic net fit below coincides with the lasso fit (which explains their identical metrics later); a fixed intermediate value such as alpha = 0.5 would give a genuine elastic net. The models are then used to compute predicted probabilities for the response variable.

Code
# Fit Ridge, Lasso, and Elastic Net models
ridge_beta <- coordinate_descent(X_scaled, y, lambda = best_lambda, alpha = 0)
lasso_beta <- coordinate_descent(X_scaled, y, lambda = best_lambda, alpha = 1)
elastic_net_beta <- coordinate_descent(X_scaled, y, lambda = best_lambda, alpha = best_alpha)

Sigmoid function for prediction

Code
# Sigmoid (logistic) function: converts linear scores X %*% beta to probabilities
sigmoid <- function(beta, X) {
  pred <- 1 / (1 + exp(-X %*% beta))
  return(pred)
}
Code
# Prediction function for logistic regression
predict_logistic <- function(beta, X_new, threshold = 0.5) {
  # Calculate predicted probabilities using the logistic (sigmoid) function
  p <- 1 / (1 + exp(-X_new %*% beta))
  
  # Convert probabilities to binary class predictions
  predictions <- ifelse(p >= threshold, 1, 0)
  
  return(predictions)
}
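
For example, hard 0/1 class predictions from the ridge coefficients can be obtained as:

Code
# Class predictions from the ridge fit at the default 0.5 threshold
ridge_pred_class <- predict_logistic(ridge_beta, X_scaled)
head(ridge_pred_class)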
Code
# Compute predictions
ridge_pred_prob <- sigmoid(ridge_beta, X_scaled)
lasso_pred_prob <- sigmoid(lasso_beta, X_scaled)
elastic_net_pred_prob <- sigmoid(elastic_net_beta, X_scaled)

Function to Evaluate Model Performance

Evaluate the model performance using metrics such as accuracy, precision, recall, sensitivity, specificity, and F1 score. The confusion matrix is computed to show the true positive, true negative, false positive, and false negative predictions.

Code
# Function to compute confusion matrix and performance metrics
evaluate_model <- function(y_true, y_pred_prob, threshold = 0.5) {
  # Convert probabilities to binary predictions
  y_pred <- ifelse(y_pred_prob >= threshold, 1, 0)
  
  # Confusion matrix
  confusion <- table(Predicted = y_pred, Actual = y_true)
  
  # Metrics
  TP <- confusion[2, 2]
  TN <- confusion[1, 1]
  FP <- confusion[2, 1]
  FN <- confusion[1, 2]
  
  accuracy <- (TP + TN) / sum(confusion)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  sensitivity <- recall  # Sensitivity is the same as recall
  specificity <- TN / (TN + FP)
  F1 <- 2 * (precision * recall) / (precision + recall)
  
  metrics <- list(
    ConfusionMatrix = confusion,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    Sensitivity = sensitivity,
    Specificity = specificity,
    F1_Score = F1
  )
  
  return(metrics)
}

Evaluate Model Performance

Code
# Evaluate Ridge model
ridge_metrics <- evaluate_model(y, ridge_pred_prob)

# Evaluate Lasso model
lasso_metrics <- evaluate_model(y, lasso_pred_prob)

# Evaluate Elastic Net model
elastic_net_metrics <- evaluate_model(y, elastic_net_pred_prob)

Display Model Performance

Ridge Model Performance

Code
# Display results for Ridge
cat("Ridge Model Performance:\n")
Ridge Model Performance:
Code
print(ridge_metrics$ConfusionMatrix)
         Actual
Predicted  0  1
        0 87  9
        1 16 88
Code
cat(sprintf("Accuracy: %.3f, Precision: %.3f, Recall: %.3f, Sensitivity: %.3f, F1 Score: %.3f\n",
            ridge_metrics$Accuracy, ridge_metrics$Precision, ridge_metrics$Recall,
            ridge_metrics$Sensitivity, ridge_metrics$F1_Score))
Accuracy: 0.875, Precision: 0.846, Recall: 0.907, Sensitivity: 0.907, F1 Score: 0.876

Lasso Model Performance

Code
# Display results for Lasso
cat("\nLasso Model Performance:\n")

Lasso Model Performance:
Code
print(lasso_metrics$ConfusionMatrix)
         Actual
Predicted  0  1
        0 88  9
        1 15 88
Code
cat(sprintf("Accuracy: %.3f, Precision: %.3f, Recall: %.3f, Sensitivity: %.3f, F1 Score: %.3f\n",
            lasso_metrics$Accuracy, lasso_metrics$Precision, lasso_metrics$Recall,
            lasso_metrics$Sensitivity, lasso_metrics$F1_Score))
Accuracy: 0.880, Precision: 0.854, Recall: 0.907, Sensitivity: 0.907, F1 Score: 0.880

Elastic Net Model Performance

Code
# Display results for Elastic Net
cat("\nElastic Net Model Performance:\n")

Elastic Net Model Performance:
Code
print(elastic_net_metrics$ConfusionMatrix)
         Actual
Predicted  0  1
        0 88  9
        1 15 88
Code
cat(sprintf("Accuracy: %.3f, Precision: %.3f, Recall: %.3f, Sensitivity: %.3f, F1 Score: %.3f\n",
            elastic_net_metrics$Accuracy, elastic_net_metrics$Precision, elastic_net_metrics$Recall,
            elastic_net_metrics$Sensitivity, elastic_net_metrics$F1_Score))
Accuracy: 0.880, Precision: 0.854, Recall: 0.907, Sensitivity: 0.907, F1 Score: 0.880

Regularized GLM Regression Model in R

To fit a regularized logistic model in R, we will use the {glmnet} package. {glmnet} provides efficient functions for fitting generalized linear models (GLMs) with L1 (Lasso) and L2 (Ridge) regularization. The package is widely used for regression and classification tasks, especially with high-dimensional data or multicollinearity.

Install Required R Packages

The following R packages are required to run this notebook. If any of them are not installed, you can install them using the code below:

Code
#| warning: false
#| error: false

packages <- c('tidyverse',
              'plyr',
              'rstatix',
              'ggeffects',
              'patchwork',
              'metrica',
              'Metrics',
              'glmnet',
              'ROCR',
              'pROC')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load Packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:pROC"      "package:ROCR"      "package:glmnet"   
 [4] "package:Matrix"    "package:Metrics"   "package:metrica"  
 [7] "package:patchwork" "package:ggeffects" "package:rstatix"  
[10] "package:plyr"      "package:lubridate" "package:forcats"  
[13] "package:stringr"   "package:dplyr"     "package:purrr"    
[16] "package:readr"     "package:tidyr"     "package:tibble"   
[19] "package:ggplot2"   "package:tidyverse" "package:stats"    
[22] "package:graphics"  "package:grDevices" "package:utils"    
[25] "package:datasets"  "package:methods"   "package:base"     

Data

Our goal is to develop a regularized logistic model to predict the probability of paddy soil arsenic contamination (contaminated vs. non-contaminated) using selected irrigation water and soil properties. We have data on 263 paired groundwater and paddy soil samples from arsenic-contaminated areas in Tala Upazilla, Satkhira district, Bangladesh. These data were used in the publication “Factors Affecting Paddy Soil Arsenic Concentration in Bangladesh: Prediction and Uncertainty of Geostatistical Risk Mapping”, which can be accessed via this URL.

The full dataset is available for download from my Dropbox or my GitHub account.

We will use the read_csv() function from the readr package to import the data as a tibble.

Code
# Load data
mf<-readr::read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/bd_soil_arsenic.csv")
glimpse(mf)
Rows: 263
Columns: 29
$ ID              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ Longitude       <dbl> 89.1134, 89.1232, 89.1281, 89.1296, 89.1312, 89.1305, …
$ Latitude        <dbl> 22.7544, 22.7576, 22.7499, 22.7568, 22.7366, 22.7297, …
$ WAs             <dbl> 0.059, 0.059, 0.079, 0.122, 0.072, 0.042, 0.075, 0.064…
$ WP              <dbl> 0.761, 1.194, 1.317, 1.545, 0.966, 1.058, 0.868, 0.890…
$ WFe             <dbl> 3.44, 4.93, 9.70, 8.58, 4.78, 6.95, 7.81, 8.14, 8.99, …
$ WEc             <dbl> 1.03, 1.07, 1.40, 0.83, 1.42, 1.82, 1.71, 1.74, 1.57, …
$ WpH             <dbl> 7.03, 7.06, 6.84, 6.85, 6.95, 6.89, 6.86, 6.98, 6.82, …
$ WMg             <dbl> 33.9, 34.1, 40.5, 28.4, 43.4, 43.2, 50.1, 51.5, 48.2, …
$ WNa             <dbl> 69.4, 74.6, 89.4, 22.8, 93.0, 165.7, 110.5, 127.0, 96.…
$ WCa             <dbl> 99.1, 94.3, 133.9, 103.8, 130.5, 153.3, 172.0, 163.2, …
$ WK              <dbl> 5.3, 5.7, 6.0, 4.0, 5.3, 6.5, 5.9, 6.4, 6.5, 3.7, 4.4,…
$ WS              <dbl> 2.936, 2.826, 2.307, 1.012, 2.511, 2.764, 2.792, 1.562…
$ SAs             <dbl> 29.10, 45.10, 23.20, 23.80, 26.00, 25.60, 26.30, 31.60…
$ SPAs            <dbl> 9.890, 10.700, 5.869, 6.031, 6.627, 9.530, 6.708, 8.63…
$ SAoAs           <dbl> 13.280, 21.900, 12.266, 12.596, 13.805, 13.585, 13.969…
$ SAoFe           <dbl> 2500, 2670, 2160, 2500, 2060, 2500, 2520, 2140, 2150, …
$ SpH             <dbl> 7.74, 7.87, 8.03, 8.07, 7.81, 7.77, 7.66, 7.89, 8.00, …
$ SEc             <dbl> 1.128, 1.021, 1.257, 1.067, 1.354, 1.442, 1.428, 1.385…
$ SOC             <dbl> 1.66, 1.26, 1.36, 1.61, 1.26, 1.74, 1.71, 1.69, 1.41, …
$ SP              <dbl> 13.79, 15.31, 15.54, 16.28, 14.20, 13.41, 13.26, 14.84…
$ Sand            <dbl> 16.3, 11.1, 12.3, 12.7, 12.1, 16.7, 16.8, 13.7, 12.6, …
$ Silt            <dbl> 44.8, 48.7, 46.4, 43.6, 50.9, 43.6, 43.4, 40.8, 44.9, …
$ Clay            <dbl> 38.9, 40.2, 41.3, 43.7, 37.1, 39.8, 39.8, 45.5, 42.4, …
$ Elevation       <dbl> 3, 5, 4, 3, 5, 2, 2, 3, 3, 3, 2, 5, 6, 6, 5, 5, 4, 6, …
$ Year_Irrigation <dbl> 14, 20, 10, 8, 10, 9, 8, 10, 8, 2, 20, 4, 15, 10, 5, 4…
$ Distance_STW    <dbl> 5, 6, 5, 8, 5, 5, 10, 8, 10, 8, 5, 5, 9, 5, 10, 10, 12…
$ Land_type       <chr> "MHL", "MHL", "MHL", "MHL", "MHL", "MHL", "MHL", "MHL"…
$ Land_type_ID    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, …

Create a binary response variable

A binary class refers to a classification problem with only two possible outcomes. SAs is a continuous variable, so we convert it into a binary class (contaminated vs. non-contaminated) following our published paper, which uses exceedance of the upper baseline soil arsenic concentration (14.8 mg/kg) to label soil samples as either As-contaminated or non-contaminated.

Here we convert soil As (SAs) into two classes and create a new binary response variable (Class_As) with two levels, coded 1 = contaminated and 0 = non-contaminated in the code below:

1 = SAs > 14.8 mg As/kg

0 = SAs ≤ 14.8 mg As/kg

Code
mf$Class_As<- cut(mf$SAs, 
                   breaks=c(-Inf, 14.8, Inf), 
                   labels=c("0", "1"))
mf$Class_As<-as.factor(mf$Class_As)
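
A quick check of the resulting class balance (this simply tabulates the new variable):

Code
# Counts of non-contaminated (0) vs. contaminated (1) samples
table(mf$Class_As)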

Data Preparation

Data preparation is an essential step in the data analysis process. It involves cleaning, transforming, and organizing data to make it suitable for analysis. In this step, we select the relevant predictor variables and convert the categorical variables to factors:

Code
df <- mf |> 
  # select variables
  dplyr::select(WAs, WFe,  
                SAoFe, SOC, 
                Year_Irrigation, Distance_STW,
                Land_type, Class_As) |> 
  # convert to factors (funs() is deprecated; use across())
  dplyr::mutate(dplyr::across(c(Land_type, Class_As), as.factor)) |> 
  # normalize all numerical features (optional)
  # dplyr::mutate(dplyr::across(1:6, ~ (. - min(.)) / (max(.) - min(.)))) |> 
  glimpse()
Rows: 263
Columns: 8
$ WAs             <dbl> 0.059, 0.059, 0.079, 0.122, 0.072, 0.042, 0.075, 0.064…
$ WFe             <dbl> 3.44, 4.93, 9.70, 8.58, 4.78, 6.95, 7.81, 8.14, 8.99, …
$ SAoFe           <dbl> 2500, 2670, 2160, 2500, 2060, 2500, 2520, 2140, 2150, …
$ SOC             <dbl> 1.66, 1.26, 1.36, 1.61, 1.26, 1.74, 1.71, 1.69, 1.41, …
$ Year_Irrigation <dbl> 14, 20, 10, 8, 10, 9, 8, 10, 8, 2, 20, 4, 15, 10, 5, 4…
$ Distance_STW    <dbl> 5, 6, 5, 8, 5, 5, 10, 8, 10, 8, 5, 5, 9, 5, 10, 10, 12…
$ Land_type       <fct> MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL,…
$ Class_As        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, …

Split data into training and test

Split the data into training and test sets. We will use 70% of the data for training and 30% for testing. The ddply() function from the plyr package splits the data within each combination of Land_type and Class_As, so the split is stratified by land type and contamination class.

Code
seeds = 101  # seed passed to the split function below
tr_prop = 0.70
# training data (70% of data)
train = plyr::ddply(df, .(Land_type, Class_As),
                    function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = seeds)
# test data (remaining 30%)
test = plyr::ddply(df, .(Land_type, Class_As),
                   function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = seeds)
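
To confirm that the split is stratified, compare the class proportions in the two sets:

Code
# Class proportions should be similar in the training and test sets
prop.table(table(train$Class_As))
prop.table(table(test$Class_As))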

Create x and y

You need to create two objects:

  • y for storing the outcome variable
  • x for holding the predictor variables. This should be created using the model.matrix() function, which automatically transforms any qualitative variables into dummy variables. This matters because glmnet() can only take numerical, quantitative inputs. After creating the model matrix, we remove the intercept column at index 1.
Code
# Predictor variables
x.train <- model.matrix(Class_As~., train)[,-1]
# Outcome variable
y.train <-train$Class_As

Note: use model.matrix() (or data.matrix()) when you get “error in evaluating the argument ‘x’ in selecting a method for function ‘as.matrix’”; for some reason glmnet prefers data.matrix() to as.matrix().

Ridge Regression

Ridge regression is a type of linear regression that uses L2 regularization to shrink the regression coefficients towards zero. This helps to prevent overfitting and improve the model’s generalization performance. The glmnet package in R provides efficient functions for fitting ridge regression models.

Cross-validation for the best lambda

Now we can apply the cv.glmnet() function for cross-validation to choose the best lambda (regularization parameter). Here we designate \(α = 0\) for ridge regression and specify nlambda = 200, which means the model is fit along a path of 200 \(λ\) values.

Code
# cross validation
ridge.cv <- cv.glmnet(x = x.train,
                      y = y.train, 
                      type.measure = "deviance", 
                      nfolds = 5,
                      alpha = 0, 
                      family = "binomial", 
                      nlambda = 200, 
                      standardize = TRUE)

Printing the resulting object gives some basic information on the cross-validation performed:

Code
print(ridge.cv)

Call:  cv.glmnet(x = x.train, y = y.train, type.measure = "deviance",      nfolds = 5, alpha = 0, family = "binomial", nlambda = 200,      standardize = TRUE) 

Measure: Binomial Deviance 

     Lambda Index Measure      SE Nonzero
min 0.02984   192  0.8800 0.04925       7
1se 0.15079   157  0.9266 0.03374       7

We can plot the ridge.cv object to see how each tested lambda value performed:

Code
plot(ridge.cv)

The plot shows the cross-validation error as a function of the logarithm of lambda. The left dashed vertical line marks the value of log(lambda) that minimizes the prediction error (about -3.5 here); this lambda value will provide the most accurate model. The exact value of lambda can be viewed as follows:

Code
ridge.cv$lambda.min
[1] 0.02984292

Generally, the purpose of regularization is to balance accuracy and simplicity. This means, a model with the smallest number of predictors that also gives a good accuracy. To this end, the function cv.glmnet() finds also the value of lambda that gives the simplest model but also lies within one standard error of the optimal value of lambda. This value is called lambda.1se.
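
It can be extracted the same way as lambda.min:

Code
ridge.cv$lambda.1se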

Fit ridge regression

Now fit the final model with the best lambda:

Code
# fit ridge regression 
ridge.fit <-  glmnet(x= x.train,
                       y= y.train, 
                       alpha=0, 
                       lambda = ridge.cv$lambda.1se,
                       family = "binomial",
                       nlambda=200,
                       standardize = TRUE)
Code
# Display regression coefficients
coef(ridge.fit )
8 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)     -2.507157e+00
WAs              4.392387e+00
WFe              1.528503e-01
SAoFe            4.593939e-06
SOC              6.601765e-01
Year_Irrigation  1.135548e-01
Distance_STW    -2.826980e-02
Land_typeMHL     7.419644e-01

Prediction on test data

Prediction at this stage is done on the test data. We will use the predict() function to make predictions on the test data. The type="response" argument is used to get the predicted probabilities. We will also create a new column Class_Prob_50 to store the predicted class based on a threshold of 0.5.

Code
# Make predictions on the test data
x.test <- model.matrix(Class_As~., test)[,-1]
# Outcome variable
y.test <- test$Class_As

# Prediction
ridge.pred<-as.data.frame(test$Class_As)
ridge.pred$Class_Prob<-ridge.fit  |> 
      predict(newx = x.test, type="response")
ridge.pred$Class_Prob_50<-ifelse(ridge.pred$Class_Prob > 0.5,"1","0") 
ridge.pred <- ridge.pred |> 
              dplyr::select("test$Class_As", "Class_Prob_50") |> 
              dplyr::rename(Obs_Class = "test$Class_As") |> 
              dplyr::rename(Pred_Class = "Class_Prob_50") 
glimpse(ridge.pred)
Rows: 80
Columns: 2
$ Obs_Class  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Pred_Class <chr[,1]> "1", "0", "0", "0", "0", "0", "1", "1", "0", "1", "0", …
Code
# ordering the levels
ridge.pred$Obs_Class <- ordered(ridge.pred$Obs_Class, levels = c("1", "0"))
ridge.pred$Pred_Class <- ordered(ridge.pred$Pred_Class, levels = c("1", "0"))

Model performance

The following function evaluates model performance using a confusion matrix together with accuracy, precision, recall, specificity, and F1 score:

Code
evaluate_classification <- function(observed, predicted) {
  # Convert inputs to factors if not already
  observed <- as.factor(observed)
  predicted <- as.factor(predicted)
  
  # Create a confusion matrix
  confusion_matrix <- table(Predicted = predicted, Actual = observed)
  
  # Extract values from confusion matrix
  TP <- confusion_matrix["1", "1"] # True Positives
  TN <- confusion_matrix["0", "0"] # True Negatives
  FP <- confusion_matrix["1", "0"] # False Positives
  FN <- confusion_matrix["0", "1"] # False Negatives
  
  # Handle cases where confusion matrix values may not exist
  TP <- ifelse(is.na(TP), 0, TP)
  TN <- ifelse(is.na(TN), 0, TN)
  FP <- ifelse(is.na(FP), 0, FP)
  FN <- ifelse(is.na(FN), 0, FN)
  
  # Compute metrics
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- ifelse((TP + FP) > 0, TP / (TP + FP), NA)
  recall <- ifelse((TP + FN) > 0, TP / (TP + FN), NA)  # Also called sensitivity
  specificity <- ifelse((TN + FP) > 0, TN / (TN + FP), NA)
  f1_score <- ifelse(!is.na(precision) & !is.na(recall) & (precision + recall) > 0,
                     2 * (precision * recall) / (precision + recall), NA)
  
  # Return results as a list
  metrics <- list(
    ConfusionMatrix = confusion_matrix,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    Specificity = specificity,
    F1_Score = f1_score
  )
  
  return(metrics)
}
Code
evaluate_classification(ridge.pred$Obs_Class, ridge.pred$Pred_Class)
$ConfusionMatrix
         Actual
Predicted  1  0
        1 45 13
        0  7 15

$Accuracy
[1] 0.75

$Precision
[1] 0.7758621

$Recall
[1] 0.8653846

$Specificity
[1] 0.5357143

$F1_Score
[1] 0.8181818

ROC Curve

ROC curve is a graphical representation of the true positive rate (sensitivity) against the false positive rate (1-specificity). It shows the trade-off between sensitivity and specificity for different threshold values. The area under the ROC curve (AUC) is a measure of the model’s performance. A model with an AUC of 1 is perfect, while a model with an AUC of 0.5 is no better than random guessing.

Code
# Compute roc
library(pROC)
res.roc <- pROC::roc(ridge.pred$Obs_Class, ridge.pred$Pred_Class)
plot.roc(res.roc,
         print.auc = TRUE, 
         main= "Ridge Logistic Model")

LASSO regression

LASSO regression is a type of linear regression that uses L1 regularization to shrink the regression coefficients towards zero. This helps to prevent overfitting and improve the model’s generalization performance. The glmnet package in R provides efficient functions for fitting LASSO regression models.

Cross-validation for the best Lasso regression

We can set up our model very similarly to our ridge fit. We simply need to change the alpha argument to 1 and do K-fold cross-validation:

Code
# cross validation
lasso.cv <- cv.glmnet(x = x.train,
                      y = y.train, 
                      type.measure = "mse", 
                      nfolds = 5,
                      alpha = 1, 
                      family = "binomial", 
                      nlambda = 200,
                      standardize = TRUE)
Code
print(lasso.cv)

Call:  cv.glmnet(x = x.train, y = y.train, type.measure = "mse", nfolds = 5,      alpha = 1, family = "binomial", nlambda = 200, standardize = TRUE) 

Measure: Mean-Squared Error 

     Lambda Index Measure      SE Nonzero
min 0.00508    81  0.2783 0.03473       7
1se 0.05906    28  0.3117 0.01924       6
Code
plot(lasso.cv)

Code
# Choose the lambda within one standard error of the minimum
lasso.cv$lambda.1se
[1] 0.05906412

Fit a lasso regression

Now fit the final model with the best lambda:

Code
# fit lasso regression 
lasso.fit <-  glmnet(x= x.train,
                       y= y.train, 
                       alpha=1, 
                       lambda = lasso.cv$lambda.1se,
                       family="binomial", 
                       nlambda=200,
                       standardize = TRUE)
Code
print(lasso.fit)

Call:  glmnet(x = x.train, y = y.train, family = "binomial", alpha = 1,      nlambda = 200, lambda = lasso.cv$lambda.1se, standardize = TRUE) 

  Df  %Dev  Lambda
1  6 30.83 0.05906

Finally, we extract the coefficients for the selected \(λ\) using the coef() function:

Code
coef(lasso.fit)
8 x 1 sparse Matrix of class "dgCMatrix"
                          s0
(Intercept)     -2.115990618
WAs              4.066476107
WFe              0.125839634
SAoFe            .          
SOC              0.111318058
Year_Irrigation  0.146725917
Distance_STW    -0.002809843
Land_typeMHL     0.870813521

Prediction on test data

Code
# Make predictions on the test data
x.test <- model.matrix(Class_As~., test)[,-1]
# Outcome variable
y.test <- test$Class_As

# Prediction
lasso.pred<-as.data.frame(test$Class_As)
lasso.pred$Class_Prob<-lasso.fit  |> 
      predict(newx = x.test, type="response")
lasso.pred$Class_Prob_50<-ifelse(lasso.pred$Class_Prob > 0.5,"1","0") 
lasso.pred <- lasso.pred |> 
              dplyr::select("test$Class_As", "Class_Prob_50") |> 
              dplyr::rename(Obs_Class = "test$Class_As") |> 
              dplyr::rename(Pred_Class = "Class_Prob_50") 
glimpse(lasso.pred)
Rows: 80
Columns: 2
$ Obs_Class  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Pred_Class <chr[,1]> "1", "0", "0", "0", "0", "0", "1", "0", "0", "1", "0", …
Code
# ordering the levels
lasso.pred$Obs_Class <- ordered(lasso.pred$Obs_Class, levels = c("1", "0"))
lasso.pred$Pred_Class <- ordered(lasso.pred$Pred_Class, levels = c("1", "0"))

Model performance

Code
evaluate_classification(lasso.pred$Obs_Class, lasso.pred$Pred_Class)
$ConfusionMatrix
         Actual
Predicted  1  0
        1 46 12
        0  6 16

$Accuracy
[1] 0.775

$Precision
[1] 0.7931034

$Recall
[1] 0.8846154

$Specificity
[1] 0.5714286

$F1_Score
[1] 0.8363636

ROC Curve

Code
# Compute roc
library(pROC)
res.roc <- pROC::roc(lasso.pred$Obs_Class, lasso.pred$Pred_Class)
plot.roc(res.roc,
         print.auc = TRUE, 
         main= "Lasso Logistic Model")

Elastic Net Regression

Elastic Net regression is a combination of L1 (LASSO) and L2 (Ridge) regularization. It provides a balance between feature selection and coefficient shrinkage. The glmnet package in R provides efficient functions for fitting Elastic Net regression models.

Cross-validation for the best Elastic Net regression

We can set up our model very similarly to our ridge fit. We simply need to set the alpha argument to a value between 0 and 1 and do K-fold cross-validation. Here we run a grid search over both alpha and lambda:

Code
#| message: false
# Define hyperparameter grid
alphas <- seq(0, 1, by = 0.1)                   # Grid for alpha
lambda_seq <- 10^seq(3, -3, length = 100)       # Grid for lambda

# To store results
results <- data.frame(alpha = numeric(), lambda = numeric(), error = numeric())

# Perform grid search
for (a in alphas) {
  # Fit glmnet model for each alpha
  cv_fit <- cv.glmnet(x.train, y.train, family = "binomial", alpha = a, lambda = lambda_seq, type.measure = "class")
  
  # Extract best lambda and error for the current alpha
  best_lambda <- cv_fit$lambda.min
  best_error <- min(cv_fit$cvm)  # Minimum cross-validation error
  
  # Store results
  results <- rbind(results, data.frame(alpha = a, lambda = best_lambda, error = best_error))
}

# Print results
print(results)
   alpha     lambda     error
1    0.0 0.07564633 0.2021858
2    0.1 0.06579332 0.1803279
3    0.2 0.03274549 0.1857923
4    0.3 0.02848036 0.1803279
5    0.4 0.04328761 0.1803279
6    0.5 0.05722368 0.1912568
7    0.6 0.02848036 0.1857923
8    0.7 0.02477076 0.1803279
9    0.8 0.02848036 0.1803279
10   0.9 0.04328761 0.1803279
11   1.0 0.02848036 0.1857923
Code
best_params <- results[which.min(results$error), ]
print(best_params)
  alpha     lambda     error
2   0.1 0.06579332 0.1803279
Code
# Refit the final model with optimal parameters
enet.fit <- glmnet(x.train, y.train, family = "binomial", alpha = best_params$alpha, lambda = best_params$lambda)

# View coefficients
print(coef(enet.fit))
8 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)     -3.255323e+00
WAs              6.046513e+00
WFe              1.964624e-01
SAoFe            5.624383e-06
SOC              6.730575e-01
Year_Irrigation  1.585023e-01
Distance_STW    -3.348883e-02
Land_typeMHL     8.921267e-01

Prediction on test data

As before, predictions are made on the test data with predict() using type="response", and Class_Prob_50 stores the predicted class at a 0.5 threshold.

Code
# Make predictions on the test data
x.test <- model.matrix(Class_As~., test)[,-1]
# Outcome variable
y.test <- test$Class_As

# Prediction
enet.pred<-as.data.frame(test$Class_As)
enet.pred$Class_Prob<-enet.fit  |> 
      predict(newx = x.test, type="response")
enet.pred$Class_Prob_50<-ifelse(enet.pred$Class_Prob > 0.5,"1","0") 
enet.pred <- enet.pred |> 
              dplyr::select("test$Class_As", "Class_Prob_50") |> 
              dplyr::rename(Obs_Class = "test$Class_As") |> 
              dplyr::rename(Pred_Class = "Class_Prob_50") 
glimpse(enet.pred)
Rows: 80
Columns: 2
$ Obs_Class  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Pred_Class <chr[,1]> "1", "0", "0", "0", "0", "0", "1", "1", "0", "1", "0", …
Code
# ordering the levels
enet.pred$Obs_Class <- ordered(enet.pred$Obs_Class, levels = c("1", "0"))
enet.pred$Pred_Class <- ordered(enet.pred$Pred_Class, levels = c("1", "0"))

Model performance

Code
evaluate_classification(enet.pred$Obs_Class, enet.pred$Pred_Class)
$ConfusionMatrix
         Actual
Predicted  1  0
        1 44 12
        0  8 16

$Accuracy
[1] 0.75

$Precision
[1] 0.7857143

$Recall
[1] 0.8461538

$Specificity
[1] 0.5714286

$F1_Score
[1] 0.8148148

ROC Curve

As before, we plot the ROC curve and AUC, this time for the elastic net model.

Code
# Compute roc
library(pROC)
enet.roc <- pROC::roc(enet.pred$Obs_Class, enet.pred$Pred_Class)
plot.roc(enet.roc,
         print.auc = TRUE, 
         main= "Elastic Net Logistic Model")

Summary and Conclusion

In this tutorial, we explored regularized logistic regression using the {glmnet} package in R, which is useful for building robust classifiers for binary classification tasks with regularization. Here, we focused on lasso and ridge models, using alpha = 1 and alpha = 0, respectively. For elastic net regression, you need to choose an alpha value between 0 and 1; this can be tuned via grid search with the caret or h2o packages (please see the Machine Learning chapter).
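
A minimal sketch of such a grid search with caret, assuming the caret package is installed and reusing the x.train matrix and y.train factor from above (the tuning grid values are illustrative, not definitive):

Code
library(caret)

# 5-fold CV grid search over alpha and lambda for glmnet
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(alpha = seq(0, 1, by = 0.25),
                    lambda = 10^seq(-3, 0, length = 20))
enet_tuned <- train(x = x.train, y = y.train,
                    method = "glmnet",
                    family = "binomial",
                    trControl = ctrl,
                    tuneGrid = grid)
enet_tuned$bestTune  # best alpha/lambda combination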
