3. Probit Regression

Probit regression model is a statistical technique used to model binary response variables. It estimates the probability of an event occurring based on various predictor variables. This probability is represented using the cumulative distribution function (CDF) of a normal distribution, making probit regression a solid alternative to logistic regression when the normality assumptions are met. In this tutorial, we will provide an overview of the Probit Model. We will walk through a manual fitting process using maximum likelihood estimation (MLE) to understand the mechanics behind the model. Then, we will demonstrate how to fit a probit model in R using the glm() function. We’ll also cover how to interpret the model outputs, evaluate model performance, and visualize the results using different R packages. By the end of this tutorial, you will have a solid understanding of the probit model and how to apply it in R for your research or projects. Let’s get started!

Overview

Probit regression is a statistical technique that is used for analyzing binary or dichotomous outcomes, where the dependent variable can take on only two possible values, usually coded as 0 and 1. This technique is similar to logistic regression in that it models the probability of the dependent variable being one of the two possible outcomes.

In probit regression, the relationship between predictor variables and the probability of the outcome is modeled using the cumulative distribution function (CDF) of the standard normal distribution, which is known as the probit function. The probit function maps any real-valued input to a probability between 0 and 1, and it is used to model the probability that a binary outcome will be positive (or negative) given the values of the predictor variables.

When using probit regression, the coefficients of the model represent the change in the probability of the outcome associated with a unit change in the predictor variable. These coefficients are estimated using maximum likelihood estimation, which is a statistical technique used for estimating the parameters of a probability distribution by maximizing a likelihood function.

Model Specification:
- Assume \(Y \sim \text{Bernoulli}(p)\), where \(p\) is the probability that \(Y = 1\).
- The probit model uses the standard normal cumulative distribution function (CDF)\(\Phi\) to model \(p\) as:
  
  \[ p = \Phi(X\beta)\]
- This means: \[ P(Y = 1 | X) = \int_{-\infty}^{X\beta} \frac{1}{\sqrt{2 \pi}} e^{-\frac{t^2}{2}} \, dt \]
Link Function:
- The probit link function is \(g(p) = \Phi^{-1}(p)\), where \(\Phi\) is the CDF of the standard normal distribution.
Parameter Estimation:
- Maximum likelihood estimation (MLE) is also used here.
- The log-likelihood function for the probit model is:
  
  \[ \ln \mathcal{L}(\beta) = \sum_{i=1}^{n} \left( Y_i \ln(\Phi(X_i\beta)) + (1 - Y_i) \ln(1 - \Phi(X_i\beta)) \right) \]
- Numerical optimization methods are required to maximize the log-likelihood and estimate \(\beta\).
Prediction:
- For a new observation \(X\), predict the probability of \(Y = 1\)) as:
  
  \[ p = \Phi(X\beta) \]
Interpretation:
- Coefficients \(\beta_j\) represent the change in the z-score (from the standard normal distribution) associated with a one-unit increase in \(X_j\). Risk ratios are exponential of Coefficients \(\beta_j\).

Difference between Logistic and Probit Models

Probit regression and logistic regression are both used for modeling binary outcomes, but they differ in the way they model the relationship between the predictors and the probability of the outcome. Here are the key differences between probit and logistic regression:

Link function:

In probit regression, the link function is the cumulative distribution function (CDF) of the standard normal distribution, also known as the probit function. This method models the linear combination of predictors as the inverse of the CDF of a standard normal distribution. On the other hand, logistic regression uses the logistic function, also known as the sigmoid function (S-shaped), as the link function. This method models the log-odds of the outcome being in one of the categories as a linear combination of predictors.

Interpretation of coefficients:

In logistic regression, the coefficients represent the change in the log-odds of the outcome for a one-unit change in the predictor, assuming other predictors remain constant. This makes it straightforward to interpret the impact of each predictor variable on the outcome in terms of odds ratios or probabilities. For example, if the coefficient for a predictor is 0.5, it means that a one-unit increase in that predictor is associated with a 50% increase in the odds of the outcome.

On the other hand, in probit regression, coefficients are interpreted in terms of standard deviations of the latent variable, which is the unobservable variable underlying the binary outcome. While this interpretation might be less intuitive for some users, it still provides valuable information about the relationship between predictors and the probability of the outcome. However, converting these coefficients into easily interpretable odds ratios or probabilities is not as straightforward as in logistic regression.

Overall, while logistic regression coefficients offer more direct interpretability in terms of odds ratios or probabilities, probit regression coefficients still provide valuable insights into the relationship between predictors and the probability of the outcome, albeit with a slightly different interpretation approach. The choice between logistic and probit regression often depends on the specific needs of the analysis and the preferences of the researcher.

interpretation:

Logistic regression coefficients are more straightforward to interpret as they indicate changes in log-odds, which can be easily converted into odds ratios or probabilities. On the other hand, probit regression coefficients are interpreted in terms of standard deviations of the latent variable, which may be less intuitive for some users.

Assumptions:

Probit regression assumes that the errors follow a normal distribution, while logistic regression assumes that the errors follow a logistic distribution. Both models require that the relationship between the predictors and the outcome is linear on the logit or probit scale. This means that the effect of a predictor on the outcome is proportional to the logarithm of the odds ratio or the cumulative distribution function of the standard normal distribution.

Applications:

Probit regression is commonly used in economics and social sciences, while logistic regression is widely used in various fields, including medicine, biology, social sciences, and machine learning, due to its simplicity and ease of interpretation.

In practice, both probit and logistic regression can be suitable for modeling binary outcomes, and the choice between them often depends on the specific requirements of the analysis and the preferences of the researcher.

Probit Model from Scratch

Building a Probit model from scratch in R involves generating synthetic data, implementing the model fitting process, creating a summary table for model parameters, performing cross-validation, and evaluating model performance. Here’s a step-by-step guide on how to accomplish this in R without using any external packages.

Data Generation

Let’s start by generating synthetic data with four covariates and a binary outcome.

Code

# Set random seed for reproducibility
set.seed(123)

# Generate data with 1000 observations
n <- 1000

# Generate four covariates
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
X4 <- rnorm(n)

# True coefficients for the model
beta_true <- c(0.5, -1.0, 0.8, -0.6, 1.2)  # Intercept and 4 coefficients

# Calculate the latent variable (linear combination of predictors and coefficients)
latent_variable <- beta_true[1] + beta_true[2]*X1 + beta_true[3]*X2 + beta_true[4]*X3 + beta_true[5]*X4

# Generate binary outcome Y based on the latent variable, with some noise
Y <- ifelse(latent_variable + rnorm(n) > 0, 1, 0)

# Combine data into a data frame
data <- data.frame(Y, X1, X2, X3, X4)

Model Fitting

Now, we define a function to fit the Probit model using maximum likelihood estimation. We use the cumulative distribution function (CDF) of the normal distribution to transform the linear predictor into probabilities, then maximize the log-likelihood function.

Code

# Define the log-likelihood function for the Probit model
log_likelihood <- function(beta, Y, X) {
  # Calculate linear predictor
  linear_pred <- X %*% beta
  # Get the predicted probabilities using the normal CDF
  p <- pnorm(linear_pred)
  # Calculate the log-likelihood
  ll <- sum(Y * log(p) + (1 - Y) * log(1 - p))
  return(-ll)  # Return the negative log-likelihood for minimization
}

# Function to fit the Probit model
fit_probit <- function(Y, X) {
  # Add an intercept column to the predictors matrix
  X <- as.matrix(cbind(1, X))
  # Initial parameter values for optimization
  initial_beta <- rep(0, ncol(X))
  # Perform optimization to maximize the log-likelihood
  fit <- optim(initial_beta, log_likelihood, Y = Y, X = X, method = "BFGS", hessian = TRUE)
  return(list(coefficients = fit$par, hessian = fit$hessian, convergence = fit$convergence))
}

# Fit the Probit model to the generated data
fit <- fit_probit(data$Y, data[, -1])  # Exclude Y column for predictors

Summary Statistic Table

We use the fitted model to calculate standard errors, z-values, and p-values for each coefficient.

Code

# Function to create a summary table for the model
create_summary_table <- function(fit) {
  # Extract estimated coefficients
  beta_hat <- fit$coefficients
  
  # Calculate covariance matrix (inverse of negative Hessian)
  hessian <- fit$hessian
  cov_matrix <- solve(-hessian)
  
  # Calculate standard errors from the covariance matrix
  standard_errors <- sqrt(diag(cov_matrix))
  
  # Calculate z-values and p-values
  z_values <- beta_hat / standard_errors
  p_values <- 2 * (1 - pnorm(abs(z_values)))
  
  # Create the summary table
  summary_table <- data.frame(
    Coefficient = beta_hat,
    Std.Error = standard_errors,
    z.value = z_values,
    p.value = p_values
  )
  
  # Label the rows
  rownames(summary_table) <- c("Intercept", "X1", "X2", "X3", "X4")
  
  return(summary_table)
}

# Generate summary table
summary_table <- create_summary_table(fit)

Warning in sqrt(diag(cov_matrix)): NaNs produced

Code

print(summary_table)

          Coefficient Std.Error z.value p.value
Intercept   0.4950479       NaN     NaN     NaN
X1         -1.0312880       NaN     NaN     NaN
X2          1.0020371       NaN     NaN     NaN
X3         -0.6179513       NaN     NaN     NaN
X4          1.2137153       NaN     NaN     NaN

Risk Ratios

In a probit model, the risk ratio isn’t calculated directly in the same way as it would be for a linear model or a log-binomial model. This is because the probit model uses the cumulative distribution function of a standard normal distribution (Φ) to estimate the probability of an outcome, based on the relationship:

\[ P(Y=1∣X)=Φ(Xβ)\]

where \(X\) is your predictor, then calculate:

\[ P(Y=1∣X=1)=Φ(β0+β1)\]

and

\[ P(Y=1∣X=0)=Φ(β0) \]
Approximate the Risk Ratio: Compute the ratio of these two predicted probabilities:

\[ text{Approximate Risk Ratio} = \frac{P(Y = 1 | X = 1)}{P(Y = 1 | X = 0)} \]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution, \(X\) represents the predictors, and \(\beta\) represents the coefficients

Code

# Extract coefficients
coeffecients<- summary_table$Coefficient   
beta_0<-coeffecients[1] # Intercept
beta_1<-coeffecients[2]  # Coefficient for X1

# Calculate predicted probabilities at X = 1 and X = 0
p_X1 <- pnorm(beta_0 + beta_1)  # Probability when X = 1
p_X0 <- pnorm(beta_0)           # Probability when X = 0

# Calculate approximate risk ratio
approx_risk_ratio <- p_X1 / p_X0
cat("Approximate Risk Ratio:", approx_risk_ratio, "\n")

Approximate Risk Ratio: 0.4290113

Cross-Validation

We implement k-fold cross-validation (using k=5 as an example) to evaluate the model’s performance.

Code

# Define a function for k-fold cross-validation
cross_validation_probit <- function(data, k = 5) {
  n <- nrow(data)
  # Randomly shuffle and split data into k folds
  fold_indices <- sample(rep(1:k, length.out = n))
  accuracies <- numeric(k)  # Store accuracy for each fold
  
  for (i in 1:k) {
    # Split data into training and testing sets
    train_data <- data[fold_indices != i, ]
    test_data <- data[fold_indices == i, ]
    
    # Fit the model on training data
    fit <- fit_probit(train_data$Y, train_data[, -1])
    beta_hat <- fit$coefficients  # Access estimated coefficients
    
    # Make predictions on the test set
    X_test <- as.matrix(cbind(1, test_data[, -1]))  # Add intercept
    linear_pred <- X_test %*% beta_hat
    probabilities <- pnorm(linear_pred)  # Predicted probabilities
    predictions <- ifelse(probabilities > 0.5, 1, 0)  # Binary predictions
    
    # Calculate accuracy
    accuracies[i] <- mean(predictions == test_data$Y)
  }
  
  # Return the average accuracy across all folds
  return(mean(accuracies))
}

# Perform 5-fold cross-validation
set.seed(456)  # For reproducibility
cv_accuracy <- cross_validation_probit(data, k = 5)
cat("Average Accuracy from 5-Fold Cross-Validation:", cv_accuracy, "\n")

Average Accuracy from 5-Fold Cross-Validation: 0.851

Model Evaluation

Finally, we evaluate the fitted model on the entire dataset, calculating the accuracy, sensitivity, and specificity.

Code

# Model evaluation on the entire dataset
evaluate_probit_model <- function(Y, X, beta_hat) {
  # Make predictions on the dataset
  X <- as.matrix(cbind(1, X))  # Add intercept
  linear_pred <- X %*% beta_hat
  probabilities <- pnorm(linear_pred)  # Get predicted probabilities
  predictions <- ifelse(probabilities > 0.5, 1, 0)  # Convert to binary outcome
  
  # Calculate accuracy
  accuracy <- mean(predictions == Y)
  
  # Calculate sensitivity and specificity
  true_positives <- sum(predictions == 1 & Y == 1)
  true_negatives <- sum(predictions == 0 & Y == 0)
  false_positives <- sum(predictions == 1 & Y == 0)
  false_negatives <- sum(predictions == 0 & Y == 1)
  
  sensitivity <- true_positives / (true_positives + false_negatives)
  specificity <- true_negatives / (true_negatives + false_positives)
  
  # Create an evaluation table
  evaluation <- data.frame(
    Metric = c("Accuracy", "Sensitivity", "Specificity"),
    Value = c(accuracy, sensitivity, specificity)
  )
  
  return(evaluation)
}

# Evaluate the model on the entire dataset
evaluation_results <- evaluate_probit_model(data$Y, data[, -1], fit$coefficients)
print(evaluation_results)

       Metric     Value
1    Accuracy 0.8510000
2 Sensitivity 0.8870432
3 Specificity 0.7964824

Probit Model in R

If you want to run a probit regression in R, it’s just like running a logistic regression. But instead of log-odds, your outcomes will be in z-score units. However, z-score units are not easy to understand and may not be useful for the readers. In probit regression, you cannot use an odds ratio because we are not working with odds or log-odds outcomes anymore. Instead, we rely on probability to interpret the results. If you want to understand how R calculates probability from z-score units, you need to know your z-score tables and normal distribution functions. Calculating probability from a normal distribution is complicated, so we use approximation, which is what z-tables are for. In practice, we use the ggpredict() and margins() functions to predict probabilities and marginal effects, respectively, from our coefficients.

Install Required R Packages

Before running the code, make sure you have the following R packages installed: tidyverse, plyr, rstaix, DataExplorer, patchwork, RColorBrewer, corrplot, ggstatsplot, sjPlot, gtsummary, jtools, report, performance, margins, ggeffects and marginaleffects. You can install these packages using the following code:

Code

# Packages to install
packages <- c(
  "tidyverse", 
  "plyr",
  "rstatix",
  "DataExplorer",
  "patchwork", 
  "RColorBrewer",
  "DataExplorer",
  "corrplot",
  "ggstatsplot",
  "sjPlot", 
  "gtsummary", 
  "jtools", 
  "report",
  "performance",
  "margins",
  "marginaleffects", 
  "ggeffects"
)

#| warning: false
#| error: false

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R Oackages

Now, let’s load the required R packages for the Probit model analysis.

Code

# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Code

# Check loaded packages
cat("Successfully loaded packages:\n")

Successfully loaded packages:

Code

print(search()[grepl("package:", search())])

 [1] "package:ggeffects"       "package:marginaleffects"
 [3] "package:margins"         "package:performance"    
 [5] "package:report"          "package:jtools"         
 [7] "package:gtsummary"       "package:sjPlot"         
 [9] "package:ggstatsplot"     "package:corrplot"       
[11] "package:RColorBrewer"    "package:patchwork"      
[13] "package:DataExplorer"    "package:rstatix"        
[15] "package:plyr"            "package:lubridate"      
[17] "package:forcats"         "package:stringr"        
[19] "package:dplyr"           "package:purrr"          
[21] "package:readr"           "package:tidyr"          
[23] "package:tibble"          "package:ggplot2"        
[25] "package:tidyverse"       "package:stats"          
[27] "package:graphics"        "package:grDevices"      
[29] "package:utils"           "package:datasets"       
[31] "package:methods"         "package:base"

Data

In this exercise we will use Pima Indians Diabetes Database. This dataset was obtained from the National Institute of Diabetes and Digestive and Kidney Diseases. The goal of this dataset is to predict whether or not a patient has diabetes, using specific diagnostic measurements included in the data. Some limitations were put in place while selecting instances from a larger database. Specifically, all the patients in this dataset are female, at least 21 years old, and of Pima Indian ancestry.

Code

mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/puma_diabetes.csv") |> 
  glimpse()

Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Preg, Gluc, BP, Skin, Insulin, BMI, DPF, Age, Outcome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 768
Columns: 9
$ Preg    <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1,…
$ Gluc    <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, …
$ BP      <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, …
$ Skin    <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0,…
$ Insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230,…
$ BMI     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.…
$ DPF     <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158,…
$ Age     <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32…
$ Outcome <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,…

Preg - Number of times pregnant Numeric [0, 17]
Gluc - Plasma glucose concentration at 2 Hours in an oral glucose tolerance test (GTIT) Numeric [0, 199]
BP - Diastolic Blood Pressure (mm Hg) Numeric [0, 122]
Skin - Triceps skin fold thickness (mm) Numeric [0, 99]
Insulin - 2-Hour Serum insulin (µh/ml) Numeric [0, 846]
BMI - Body mass index [weight in kg/(Height in m)] Numeric [0, 67.1]
DPF - Diabetes pedigree function Numeric [0.078, 2.42]
Age - Age (years) Numeric [21, 81].
Outcome - Binary value indicating non-diabetic /diabetic Factor [0,1]

Note

Source:Chang V, Bailey J, Xu QA, Sun Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput Appl. 2022 Mar 24:1-17. doi: 10.1007/s00521-022-07049-z. Epub ahead of print. PMID: 35345556; PMCID: PMC8943493.

Plot basic data information

plot_intro() function of {DataExplorer} package plot basic information (from introduce) the data.

Code

mf  |>
   plot_intro()

Summary Statistics

Code

mf |> 
  # select variables
  dplyr::select (Preg, Gluc, BP,  
                Skin, Insulin, BMI, 
                BP, Age) |>  
  get_summary_stats (type = "common")

# A tibble: 7 × 10
  variable     n   min   max median   iqr   mean     sd    se    ci
  <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
1 Preg       768     0  17      3     5     3.84   3.37 0.122 0.239
2 Gluc       768     0 199    117    41.2 121.    32.0  1.15  2.26 
3 BP         768     0 122     72    18    69.1   19.4  0.698 1.37 
4 Skin       768     0  99     23    32    20.5   16.0  0.576 1.13 
5 Insulin    768     0 846     30.5 127.   79.8  115.   4.16  8.16 
6 BMI        768     0  67.1   32     9.3  32.0    7.88 0.284 0.558
7 Age        768    21  81     29    17    33.2   11.8  0.424 0.833

Correlation

Code

cor_data<- mf |> 
  # select variables
  dplyr::select (Preg, Gluc, BP,  
                Skin, Insulin, BMI, 
                BP, Age) |> 
         cor()
cor_data

               Preg       Gluc         BP        Skin     Insulin        BMI
Preg     1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461 0.01768309
Gluc     0.12945867 1.00000000 0.15258959  0.05732789  0.33135711 0.22107107
BP       0.14128198 0.15258959 1.00000000  0.20737054  0.08893338 0.28180529
Skin    -0.08167177 0.05732789 0.20737054  1.00000000  0.43678257 0.39257320
Insulin -0.07353461 0.33135711 0.08893338  0.43678257  1.00000000 0.19785906
BMI      0.01768309 0.22107107 0.28180529  0.39257320  0.19785906 1.00000000
Age      0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295 0.03624187
                Age
Preg     0.54434123
Gluc     0.26351432
BP       0.23952795
Skin    -0.11397026
Insulin -0.04216295
BMI      0.03624187
Age      1.00000000

Code

corrplot::corrplot(cor_data)

Box/Violine Plots

We can create a nice looking plots with results of ANOVA and post-hoc tests on the same plot (directly on the boxplots). We will use gbetweenstats() function of ggstatsplot package:

Code

p1<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Outcome,
  y = Preg,
  ylab = "Number of times pregnant",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Number of times pregnant") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=12))

p2<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Outcome,
  y = Gluc,
  ylab = "Glucose concentration",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("   Plasma glucose concentration") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=12))

p3<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Outcome,
  y = BMI,
  ylab = "Body mass index",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Body mass index") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=12))


p4<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Outcome,
  y = BP,
  ylab = "Blood Pressure (mm Hg)",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Diastolic Blood Pressure (mm Hg)") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=12))

Code

(p1|p2)/(p3|p4)

Data Processing

Code

df  <- mf |>
  dplyr::mutate_at(vars(Outcome), funs(factor))  |> 
    #normalize the all numerical features 
   dplyr::mutate_at(1:8,  funs((.-min(.))/max(.-min(.)))) |> 
  glimpse()

Rows: 768
Columns: 9
$ Preg    <dbl> 0.35294118, 0.05882353, 0.47058824, 0.05882353, 0.00000000, 0.…
$ Gluc    <dbl> 0.7437186, 0.4271357, 0.9195980, 0.4472362, 0.6884422, 0.58291…
$ BP      <dbl> 0.5901639, 0.5409836, 0.5245902, 0.5409836, 0.3278689, 0.60655…
$ Skin    <dbl> 0.3535354, 0.2929293, 0.0000000, 0.2323232, 0.3535354, 0.00000…
$ Insulin <dbl> 0.00000000, 0.00000000, 0.00000000, 0.11111111, 0.19858156, 0.…
$ BMI     <dbl> 0.5007452, 0.3964232, 0.3472429, 0.4187779, 0.6423249, 0.38152…
$ DPF     <dbl> 0.23441503, 0.11656704, 0.25362938, 0.03800171, 0.94363792, 0.…
$ Age     <dbl> 0.48333333, 0.16666667, 0.18333333, 0.00000000, 0.20000000, 0.…
$ Outcome <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,…

Split Data

Code

seeds = 11076
tr_prop = 0.70
# training data (70% data)
train= ddply(df,.(Outcome),
                 function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test = ddply(df, .(Outcome),
            function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)

Code

print(prop.table(table(train$Outecome)))

numeric(0)

Code

print(prop.table(table(test$Outcome)))


        0         1 
0.6493506 0.3506494

Fit a Probit Model

We will use Probit regression to predict customer feed back based on various attributes related to occupation, family Size, location etc. We will use glm() function with family = binomial(link = "probit")) for probit regression.

Code

fit.probit<-glm(Outcome~., data= train,
             family = binomial(link = "probit"))

Model Summary

Code

summary(fit.probit)


Call:
glm(formula = Outcome ~ ., family = binomial(link = "probit"), 
    data = train)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.6161     0.4543 -10.162  < 2e-16 ***
Preg          1.0582     0.3994   2.650  0.00805 ** 
Gluc          4.4327     0.5196   8.530  < 2e-16 ***
BP           -1.2915     0.4763  -2.712  0.00670 ** 
Skin         -0.3250     0.4827  -0.673  0.50075    
Insulin      -0.1591     0.5677  -0.280  0.77928    
BMI           3.4661     0.6786   5.108 3.25e-07 ***
DPF           1.2675     0.4942   2.565  0.01032 *  
Age           0.7690     0.4320   1.780  0.07507 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 694.17  on 536  degrees of freedom
Residual deviance: 491.29  on 528  degrees of freedom
AIC: 509.29

Number of Fisher Scoring iterations: 5

We can see that the Preg coefficient is 0.062. Here we can infer that for each unit increase in Preg, the predicted probability of having dibetes increases by a factor of exp(0.062) = 1.06, or about 6%. This effect is statistically significant with a p-value of 0.00805.

Code

1-4.9

[1] -3.9

summ function of {jtools} produce summary table of regression models:

Code

jtools::summ(fit.probit)

Observations	537
Dependent variable	Outcome
Type	Generalized linear model
Family	binomial
Link	probit

𝛘²(8)	202.89
p	0.00
Pseudo-R² (Cragg-Uhler)	0.43
Pseudo-R² (McFadden)	0.29
AIC	509.29
BIC	547.86

	Est.	S.E.	z val.	p
(Intercept)	-4.62	0.45	-10.16	0.00
Preg	1.06	0.40	2.65	0.01
Gluc	4.43	0.52	8.53	0.00
BP	-1.29	0.48	-2.71	0.01
Skin	-0.33	0.48	-0.67	0.50
Insulin	-0.16	0.57	-0.28	0.78
BMI	3.47	0.68	5.11	0.00
DPF	1.27	0.49	2.56	0.01
Age	0.77	0.43	1.78	0.08
Standard errors: MLE

Risk Ratio

The tbl_regression() function from the {gtsummary} package takes a regression model object as input and produces a formatted table with Odd-ratio and confidence interval.

Code

tbl_regression(fit.probit,  exp = TRUE)

Characteristic	exp(Beta)	95% CI	p-value
Preg	2.88	1.33, 6.28	0.008
Gluc	84.2	31.9, 229	<0.001
BP	0.27	0.11, 0.69	0.007
Skin	0.72	0.28, 1.86	0.5
Insulin	0.85	0.28, 2.57	0.8
BMI	32.0	9.02, 118	<0.001
DPF	3.55	1.36, 9.37	0.010
Age	2.16	0.93, 5.00	0.075
Abbreviation: CI = Confidence Interval

The tab_model() function of {sjPlot} package also creates HTML tables from regression models:

Code

tab_model(fit.probit)

	Outcome
Predictors	Risk Ratios	CI	p
(Intercept)	0.01	0.00 – 0.02	<0.001
Preg	2.88	1.33 – 6.28	0.008
Gluc	84.16	31.95 – 229.50	<0.001
BP	0.27	0.11 – 0.69	0.007
Skin	0.72	0.28 – 1.86	0.501
Insulin	0.85	0.28 – 2.57	0.779
BMI	32.01	9.02 – 118.44	<0.001
DPF	3.55	1.36 – 9.37	0.010
Age	2.16	0.93 – 5.00	0.075
Observations	537
R² Nagelkerke	0.434

plot_model() function of {sjPlot} package creates plots the estimates from logistic model:

Code

plot_model(fit.probit, vline.color = "red")

Interpretation of Probit Model

We can use reoprt() function of {report} package to further explain the fit.probit object.

Code

report::report(fit.probit)

We fitted a probit model (estimated using ML) to predict Outcome with Preg,
Gluc, BP, Skin, Insulin, BMI, DPF and Age (formula: Outcome ~ Preg + Gluc + BP
+ Skin + Insulin + BMI + DPF + Age). The model's explanatory power is
substantial (Nagelkerke's R2 = 0.43). The model's intercept, corresponding to
Preg = 0, Gluc = 0, BP = 0, Skin = 0, Insulin = 0, BMI = 0, DPF = 0 and Age =
0, is at -4.62 (95% CI [-5.52, -3.76], p < .001). Within this model:

  - The effect of Preg is statistically significant and positive (beta = 1.06,
95% CI [0.29, 1.84], p = 0.008; Std. beta = 0.21, 95% CI [0.06, 0.36])
  - The effect of Gluc is statistically significant and positive (beta = 4.43,
95% CI [3.46, 5.44], p < .001; Std. beta = 0.70, 95% CI [0.55, 0.86])
  - The effect of BP is statistically significant and negative (beta = -1.29, 95%
CI [-2.23, -0.37], p = 0.007; Std. beta = -0.20, 95% CI [-0.34, -0.06])
  - The effect of Skin is statistically non-significant and negative (beta =
-0.33, 95% CI [-1.27, 0.62], p = 0.501; Std. beta = -0.05, 95% CI [-0.20,
0.10])
  - The effect of Insulin is statistically non-significant and negative (beta =
-0.16, 95% CI [-1.26, 0.95], p = 0.779; Std. beta = -0.02, 95% CI [-0.16,
0.12])
  - The effect of BMI is statistically significant and positive (beta = 3.47, 95%
CI [2.20, 4.77], p < .001; Std. beta = 0.42, 95% CI [0.27, 0.58])
  - The effect of DPF is statistically significant and positive (beta = 1.27, 95%
CI [0.31, 2.24], p = 0.010; Std. beta = 0.17, 95% CI [0.04, 0.30])
  - The effect of Age is statistically non-significant and positive (beta = 0.77,
95% CI [-0.07, 1.61], p = 0.075; Std. beta = 0.14, 95% CI [-0.01, 0.30])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald z-distribution approximation.

Model Performance

Code

performance::performance(fit.probit)

# Indices of model performance

AIC     |    AICc |     BIC | Nagelkerke's R2 |  RMSE | Sigma | Log_loss
------------------------------------------------------------------------
509.288 | 509.629 | 547.862 |           0.434 | 0.384 | 1.000 |    0.457

AIC     | Score_log | Score_spherical |   PCP
---------------------------------------------
509.288 |      -Inf |           0.006 | 0.702

Visualization of model assumptions

The package {performance} provides many functions to check model assumptions, like check_collinearity(), check_normality() or check_heteroscedasticity(). To get a comprehensive check, use check_model()

Code

performance::check_model(fit.probit)

Marginal Effects and Adjusted Predictions

The presentation of regression models, typically in the form of tables, is widely accepted as a clear and accessible method for interpreting results. However, for more intricate models that incorporate interaction or transformed terms, such as quadratic or spline terms, the use of raw regression coefficients may prove less effective, resulting in challenges when interpreting outcomes. In such cases, adjusted predictions or marginal means provide a more fitting solution. The use of visual aids can also assist in the comprehension of such effects or predictions, providing an intuitive understanding of the relationship between predictors and outcomes, even for complex models.

If we want the marginal effects for “Land_types”, you may use margins() function of “margins” package:

Code

margins::margins(fit.probit, variables = "Preg")

   Preg
 0.2735

we get the same marginal effect using avg_slopes() function from the {marginaleffects} package

Code

marginaleffects::avg_slopes(fit.probit, variables = "Preg")


 Estimate Std. Error   z Pr(>|z|)   S  2.5 % 97.5 %
    0.273      0.101 2.7  0.00698 7.2 0.0748  0.472

Term: Preg
Type:  response 
Comparison: dY/dX

To calculate marginal effects and adjusted predictions, the predict_response() function is used. This function can return three types of predictions, namely, conditional effects, marginal effects or marginal means, and average marginal effects or counterfactual predictions. You can set the type of prediction you want by using the margin argument.

Code

effect<-ggeffects::predict_response(fit.probit, "Preg", margin = "empirical")
effect

# Average predicted probabilities of Outcome

Preg | Predicted |     95% CI
-----------------------------
0.00 |      0.22 | 0.16, 0.29
0.12 |      0.26 | 0.21, 0.31
0.24 |      0.30 | 0.26, 0.35
0.35 |      0.35 | 0.29, 0.41
0.41 |      0.37 | 0.30, 0.44
0.53 |      0.42 | 0.32, 0.52
0.65 |      0.47 | 0.33, 0.60
1.00 |      0.61 | 0.37, 0.82


Not all rows are shown in the output. Use `print(..., n = Inf)` to show
  all rows.

The relative marginal effects:

Code

effect$predicted[2] - effect$predicted[1]

[1] 0.01891575

{ggeffects} supports labelled data and the plot()- method automatically sets titles, axis - and legend-labels depending on the value and variable labels of the data.

Code

preg.bmi <- predict_response(fit.probit, terms = c("Preg", "BMI"))
plot(preg.bmi, facets = TRUE)

Code

preg.bmi.bp <- predict_response(fit.probit, terms = c("Preg", "BMI", "BP"))
plot(preg.bmi.bp, facets = TRUE)

Code

# select specific levels for grouping terms
ggplot(preg.bmi.bp, aes(x = x, y = predicted, colour = group)) +
  stat_smooth(method = "lm", se = FALSE) +
  facet_wrap(~facet) +
  labs(
    y = get_y_title(preg.bmi.bp),
    x = get_x_title(preg.bmi.bp),
    colour = get_legend_title(preg.bmi.bp)
  )

Relationship between Z-score Outcome and Continuous Covariates

To evaluate the validity the assumption linear relationship between z-score outcome and continuous covariates, we can employ the probit() function to predict the outcome in z-score units. We can then save these predictions to our dataset and create a scatter plot that displays the predicted log odds vs the independent variables. To generate the scatter plot, we should create a new dataset that only contains observations from the model and numeric covariates, excluding missing values.

By plotting the predicted log odds against each independent variable, we can visually examine the relationship between the independent variables and the predicted outcome. The lowess line, which is a smoothing line that estimates the underlying trend in the data, can help us identify any non-linear relationships or patterns in the data.

This approach can help us determine if there are any significant relationships between the independent variables and the predicted outcome, and can help us identify any potential issues with our model assumptions or data.

Code

train.covariates <- train %>%
  select(Preg, Gluc, Insulin, BP, BMI, Age)
# save names of predictors to plug into command below.
  train.predictors <- colnames(train.covariates)

# Save predicted probabilities
  train.covariates$Prob <- fit.probit$fitted.values

# Manually calculate the probit values and tidy data for the plot
train.covariates<- train.covariates %>%
    mutate(probit = qnorm(Prob)) %>%
    select(-Prob) %>%
    gather(key = "predictors", value = "predictor.value", -probit)

# Then you can plot logit values over each of the numeric variables.
  ggplot(train.covariates, aes(y = probit, x = predictor.value))+
    geom_point(size = 0.5, alpha = 0.5) +
    geom_smooth(method = "loess") +
    theme_bw() +
    facet_wrap(~predictors, scales = "free_x") +
    ggtitle("Training Data")

Cross-validation

Code

# Set number of folds
k <- 10
folds <- cut(seq(1, nrow(mf)), breaks = k, labels = FALSE)

cv_probits <- numeric(nrow(mf))  # To store probit values for each observation

for (i in 1:k) {
  # Split the data into training and validation sets
  validation_indices <- which(folds == i, arr.ind = TRUE)
  validation_set <- mf[validation_indices, ]
  training_set <- mf[-validation_indices, ]
  
  # Fit the model on training set
  cv_model <- glm(Outcome~ ., data = mf, family = binomial(link = "probit"))
  
  # Predict probabilities on validation set
  validation_pred <- predict(cv_model, newdata = validation_set, type = "response")
  
  # Store the predicted probabilities
  cv_probits[validation_indices] <- validation_pred
}

# Calculate probit values from predicted probabilities
# Using the qnorm function to get probit values

mf.covariates <- mf %>%
  select(Preg, Gluc, Insulin, BP, BMI, Age)
  mf.predictors <- colnames(mf.covariates)

mf.covariates$probit <- qnorm(cv_probits)


# View the first few probit values
head(mf.covariates)

# A tibble: 6 × 7
   Preg  Gluc Insulin    BP   BMI   Age probit
  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
1     6   148       0    72  33.6    50  0.566
2     1    85       0    66  26.6    31 -1.71 
3     8   183       0    64  23.3    32  0.726
4     1    89      94    66  28.1    21 -1.82 
5     0   137     168    40  43.1    33  1.19 
6     5   116       0    74  25.6    30 -1.04

Code

# Manually calculate the probit values and tidy data for the plot
mf.covariates<- mf.covariates %>%
    gather(key = "predictors", value = "predictor.value", -probit)
# Then you can plot logit values over each of the numeric variables.
  ggplot(mf.covariates, aes(y = probit, x = predictor.value))+
    geom_point(size = 0.5, alpha = 0.5) +
    geom_smooth(method = "loess") +
    theme_bw() +
    facet_wrap(~predictors, scales = "free_x")+
    ggtitle ("Cross-validation")

Prediction at Test Data

The predict() function for logistic models returns the default predictions of log-odds, which are probabilities on the logit scale. When type = response, the function provides the predicted probabilities.

Code

test$probs <- predict(fit.probit, test, returnData = TRUE,  type = "response")

Code

ggpredict(fit.probit, terms = "Preg[2:8, by = 1]")

# Predicted probabilities of Outcome

Preg | Predicted |     95% CI
-----------------------------
   2 |      0.91 | 0.48, 1.00
   3 |      0.99 | 0.59, 1.00
   4 |      1.00 | 0.70, 1.00
   5 |      1.00 | 0.79, 1.00
   6 |      1.00 | 0.86, 1.00
   7 |      1.00 | 0.91, 1.00
   8 |      1.00 | 0.95, 1.00

Adjusted for:
*    Gluc = 0.60
*      BP = 0.57
*    Skin = 0.20
* Insulin = 0.09
*     BMI = 0.47
*     DPF = 0.16
*     Age = 0.19

Code

test$class <- ifelse(test$probs  > 0.5,"1","0")

Code

test.covariates <- test %>%
  select(Preg, Gluc, Insulin, BP, BMI, Age)
# save names of predictors to plug into command below.
  test.predictors <- colnames(test.covariates)

# Save predicted probabilities
  test.covariates$Prob <- predict(fit.probit, test, returnData = TRUE,  type = "response")

# Manually calculate the probit values and tidy data for the plot
test.covariates<- test.covariates %>%
    mutate(probit = qnorm(Prob)) %>%
    select(-Prob) %>%
    gather(key = "predictors", value = "predictor.value", -probit)
# Then you can plot logit values over each of the numeric variables.
  ggplot(test.covariates, aes(y = probit, x = predictor.value))+
    geom_point(size = 0.5, alpha = 0.5) +
    geom_smooth(method = "loess") +
    theme_bw() +
    facet_wrap(~predictors, scales = "free_x")+
    ggtitle ("Test Data")

Summary and Conclusion

This R tutorial provides an overview of the Probit model, a statistical tool for modeling binary outcome variables. It guides users from understanding the theoretical foundations of the Probit model to implementing it in R. The Probit model is used when the dependent variable is binary and assumes that the relationship between independent variables and the probability of the event follows a cumulative normal distribution. The tutorial explains how to fit a model from scratch by calculating the likelihood function and using optimization techniques to obtain parameter estimates. It also demonstrates fitting a Probit model using R’s built-in glm function with the binomial(link = "probit") argument. This method streamlines the modeling process and provides functionalities for diagnostic checks. Participants learn to interpret coefficients, which reflect changes in the Z-score for unit changes in the predictors. To assess model performance, the tutorial employs cross-validation and evaluation on a hold-out test dataset. Cross-validation tests the model’s stability across different data subsets, while the hold-out dataset evaluates accuracy, sensitivity, and specificity. By the end of the tutorial, participants gain a solid understanding of the Probit model, learning to fit models, interpret results, and evaluate performance. This equips them to effectively apply Probit regression in their research, making informed decisions based on the model’s findings.

References

Session Info

Code

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggeffects_2.2.0        marginaleffects_0.25.0 margins_0.3.28        
 [4] performance_0.13.0     report_0.6.1           jtools_2.3.0          
 [7] gtsummary_2.1.0        sjPlot_2.8.17          ggstatsplot_0.13.0    
[10] corrplot_0.95          RColorBrewer_1.1-3     patchwork_1.3.0       
[13] DataExplorer_0.8.3     rstatix_0.7.2          plyr_1.8.9            
[16] lubridate_1.9.4        forcats_1.0.0          stringr_1.5.1         
[19] dplyr_1.1.4            purrr_1.0.4            readr_2.1.5           
[22] tidyr_1.3.1            tibble_3.2.1           ggplot2_3.5.1         
[25] tidyverse_2.0.0       

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1      jsonlite_1.9.0         datawizard_1.0.2      
  [4] correlation_0.8.7      magrittr_2.0.3         TH.data_1.1-3         
  [7] estimability_1.5.1     farver_2.1.2           rmarkdown_2.29        
 [10] vctrs_0.6.5            paletteer_1.6.0        base64enc_0.1-3       
 [13] effectsize_1.0.0       htmltools_0.5.8.1      curl_6.2.1            
 [16] haven_2.5.4            broom_1.0.7            Formula_1.2-5         
 [19] sjmisc_2.8.10          sass_0.4.9             parallelly_1.42.0     
 [22] htmlwidgets_1.6.4      sandwich_3.1-1         emmeans_1.11.0        
 [25] zoo_1.8-13             gt_0.11.1              commonmark_1.9.2      
 [28] networkD3_0.4          igraph_2.1.4           lifecycle_1.0.4       
 [31] pkgconfig_2.0.3        sjlabelled_1.2.0       Matrix_1.7-1          
 [34] R6_2.6.1               fastmap_1.2.0          future_1.34.0         
 [37] digest_0.6.37          colorspace_2.1-1       furrr_0.3.1           
 [40] rematch2_2.1.2         prismatic_1.1.2        labeling_0.4.3        
 [43] timechange_0.3.0       mgcv_1.9-1             abind_1.4-8           
 [46] compiler_4.4.2         bit64_4.6.0-1          withr_3.0.2           
 [49] pander_0.6.5           backports_1.5.0        carData_3.0-5         
 [52] prediction_0.3.18      broom.mixed_0.2.9.6    MASS_7.3-64           
 [55] sjstats_0.19.0         tools_4.4.2            statsExpressions_1.6.2
 [58] glue_1.8.0             nlme_3.1-166           grid_4.4.2            
 [61] checkmate_2.3.2        see_0.11.0             generics_0.1.3        
 [64] gtable_0.3.6           labelled_2.14.0        tzdb_0.4.0            
 [67] data.table_1.17.0      hms_1.1.3              xml2_1.3.6            
 [70] car_3.1-3              utf8_1.2.4             ggrepel_0.9.6         
 [73] markdown_1.13          pillar_1.10.1          vroom_1.6.5           
 [76] splines_4.4.2          lattice_0.22-5         survival_3.8-3        
 [79] bit_4.5.0.1            tidyselect_1.2.1       knitr_1.49            
 [82] gridExtra_2.3          svglite_2.1.3          xfun_0.51             
 [85] stringi_1.8.4          yaml_2.3.10            kableExtra_1.4.0      
 [88] evaluate_1.0.3         codetools_0.2-20       cli_3.6.4             
 [91] RcppParallel_5.1.10    xtable_1.8-4           parameters_0.24.2     
 [94] systemfonts_1.2.1      munsell_0.5.1          Rcpp_1.0.14           
 [97] globals_0.16.3         zeallot_0.1.0          coda_0.19-4.1         
[100] parallel_4.4.2         rstantools_2.4.0       bayestestR_0.15.2     
[103] listenv_0.9.1          viridisLite_0.4.2      broom.helpers_1.20.0  
[106] mvtnorm_1.3-3          scales_1.3.0           insight_1.1.0         
[109] crayon_1.5.3           rlang_1.1.5            multcomp_1.4-28       
[112] cards_0.5.0.9000