6.5. Hurdle Model

In count data analysis, we often encounter datasets with excessive zeros. These zeros can be problematic for standard Poisson or negative binomial models, as they typically assume that zero counts arise from the same process as positive counts. However, in many cases, zeros and positive counts result from distinct processes. For example, in modeling the number of visits to a gym, some people never visit (true zeros), while others visit at varying frequencies. The hurdle model is designed to handle this by splitting the modeling process into two parts: one to handle zero vs. non-zero outcomes and another to model the positive counts.

This tutorial introduces hurdle models in R, where we’ll start with a model overview, explaining how hurdle models work to separate zero and count components. We’ll then cover how to fit hurdle models from scratch to understand their two-part structure, and we’ll use the {pscl} package for efficient model fitting. Finally, we’ll discuss model comparison techniques, such as the likelihood ratio and Vuong tests, to assess fit quality and evaluate prediction performance on test data to validate the model.

Overview

A hurdle model is a type of statistical model used to analyze count data, particularly when the data has an excess number of zeros (zero-inflated data). It is commonly applied in scenarios where there are two separate processes: one that governs whether an event occurs (i.e., whether the count is zero or non-zero) and another that determines the count value once the event occurs.

Key Features of the Hurdle Model

  1. Two-Part Structure:
    • First part (Hurdle component): This models the probability that the count is greater than zero (i.e., whether the “hurdle” is crossed). Typically, this is done using a binary model, like logistic regression or probit regression.
    • Second part (Count component): This models the count of events given that the count is greater than zero. Common models for this part include Poisson or negative binomial regression, depending on the distribution of the non-zero counts.
  2. Zero-Inflation Handling:
    • The hurdle model is specifically designed to address data where zeros are more frequent than would be expected from standard count models, like Poisson or negative binomial models. This makes it useful in contexts like healthcare (e.g., number of hospital visits), economics (e.g., number of purchases), or environmental science (e.g., number of rare species observed).

Example

Consider data on the number of times people visit a doctor in a year. Many people may not visit a doctor at all (zeros), while others visit a variable number of times. In this case:

  • The first part of the hurdle model would predict the likelihood that someone visits the doctor at least once.
  • The second part would model the number of visits, conditional on having at least one visit.

Difference from Zero-Inflated Models

Although both hurdle models and zero-inflated models address excess zeros, they are conceptually different:

  • Hurdle models assume all zeros arise from a single, distinct process (e.g., some people never visit the doctor).
  • Zero-inflated models assume there are two types of zeros: “structural zeros” (those who never visit the doctor) and “sampling zeros” (those who could visit but didn’t in a given period).

As discussed above, the hurdle model consists of two components:

1. Binary Component: A logistic (or binomial) regression that models whether the count is zero or non-zero (i.e., whether the hurdle is crossed).

This part of the model predicts whether the outcome is zero or positive.

\[ P(Y_i = 0 | X) = \frac{1}{1 + \exp(X\beta)} \]

Where:

  • \(Y_i = 0\) indicates that the count is zero.
  • \(X\) represents the predictor variables.
  • \(\beta\) represents the coefficients estimated by logistic regression.

This component uses logistic regression to estimate the probability of zeros (whether or not the “hurdle” is crossed).

2. Truncated Count Component (Positive count): Once the hurdle has been crossed (i.e., we have a positive count), the second part of the model predicts the non-zero counts. This part models only the positive counts and ignores the zeros.

For a truncated Poisson regression, the probability mass function is:

\[ P(Y_i = k | Y_i > 0, X) = \frac{\lambda^k e^{-\lambda}}{k!(1 - e^{-\lambda})} \]

Where:

  • \(k > 0\) represents the non-zero counts.
  • \(\lambda = \exp(X\gamma)\) is the mean of the Poisson distribution, with \(X\) as the predictors and \(\gamma\) as the coefficients.
  • \(e^{-\lambda}\) is the probability of zero in the untruncated Poisson model, so dividing by \(1 - e^{-\lambda}\) renormalizes the pmf over the positive counts.
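As a quick numerical check (not part of the tutorial’s own code), the truncated Poisson pmf above can be evaluated in base R by renormalizing `dpois()`; the names `dtruncpois` and `lam` are my own:

```r
# Truncated Poisson pmf: divide the ordinary pmf by P(Y > 0) = 1 - exp(-lambda)
dtruncpois <- function(k, lam) {
  dpois(k, lam) / (1 - exp(-lam))
}

lam <- 1.5
k <- 1:100                     # positive counts only
probs <- dtruncpois(k, lam)

sum(probs)                     # probabilities over k > 0 sum to 1
sum(k * probs)                 # truncated mean equals lam / (1 - exp(-lam))
```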

For a truncated Negative Binomial regression, the equation would incorporate the dispersion parameter \(\theta\) to handle overdispersion:

\[ P(Y_i = k | Y_i > 0, X) = \frac{\dfrac{\Gamma(k + \theta^{-1})}{k! \, \Gamma(\theta^{-1})} \left( \dfrac{\lambda}{\lambda + \theta^{-1}} \right)^k \left( \dfrac{\theta^{-1}}{\lambda + \theta^{-1}} \right)^{\theta^{-1}}}{1 - \left( \dfrac{\theta^{-1}}{\lambda + \theta^{-1}} \right)^{\theta^{-1}}} \]

Where:

  • \(\theta\) is the dispersion parameter that accounts for overdispersion in the data.
  • \(\Gamma\) is the Gamma function.
  • The denominator \(1 - P(Y_i = 0)\) truncates the distribution to the positive counts, just as in the Poisson case.

The full likelihood for the Hurdle Model combines the two parts:

\[ P(Y_i | X) = \begin{cases} P(Y_i = 0 | X) & \text{if } Y_i = 0 \\ P(Y_i = k | Y_i > 0, X) & \text{if } Y_i > 0 \end{cases} \]
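The piecewise likelihood translates directly into code. The sketch below (the names `dhurdle`, `p0`, and `lam` are my own, and the count part is taken to be truncated Poisson) simply combines the two pieces:

```r
# Hurdle pmf: the zero part contributes p0 at k = 0; the remaining mass
# (1 - p0) is spread over k > 0 according to the truncated Poisson pmf
dhurdle <- function(k, p0, lam) {
  ifelse(k == 0,
         p0,
         (1 - p0) * dpois(k, lam) / (1 - exp(-lam)))
}

sum(dhurdle(0:100, p0 = 0.6, lam = 1.5))  # total probability is 1
```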

Fit a Hurdle model

To fit a hurdle model manually in R on synthetic data, without relying on any packages, you implement both the binary and count components from scratch. Below is a complete example, including data generation, model fitting for the zero and count parts, and predictions.

Data Generation

The counts variable simulates zero-inflated count data: counts are drawn from a Poisson distribution, and excess zeros are then introduced with a fixed probability.

Code
# Step 1: Generate Synthetic Data
set.seed(42)
n <- 200  # Number of observations
x <- rnorm(n)  # Independent variable
lambda <- exp(0.5 * x)  # Rate for Poisson distribution
zero_inflation_prob <- 0.7  # Probability of excess zeros

# Generate counts based on the zero-inflated process
counts <- rpois(n, lambda)  # Poisson distributed counts
counts <- ifelse(runif(n) < zero_inflation_prob, 0, counts)  # Introduce excess zeros

# Combine into a data frame
data <- data.frame(counts = counts, x = x)
head(data)
  counts          x
1      0  1.3709584
2      0 -0.5646982
3      1  0.3631284
4      0  0.6328626
5      0  0.4042683
6      0 -0.1061245

Fit a Binary Model: Implement logistic regression manually to predict whether the count is zero or non-zero.

Code
# Step 2: Fit the Binary Model (Logistic Regression)
# Model to predict whether counts are greater than zero
binary_model_fit <- function(x, y) {
  # Create the logistic function
  logistic <- function(beta0, beta1, x) {
    1 / (1 + exp(-(beta0 + beta1 * x)))
  }
  
  # Initialize coefficients
  beta <- c(0, 0)  # beta0, beta1
  learning_rate <- 0.01
  n_iterations <- 1000
  
  # Gradient Descent to optimize beta
  for (i in 1:n_iterations) {
    p_hat <- logistic(beta[1], beta[2], x)
    
    # Gradients of a squared-error criterion (note: the usual logistic
    # log-likelihood score would be (y - p_hat) and (y - p_hat) * x)
    grad_beta0 <- sum((y - p_hat) * p_hat * (1 - p_hat)) / length(y)
    grad_beta1 <- sum((y - p_hat) * p_hat * (1 - p_hat) * x) / length(y)
    
    # Update the coefficients
    beta[1] <- beta[1] + learning_rate * grad_beta0
    beta[2] <- beta[2] + learning_rate * grad_beta1
  }
  return(beta)
}

# Fit binary model
binary_response <- as.numeric(data$counts > 0)  # 1 if counts > 0, else 0
binary_coefficients <- binary_model_fit(data$x, binary_response)
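As a sanity check, the same binary model can be fit by maximum likelihood with base R’s `glm()`. The snippet below regenerates the Step 1 synthetic data so it runs on its own; expect the coefficients to differ somewhat from the hand-rolled routine above, which (as written) follows the gradient of a squared-error criterion for a fixed number of fixed-step updates:

```r
# Regenerate the Step 1 synthetic data (same seed, same draw order)
set.seed(42)
n <- 200
x <- rnorm(n)
counts <- rpois(n, exp(0.5 * x))
counts <- ifelse(runif(n) < 0.7, 0, counts)

# Maximum-likelihood logistic regression for the zero vs. non-zero part
glm_binary <- glm(I(counts > 0) ~ x, family = binomial)
coef(glm_binary)
```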

Fit a Count Model: Implement Poisson regression manually for positive counts.

Code
# Step 3: Fit the Count Model (Poisson Regression)
# Only use positive counts for fitting
poisson_model_fit <- function(x, y) {
  # Create the log link function for Poisson regression
  log_link <- function(beta0, beta1, x) {
    beta0 + beta1 * x
  }
  
  # Initialize coefficients
  beta <- c(0, 0)  # beta0, beta1
  learning_rate <- 0.01
  n_iterations <- 1000
  
  # Gradient Descent to optimize beta
  for (i in 1:n_iterations) {
    lambda_hat <- exp(log_link(beta[1], beta[2], x))
    
    # Gradients (note: the exact Poisson log-likelihood score would be
    # (y - lambda_hat) and (y - lambda_hat) * x, without dividing by lambda_hat)
    grad_beta0 <- sum((y - lambda_hat) / lambda_hat) / length(y)
    grad_beta1 <- sum((y - lambda_hat) / lambda_hat * x) / length(y)
    
    # Update the coefficients
    beta[1] <- beta[1] + learning_rate * grad_beta0
    beta[2] <- beta[2] + learning_rate * grad_beta1
  }
  return(beta)
}

# Fit Poisson model using positive counts
positive_counts <- data[data$counts > 0, ]
poisson_coefficients <- poisson_model_fit(positive_counts$x, positive_counts$counts)
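Again a cross-check is possible with `glm()`, fitting a Poisson regression to the positive counts. Note this is only an approximation to the hurdle count part, because `glm()` maximizes the untruncated Poisson likelihood; a proper zero-truncated fit would need something like the `pospoisson` family in {VGAM}. The data regeneration mirrors Step 1:

```r
# Regenerate the Step 1 synthetic data (same seed, same draw order)
set.seed(42)
n <- 200
x <- rnorm(n)
counts <- rpois(n, exp(0.5 * x))
counts <- ifelse(runif(n) < 0.7, 0, counts)

# Untruncated Poisson fit on the positive counts only (an approximation:
# the true hurdle count component uses a zero-truncated likelihood)
pos <- counts > 0
glm_count <- glm(counts[pos] ~ x[pos], family = poisson)
coef(glm_count)
```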

Combine Predictions: Use the predictions from both models to get the final predicted counts.

Code
# Step 4: Combine Predictions
# Predict probabilities of non-zero counts
logistic_function <- function(beta0, beta1, x) {
  1 / (1 + exp(-(beta0 + beta1 * x)))
}
data$prob_nonzero <- logistic_function(binary_coefficients[1], binary_coefficients[2], data$x)

# Predict expected counts for positive cases
data$expected_counts <- ifelse(data$counts > 0, exp(poisson_coefficients[1] + poisson_coefficients[2] * data$x), 0)

# Combine predictions (note: conditioning on the observed counts makes this an
# in-sample illustration rather than a genuine out-of-sample prediction)
data$hurdle_predicted_counts <- ifelse(data$counts > 0, data$expected_counts * data$prob_nonzero, 0)

# View results
head(data)
  counts          x prob_nonzero expected_counts hurdle_predicted_counts
1      0  1.3709584    0.3870800        0.000000               0.0000000
2      0 -0.5646982    0.3307332        0.000000               0.0000000
3      1  0.3631284    0.3572533        1.771193               0.6327645
4      0  0.6328626    0.3651391        0.000000               0.0000000
5      0  0.4042683    0.3584512        0.000000               0.0000000
6      0 -0.1061245    0.3437189        0.000000               0.0000000
Code
# Summary statistics
summary(data)
     counts            x             prob_nonzero    expected_counts
 Min.   :0.000   Min.   :-2.99309   Min.   :0.2665   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:-0.61011   1st Qu.:0.3295   1st Qu.:0.000  
 Median :0.000   Median :-0.01643   Median :0.3463   Median :0.000  
 Mean   :0.285   Mean   :-0.02748   Mean   :0.3465   Mean   :0.283  
 3rd Qu.:0.000   3rd Qu.: 0.63363   3rd Qu.:0.3652   3rd Qu.:0.000  
 Max.   :6.000   Max.   : 2.70189   Max.   :0.4278   Max.   :3.228  
 hurdle_predicted_counts
 Min.   :0.0000         
 1st Qu.:0.0000         
 Median :0.0000         
 Mean   :0.1034         
 3rd Qu.:0.0000         
 Max.   :1.2960         
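For out-of-sample prediction, where the observed count is unavailable, the standard hurdle expectation multiplies the probability of clearing the hurdle by the truncated-Poisson mean, \(E[Y \mid X] = P(Y > 0 \mid X) \cdot \lambda / (1 - e^{-\lambda})\). A minimal helper (the name `hurdle_mean` is my own) makes this concrete:

```r
# Expected count under a Poisson hurdle model:
# P(Y > 0) times the mean of the zero-truncated Poisson
hurdle_mean <- function(p_nonzero, lam) {
  p_nonzero * lam / (1 - exp(-lam))
}

hurdle_mean(p_nonzero = 0.5, lam = 1)  # about 0.791
```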

Fit a Hurdle model in R

To fit a hurdle model using the hurdle() function from the {pscl} package in R with the NMES1988 dataset, follow the steps below:

Install R-packages

The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:

Code
packages <- c('tidyverse',
              'plyr',
              'DataExplorer',
              'dlookr',
              'rstatix',
              'gtsummary',
              'performance',
              'jtools',
              'margins',
              'marginaleffects',
              'ggeffects',
              'patchwork',
              'Metrics',
              'ggpmisc',
              'epiDisplay',
              'sandwich',
              'AER',
              'VGAM',
              'MASS',
              'pscl')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R-packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:pscl"            "package:VGAM"           
 [3] "package:splines"         "package:stats4"         
 [5] "package:AER"             "package:lmtest"         
 [7] "package:zoo"             "package:car"            
 [9] "package:carData"         "package:sandwich"       
[11] "package:epiDisplay"      "package:nnet"           
[13] "package:MASS"            "package:survival"       
[15] "package:foreign"         "package:ggpmisc"        
[17] "package:ggpp"            "package:Metrics"        
[19] "package:patchwork"       "package:ggeffects"      
[21] "package:marginaleffects" "package:margins"        
[23] "package:jtools"          "package:performance"    
[25] "package:gtsummary"       "package:rstatix"        
[27] "package:dlookr"          "package:DataExplorer"   
[29] "package:plyr"            "package:lubridate"      
[31] "package:forcats"         "package:stringr"        
[33] "package:dplyr"           "package:purrr"          
[35] "package:readr"           "package:tidyr"          
[37] "package:tibble"          "package:ggplot2"        
[39] "package:tidyverse"       "package:stats"          
[41] "package:graphics"        "package:grDevices"      
[43] "package:utils"           "package:datasets"       
[45] "package:methods"         "package:base"           

Data

We will use the physician office visits dataset (NMES1988) from the {AER} package. It represents a sample of 4,406 individuals aged 66 and over who were covered by Medicare in 1988. One of the variables in the data is the number of physician office visits. To create a model for the number of visits using some of the other variables in the dataset, we start by loading the data. You may also need to install the AER package.

A data frame containing 4,406 observations on 19 variables. We will use the following variables:

  • visits - Number of physician office visits.
  • hospital - Number of hospital stays.
  • health - Factor indicating self-perceived health status; levels are “poor”, “average” (reference category), and “excellent”.
  • chronic - Number of chronic conditions.
  • age - Age in years (divided by 10).
  • afam - Factor. Is the individual African-American?
  • gender - Factor indicating gender.
  • married - Factor. Is the individual married?
  • school - Number of years of education.
  • income - Family income in USD 10,000.
  • employed - Factor. Is the individual employed?
  • insurance - Factor. Is the individual covered by private insurance?
  • medicaid - Factor. Is the individual covered by Medicaid?
Code
# load the data
data(NMES1988)
# select variables
df<-NMES1988 |> 
  dplyr::select(visits,
                hospital,
                health, 
                chronic,  
                age, 
                afam, 
                gender, 
                married, 
                school, 
                income,
                employed, 
                insurance, 
                medicaid)  

# split data, stratified by gender and afam
tr_prop = 0.70

# training data (70% data)
train= ddply(df,.(gender, afam),
                 function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test = ddply(df, .(gender, afam),
            function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)

Fit a hurdle Model

Now, we’ll fit a hurdle model using the hurdle() function from the {pscl} package. We’ll model the binary part (whether or not the person visited the doctor) and the count part (how many visits, given that they visited). We use dist = "poisson" since the counts are assumed to follow a Poisson distribution.

Code
# Fit the hurdle model using the pscl package
fit.hurdle.pois <- hurdle(
              visits  ~ 
                hospital +
                health +
                chronic +
                age +
                afam +
                gender +
                married +
                school +
                income +
                employed +
                insurance +
                medicaid,
                data = train, 
                 zero.dist = "binomial",
                 dist =  "poisson")
# Summary of the hurdle model
summary(fit.hurdle.pois)

Call:
hurdle(formula = visits ~ hospital + health + chronic + age + afam + 
    gender + married + school + income + employed + insurance + medicaid, 
    data = train, dist = "poisson", zero.dist = "binomial")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-4.5649 -1.1383 -0.4630  0.5496 19.3241 

Count model coefficients (truncated poisson with log link):
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.920082   0.106057  18.104  < 2e-16 ***
hospital         0.152069   0.007313  20.794  < 2e-16 ***
healthpoor       0.214664   0.021576   9.949  < 2e-16 ***
healthexcellent -0.361636   0.038574  -9.375  < 2e-16 ***
chronic          0.091637   0.005860  15.637  < 2e-16 ***
age             -0.067319   0.012989  -5.183 2.19e-07 ***
afamyes         -0.013614   0.027013  -0.504  0.61428    
gendermale      -0.057532   0.017816  -3.229  0.00124 ** 
marriedyes      -0.074486   0.018247  -4.082 4.46e-05 ***
school           0.016566   0.002357   7.028 2.09e-12 ***
income          -0.001941   0.002886  -0.672  0.50131    
employedyes     -0.018341   0.027876  -0.658  0.51056    
insuranceyes     0.158768   0.024306   6.532 6.48e-11 ***
medicaidyes      0.211406   0.030955   6.830 8.52e-12 ***
Zero hurdle model coefficients (binomial with logit link):
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -1.562364   0.700684  -2.230  0.02576 *  
hospital         0.345241   0.115328   2.994  0.00276 ** 
healthpoor      -0.117102   0.190984  -0.613  0.53977    
healthexcellent -0.330736   0.169567  -1.950  0.05112 .  
chronic          0.587699   0.055765  10.539  < 2e-16 ***
age              0.186709   0.088034   2.121  0.03393 *  
afamyes         -0.349520   0.154264  -2.266  0.02347 *  
gendermale      -0.452056   0.115489  -3.914 9.07e-05 ***
marriedyes       0.208622   0.122905   1.697  0.08962 .  
school           0.060100   0.014993   4.009 6.11e-05 ***
income           0.046390   0.025866   1.793  0.07290 .  
employedyes     -0.006166   0.172204  -0.036  0.97144    
insuranceyes     0.673297   0.132793   5.070 3.97e-07 ***
medicaidyes      0.452972   0.202256   2.240  0.02512 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 20 
Log-likelihood: -1.099e+04 on 28 Df

Alternatively, you could use a negative binomial distribution if overdispersion is present: set dist = "negbin" if you expect overdispersion in the count data.

Code
# Fit the hurdle model using the pscl package
fit.hurdle.nb <- hurdle(
                visits  ~ 
                hospital +
                health +
                chronic +
                age +
                afam +
                gender +
                married +
                school +
                income +
                employed +
                insurance +
                medicaid,
                data = train, 
                zero.dist = "binomial",
                dist =  "negbin")
summary(fit.hurdle.nb )

Call:
hurdle(formula = visits ~ hospital + health + chronic + age + afam + 
    gender + married + school + income + employed + insurance + medicaid, 
    data = train, dist = "negbin", zero.dist = "binomial")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.1992 -0.7122 -0.2742  0.3281 13.3177 

Count model coefficients (truncated negbin with log link):
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.7525249  0.2597264   6.748 1.50e-11 ***
hospital         0.2097844  0.0254244   8.251  < 2e-16 ***
healthpoor       0.2715317  0.0562376   4.828 1.38e-06 ***
healthexcellent -0.4137847  0.0785668  -5.267 1.39e-07 ***
chronic          0.1127509  0.0148089   7.614 2.66e-14 ***
age             -0.0688877  0.0321429  -2.143 0.032100 *  
afamyes         -0.0264495  0.0652791  -0.405 0.685348    
gendermale      -0.0520124  0.0428769  -1.213 0.225105    
marriedyes      -0.1016623  0.0446715  -2.276 0.022859 *  
school           0.0181822  0.0054619   3.329 0.000872 ***
income          -0.0006644  0.0071340  -0.093 0.925799    
employedyes     -0.0112664  0.0652961  -0.173 0.863011    
insuranceyes     0.1701965  0.0567624   2.998 0.002714 ** 
medicaidyes      0.2104205  0.0769961   2.733 0.006278 ** 
Log(theta)       0.3910325  0.0507931   7.699 1.38e-14 ***
Zero hurdle model coefficients (binomial with logit link):
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -1.562364   0.700684  -2.230  0.02576 *  
hospital         0.345241   0.115328   2.994  0.00276 ** 
healthpoor      -0.117102   0.190984  -0.613  0.53977    
healthexcellent -0.330736   0.169567  -1.950  0.05112 .  
chronic          0.587699   0.055765  10.539  < 2e-16 ***
age              0.186709   0.088034   2.121  0.03393 *  
afamyes         -0.349520   0.154264  -2.266  0.02347 *  
gendermale      -0.452056   0.115489  -3.914 9.07e-05 ***
marriedyes       0.208622   0.122905   1.697  0.08962 .  
school           0.060100   0.014993   4.009 6.11e-05 ***
income           0.046390   0.025866   1.793  0.07290 .  
employedyes     -0.006166   0.172204  -0.036  0.97144    
insuranceyes     0.673297   0.132793   5.070 3.97e-07 ***
medicaidyes      0.452972   0.202256   2.240  0.02512 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.4785
Number of iterations in BFGS optimization: 22 
Log-likelihood: -8394 on 29 Df
  • The binary part of the hurdle model estimates the factors associated with the likelihood of visiting the doctor at all.

  • The count part estimates the factors associated with the number of doctor visits for those who visited at least once.

You can inspect the coefficients and interpret their significance:

  • A positive coefficient in the binary part (logistic regression) means the predictor increases the likelihood of visiting the doctor.

  • A positive coefficient in the count part means the predictor increases the number of visits among those who visit the doctor.
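Exponentiating a coefficient gives an odds ratio in the binary part and a rate ratio in the count part. Using the insuranceyes estimates from the Poisson hurdle summary above as an example:

```r
# insuranceyes estimates taken from the Poisson hurdle summary above
exp(0.673297)  # zero part: odds of any visit roughly 1.96x higher with insurance
exp(0.158768)  # count part: about 17% more visits among those who do visit
```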

Each component of a hurdle model can have a different set of predictors. We can specify this in the hurdle() function by using “|” in the model formula. For example, suppose we want to fit the zero hurdle component using only the insurance and gender predictors. We can do that as follows:

The model below fits the count model (visits regressed on all other variables) combined with a zero hurdle model in which visits is regressed on gender and insurance only.

Code
# Fit the hurdle model using the pscl package
fit.hurdle.nb.01 <- hurdle(
                visits ~ . | gender + insurance, 
                data = train, 
                zero.dist = "binomial",
                dist =  "negbin")
summary(fit.hurdle.nb.01)

Call:
hurdle(formula = visits ~ . | gender + insurance, data = train, dist = "negbin", 
    zero.dist = "binomial")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.1094 -0.7330 -0.2616  0.3502 13.2554 

Count model coefficients (truncated negbin with log link):
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.7525249  0.2597264   6.748 1.50e-11 ***
hospital         0.2097844  0.0254244   8.251  < 2e-16 ***
healthpoor       0.2715317  0.0562376   4.828 1.38e-06 ***
healthexcellent -0.4137847  0.0785668  -5.267 1.39e-07 ***
chronic          0.1127509  0.0148089   7.614 2.66e-14 ***
age             -0.0688877  0.0321429  -2.143 0.032100 *  
afamyes         -0.0264495  0.0652791  -0.405 0.685348    
gendermale      -0.0520124  0.0428769  -1.213 0.225105    
marriedyes      -0.1016623  0.0446715  -2.276 0.022859 *  
school           0.0181822  0.0054619   3.329 0.000872 ***
income          -0.0006644  0.0071340  -0.093 0.925799    
employedyes     -0.0112664  0.0652961  -0.173 0.863011    
insuranceyes     0.1701965  0.0567624   2.998 0.002714 ** 
medicaidyes      0.2104205  0.0769961   2.733 0.006278 ** 
Log(theta)       0.3910325  0.0507931   7.699 1.38e-14 ***
Zero hurdle model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.28807    0.09831  13.102  < 2e-16 ***
gendermale   -0.40401    0.10023  -4.031 5.56e-05 ***
insuranceyes  0.76312    0.10774   7.083 1.41e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.4785
Number of iterations in BFGS optimization: 22 
Log-likelihood: -8512 on 18 Df

Model Comparison

Likelihood Ratio Test (LRT)

Code
# 1. Extract log-likelihoods
logLik.hurdle.pois <- logLik(fit.hurdle.pois)
logLik.hurdle.nb<- logLik(fit.hurdle.nb)
# 2. Compute the LRT statistic
lrt_stat <- 2 * (logLik.hurdle.nb- logLik.hurdle.pois)
# 3. Degrees of freedom difference (df)
df_diff <- attr(logLik.hurdle.nb, "df") - attr(logLik.hurdle.pois, "df")
# 4. Compute p-value (chi-squared distribution)
p_value <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
# Output LRT statistic and p-value
cat("LRT Statistic:", lrt_stat, "\n")
LRT Statistic: 5192.7 
Code
cat("Degrees of Freedom Difference:", df_diff, "\n")
Degrees of Freedom Difference: 1 
Code
cat("p-value:", p_value, "\n")
p-value: 0 

Vuong test

Code
vuong(fit.hurdle.nb,fit.hurdle.pois)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                    11.86906 model1 > model2 < 2.22e-16
AIC-corrected          11.86906 model1 > model2 < 2.22e-16
BIC-corrected          11.86906 model1 > model2 < 2.22e-16

Prediction Performance

The predict() function will be used to predict the number of physician office visits for the test data. This helps validate the predictive accuracy of these regression models.

Code
# Prediction
test$Pred.hurdle.pois<-predict(fit.hurdle.pois, test, type = "response")
test$Pred.hurdle.nb<-predict(fit.hurdle.nb, test, type = "response")

# RMSE
rmse.pois<-Metrics::rmse(test$visits, test$Pred.hurdle.pois)
rmse.nb<-Metrics::rmse(test$visits, test$Pred.hurdle.nb)

cat("Hurdle RMSE Poisson:", rmse.pois, "\n")
Hurdle RMSE Poisson: 6.977813 
Code
cat("Hurdle RMSE NB:", rmse.nb, "\n")
Hurdle RMSE NB: 7.040287 
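Metrics::rmse() simply computes the root mean squared error; the equivalent base-R one-liner is a useful cross-check when the package is unavailable (the name `rmse_manual` is my own):

```r
# RMSE by hand: square root of the mean squared prediction error
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse_manual(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```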

Summary and Conclusions

In this tutorial, we explored the hurdle model in R, a powerful tool for analyzing count data with an excess of zeros, commonly found in healthcare, economics, and other fields. We applied the hurdle model using both Poisson and negative binomial distributions to model two key processes: whether an individual visits the doctor at all and, if they do, how often they visit.

By fitting and interpreting the hurdle model in R, we’ve demonstrated how to address complex zero-inflated count data scenarios, providing a clearer understanding of the relationships between predictors and outcomes. This method can be extended to other real-world problems where excess zeros and over-dispersion are prevalent.


Session Info

Code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pscl_1.5.9             VGAM_1.1-13            AER_1.2-14            
 [4] lmtest_0.9-40          zoo_1.8-13             car_3.1-3             
 [7] carData_3.0-5          sandwich_3.1-1         epiDisplay_3.5.0.2    
[10] nnet_7.3-20            MASS_7.3-64            survival_3.8-3        
[13] foreign_0.8-88         ggpmisc_0.6.1          ggpp_0.5.8-1          
[16] Metrics_0.1.4          patchwork_1.3.0        ggeffects_2.2.0       
[19] marginaleffects_0.25.0 margins_0.3.28         jtools_2.3.0          
[22] performance_0.13.0     gtsummary_2.1.0        rstatix_0.7.2         
[25] dlookr_0.6.3           DataExplorer_0.8.3     plyr_1.8.9            
[28] lubridate_1.9.4        forcats_1.0.0          stringr_1.5.1         
[31] dplyr_1.1.4            purrr_1.0.4            readr_2.1.5           
[34] tidyr_1.3.1            tibble_3.2.1           ggplot2_3.5.1         
[37] tidyverse_2.0.0       

loaded via a namespace (and not attached):
 [1] rstudioapi_0.17.1       jsonlite_1.9.0          magrittr_2.0.3         
 [4] farver_2.1.2            rmarkdown_2.29          vctrs_0.6.5            
 [7] htmltools_0.5.8.1       polynom_1.4-1           curl_6.2.1             
[10] broom_1.0.7             Formula_1.2-5           parallelly_1.42.0      
[13] htmlwidgets_1.6.4       networkD3_0.4           igraph_2.1.4           
[16] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
[19] Matrix_1.7-1            R6_2.6.1                fastmap_1.2.0          
[22] future_1.34.0           shiny_1.10.0            digest_0.6.37          
[25] showtext_0.9-7          colorspace_2.1-1        furrr_0.3.1            
[28] timechange_0.3.0        abind_1.4-8             compiler_4.4.2         
[31] fontquiver_0.2.1        withr_3.0.2             pander_0.6.5           
[34] backports_1.5.0         prediction_0.3.18       Rttf2pt1_1.3.12        
[37] broom.mixed_0.2.9.6     quantreg_6.1            tools_4.4.2            
[40] reactable_0.4.4         httpuv_1.6.15           extrafontdb_1.0        
[43] glue_1.8.0              nlme_3.1-166            promises_1.3.2         
[46] grid_4.4.2              generics_0.1.3          gtable_0.3.6           
[49] tzdb_0.4.0              data.table_1.17.0       hms_1.1.3              
[52] xml2_1.3.6              pillar_1.10.1           later_1.4.1            
[55] lattice_0.22-5          showtextdb_3.0          SparseM_1.84-2         
[58] tidyselect_1.2.1        pagedown_0.22           fontLiberation_0.1.0   
[61] knitr_1.49              fontBitstreamVera_0.1.1 gridExtra_2.3          
[64] svglite_2.1.3           xfun_0.51               stringi_1.8.4          
[67] yaml_2.3.10             kableExtra_1.4.0        evaluate_1.0.3         
[70] codetools_0.2-20        extrafont_0.19          gdtools_0.4.1          
[73] cli_3.6.4               xtable_1.8-4            systemfonts_1.2.1      
[76] munsell_0.5.1           Rcpp_1.0.14             globals_0.16.3         
[79] parallel_4.4.2          MatrixModels_0.5-3      listenv_0.9.1          
[82] viridisLite_0.4.2       hrbrthemes_0.8.7        scales_1.3.0           
[85] sysfonts_0.8.9          insight_1.1.0           rlang_1.1.5