In count data analysis, we often encounter datasets with excessive zeros. These zeros can be problematic for standard Poisson or negative binomial models, as they typically assume that zero counts arise from the same process as positive counts. However, in many cases, zeros and positive counts result from distinct processes. For example, in modeling the number of visits to a gym, some people never visit (true zeros), while others visit at varying frequencies. The hurdle model is designed to handle this by splitting the modeling process into two parts: one to handle zero vs. non-zero outcomes and another to model the positive counts.
This tutorial introduces hurdle models in R, where we’ll start with a model overview, explaining how hurdle models work to separate zero and count components. We’ll then cover how to fit hurdle models from scratch to understand their two-part structure, and we’ll use the {pscl} package for efficient model fitting. Finally, we’ll discuss model comparison techniques, such as the likelihood ratio and Vuong tests, to assess fit quality and evaluate prediction performance on test data to validate the model.
Overview
A hurdle model is a type of statistical model used to analyze count data, particularly when the data has an excess number of zeros (zero-inflated data). It is commonly applied in scenarios where there are two separate processes: one that governs whether an event occurs (i.e., whether the count is zero or non-zero) and another that determines the count value once the event occurs.
Key Features of the Hurdle Model
Two-Part Structure:
First part (Hurdle component): This models the probability that the count is greater than zero (i.e., whether the “hurdle” is crossed). Typically, this is done using a binary model, like logistic regression or probit regression.
Second part (Count component): This models the count of events given that the count is greater than zero. Common models for this part include Poisson or negative binomial regression, depending on the distribution of the non-zero counts.
Zero-Inflation Handling:
The hurdle model is specifically designed to address data where zeros are more frequent than would be expected from standard count models, like Poisson or negative binomial models. This makes it useful in contexts like healthcare (e.g., number of hospital visits), economics (e.g., number of purchases), or environmental science (e.g., number of rare species observed).
Example
Consider data on the number of times people visit a doctor in a year. Many people may not visit a doctor at all (zeros), while others visit a variable number of times. In this case: - The first part of the hurdle model would predict the likelihood that someone visits the doctor at least once. - The second part would model the number of visits, conditional on having at least one visit.
Difference from Zero-Inflated Models
Although both hurdle models and zero-inflated models address excess zeros, they are conceptually different: - Hurdle models assume a zero outcome is a result of a distinct process (e.g., some people never visit the doctor). - Zero-inflated models assume there are two types of zeros: “structural zeros” (those who never visit the doctor) and “sampling zeros” (those who could visit but didn’t in a given period).
As we discuss before, the hurdle model consists of two components:
1. Binary Component: A logistic (or binomial) regression that models whether the count is zero or non-zero (i.e., whether the hurdle is crossed).
This part of the model predicts whether the outcome is zero or positive.
\[ P(Y_i = 0 | X) = \frac{1}{1 + \exp(X\beta)} \]
Where:
\(Y_i = 0\) indicates that the count is zero.
- \(X\) represents the predictor variables.
- \(\beta\) represents the coefficients estimated by logistic regression.
This component uses logistic regression to estimate the probability of zeros (whether or not the “hurdle” is crossed).
2. Truncated Count Component (Positive count): Once the hurdle has been crossed (i.e., we have a positive count), the second part of the model predicts the non-zero counts. This part models only the positive counts and ignores the zeros.
For a truncated Poisson regression, the probability mass function is:
To fit a hurdle model manually in R using synthetic data without any R packages, you will implement both the binary and count components of the hurdle model from scratch. Below, I provide a complete example, including data generation, model fitting for the zero and count parts, and predictions.
Data Generation
The counts variable simulates a zero-inflated count variable using Poisson distribution and introduces excess zeros based on a probability.
Code
# Step 1: Generate Synthetic Dataset.seed(42)n <-200# Number of observationsx <-rnorm(n) # Independent variablelambda <-exp(0.5* x) # Rate for Poisson distributionzero_inflation_prob <-0.7# Probability of excess zeros# Generate counts based on the zero-inflated processcounts <-rpois(n, lambda) # Poisson distributed countscounts <-ifelse(runif(n) < zero_inflation_prob, 0, counts) # Introduce excess zeros# Combine into a data framedata <-data.frame(counts = counts, x = x)head(data)
counts x prob_nonzero expected_counts
Min. :0.000 Min. :-2.99309 Min. :0.2665 Min. :0.000
1st Qu.:0.000 1st Qu.:-0.61011 1st Qu.:0.3295 1st Qu.:0.000
Median :0.000 Median :-0.01643 Median :0.3463 Median :0.000
Mean :0.285 Mean :-0.02748 Mean :0.3465 Mean :0.283
3rd Qu.:0.000 3rd Qu.: 0.63363 3rd Qu.:0.3652 3rd Qu.:0.000
Max. :6.000 Max. : 2.70189 Max. :0.4278 Max. :3.228
hurdle_predicted_counts
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1034
3rd Qu.:0.0000
Max. :1.2960
Fit a Hurdle model in R
To fit a hurdle model using the hurdle() function from the {pscl} package in R with the NMES1988 dataset, follow below steps:
Install R-packages
Following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:
We will use physician office visits data set (NMES1988) from AER package. . It represents a sample of 4,406 individuals aged 66 and over who were covered by Medicare in 1988. One of the variables in the data is the number of physician office visits. If we want to create a model for the number of visits using some of the other variables in the dataset, we need to start by loading the data. You may also need to install the AER package.
A data frame containing 4,406 observations on 19 variables. We will use following variables:
visits- Number of physician office visits.
hospital - Number of hospital stays.
health - Factor indicating self-perceived health status, levels are “poor”, “average” (reference category), “excellent”.
chronic - Number of chronic conditions.
age - Age in years (divided by 10).
afam - Factor. Is the individual African-American?
gender - Factor indicating gender.
married- Factor. is the individual married?
school - Number of years of education.
income- Family income in USD 10,000.
employed - Factor. Is the individual employed?
insurance- Factor. Is the individual covered by private insurance?
medicaid Factor - Is the individual covered by Medicaid?
Now, we’ll fit a hurdle model using hurdle() function of {pscl} package. We’ll model the binary part (whether or not the person visited the doctor) and the count part (how many visits, given that they visited). We will use dist = "poisson" for the data with Piosson distribution.
Code
# Fit the hurdle model using the pscl packagefit.hurdle.pois <-hurdle( visits ~ hospital + health + chronic + age + afam + gender + married + school + income + employed + insurance + medicaid,data = train, zero.dist ="binomial",dist ="poisson")# Summary of the hurdle modelsummary(fit.hurdle.pois)
Call:
hurdle(formula = visits ~ hospital + health + chronic + age + afam +
gender + married + school + income + employed + insurance + medicaid,
data = train, dist = "poisson", zero.dist = "binomial")
Pearson residuals:
Min 1Q Median 3Q Max
-4.5649 -1.1383 -0.4630 0.5496 19.3241
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.920082 0.106057 18.104 < 2e-16 ***
hospital 0.152069 0.007313 20.794 < 2e-16 ***
healthpoor 0.214664 0.021576 9.949 < 2e-16 ***
healthexcellent -0.361636 0.038574 -9.375 < 2e-16 ***
chronic 0.091637 0.005860 15.637 < 2e-16 ***
age -0.067319 0.012989 -5.183 2.19e-07 ***
afamyes -0.013614 0.027013 -0.504 0.61428
gendermale -0.057532 0.017816 -3.229 0.00124 **
marriedyes -0.074486 0.018247 -4.082 4.46e-05 ***
school 0.016566 0.002357 7.028 2.09e-12 ***
income -0.001941 0.002886 -0.672 0.50131
employedyes -0.018341 0.027876 -0.658 0.51056
insuranceyes 0.158768 0.024306 6.532 6.48e-11 ***
medicaidyes 0.211406 0.030955 6.830 8.52e-12 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.562364 0.700684 -2.230 0.02576 *
hospital 0.345241 0.115328 2.994 0.00276 **
healthpoor -0.117102 0.190984 -0.613 0.53977
healthexcellent -0.330736 0.169567 -1.950 0.05112 .
chronic 0.587699 0.055765 10.539 < 2e-16 ***
age 0.186709 0.088034 2.121 0.03393 *
afamyes -0.349520 0.154264 -2.266 0.02347 *
gendermale -0.452056 0.115489 -3.914 9.07e-05 ***
marriedyes 0.208622 0.122905 1.697 0.08962 .
school 0.060100 0.014993 4.009 6.11e-05 ***
income 0.046390 0.025866 1.793 0.07290 .
employedyes -0.006166 0.172204 -0.036 0.97144
insuranceyes 0.673297 0.132793 5.070 3.97e-07 ***
medicaidyes 0.452972 0.202256 2.240 0.02512 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 20
Log-likelihood: -1.099e+04 on 28 Df
Alternatively, you could use negative binomial distribution if overdispersion is present. Here use dist = "negbin" if you expect overdispersion in the count data.
Code
# Fit the hurdle model using the pscl packagefit.hurdle.nb <-hurdle( visits ~ hospital + health + chronic + age + afam + gender + married + school + income + employed + insurance + medicaid,data = train, zero.dist ="binomial",dist ="negbin")summary(fit.hurdle.nb )
Call:
hurdle(formula = visits ~ hospital + health + chronic + age + afam +
gender + married + school + income + employed + insurance + medicaid,
data = train, dist = "negbin", zero.dist = "binomial")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1992 -0.7122 -0.2742 0.3281 13.3177
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7525249 0.2597264 6.748 1.50e-11 ***
hospital 0.2097844 0.0254244 8.251 < 2e-16 ***
healthpoor 0.2715317 0.0562376 4.828 1.38e-06 ***
healthexcellent -0.4137847 0.0785668 -5.267 1.39e-07 ***
chronic 0.1127509 0.0148089 7.614 2.66e-14 ***
age -0.0688877 0.0321429 -2.143 0.032100 *
afamyes -0.0264495 0.0652791 -0.405 0.685348
gendermale -0.0520124 0.0428769 -1.213 0.225105
marriedyes -0.1016623 0.0446715 -2.276 0.022859 *
school 0.0181822 0.0054619 3.329 0.000872 ***
income -0.0006644 0.0071340 -0.093 0.925799
employedyes -0.0112664 0.0652961 -0.173 0.863011
insuranceyes 0.1701965 0.0567624 2.998 0.002714 **
medicaidyes 0.2104205 0.0769961 2.733 0.006278 **
Log(theta) 0.3910325 0.0507931 7.699 1.38e-14 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.562364 0.700684 -2.230 0.02576 *
hospital 0.345241 0.115328 2.994 0.00276 **
healthpoor -0.117102 0.190984 -0.613 0.53977
healthexcellent -0.330736 0.169567 -1.950 0.05112 .
chronic 0.587699 0.055765 10.539 < 2e-16 ***
age 0.186709 0.088034 2.121 0.03393 *
afamyes -0.349520 0.154264 -2.266 0.02347 *
gendermale -0.452056 0.115489 -3.914 9.07e-05 ***
marriedyes 0.208622 0.122905 1.697 0.08962 .
school 0.060100 0.014993 4.009 6.11e-05 ***
income 0.046390 0.025866 1.793 0.07290 .
employedyes -0.006166 0.172204 -0.036 0.97144
insuranceyes 0.673297 0.132793 5.070 3.97e-07 ***
medicaidyes 0.452972 0.202256 2.240 0.02512 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.4785
Number of iterations in BFGS optimization: 22
Log-likelihood: -8394 on 29 Df
The binary part of the hurdle model estimates the factors associated with the likelihood of visiting the doctor at all.
The count part estimates the factors associated with the number of doctor visits for those who visited at least once.
You can inspect the coefficients and interpret their significance:
A positive coefficient in the binary part (logistic regression) means the predictor increases the likelihood of visiting the doctor.
A positive coefficient in the count part means the predictor increases the number of visits among those who visit the doctor.
We know that each component of a hurdle model can have different sets of predictors. We can do this in the hurdle(0) function by using “|” in the model formula. For example, let’s say we want to fit the zero hurdle component using only the insurance and gender predictors. We can do that as follows:
Model below fist the count data model (visits regressed on all other variables) conditional on the zero hurdle model (visits regressed on gender and insurance).
Code
# Fit the hurdle model using the pscl packagefit.hurdle.nb.01<-hurdle( visits ~ . | gender + insurance, data = train, zero.dist ="binomial",dist ="negbin")summary(fit.hurdle.nb.01)
Call:
hurdle(formula = visits ~ . | gender + insurance, data = train, dist = "negbin",
zero.dist = "binomial")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1094 -0.7330 -0.2616 0.3502 13.2554
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7525249 0.2597264 6.748 1.50e-11 ***
hospital 0.2097844 0.0254244 8.251 < 2e-16 ***
healthpoor 0.2715317 0.0562376 4.828 1.38e-06 ***
healthexcellent -0.4137847 0.0785668 -5.267 1.39e-07 ***
chronic 0.1127509 0.0148089 7.614 2.66e-14 ***
age -0.0688877 0.0321429 -2.143 0.032100 *
afamyes -0.0264495 0.0652791 -0.405 0.685348
gendermale -0.0520124 0.0428769 -1.213 0.225105
marriedyes -0.1016623 0.0446715 -2.276 0.022859 *
school 0.0181822 0.0054619 3.329 0.000872 ***
income -0.0006644 0.0071340 -0.093 0.925799
employedyes -0.0112664 0.0652961 -0.173 0.863011
insuranceyes 0.1701965 0.0567624 2.998 0.002714 **
medicaidyes 0.2104205 0.0769961 2.733 0.006278 **
Log(theta) 0.3910325 0.0507931 7.699 1.38e-14 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.28807 0.09831 13.102 < 2e-16 ***
gendermale -0.40401 0.10023 -4.031 5.56e-05 ***
insuranceyes 0.76312 0.10774 7.083 1.41e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.4785
Number of iterations in BFGS optimization: 22
Log-likelihood: -8512 on 18 Df
cat("Degrees of Freedom Difference:", df_diff, "\n")
Degrees of Freedom Difference: 1
Code
cat("p-value:", p_value, "\n")
p-value: 0
Vuong test
Code
vuong(fit.hurdle.nb,fit.hurdle.pois)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw 11.86906 model1 > model2 < 2.22e-16
AIC-corrected 11.86906 model1 > model2 < 2.22e-16
BIC-corrected 11.86906 model1 > model2 < 2.22e-16
Prediction Performance
The predict() function will be used to predict the number of diabetes patients the test counties. This will help to validate the accuracy of the these regression models.
Code
# Predictiontest$Pred.hurdle.pois<-predict(fit.hurdle.pois, test, type ="response")test$Pred.hurdle.nb<-predict(fit.hurdle.nb, test, type ="response")# RMSErmse.pois<-Metrics::rmse(test$visits, test$Pred.hurdle.pois)rmse.nb<-Metrics::rmse(test$visits, test$Pred.hurdle.nb)cat("Hurdlermse Poisson:" ,rmse.pois, "\n")
Hurdlermse Poisson: 6.977813
Code
cat("Hurdlermse NB:", rmse.nb, "\n")
Hurdlermse NB: 7.040287
Summary and Conclusions
In this tutorial, we explored the hurdle model in R, a powerful tool for analyzing count data with an excess of zeros, commonly found in healthcare, economics, and other fields. We applied the hurdle model using both Poisson and negative binomial distributions to model two key processes: whether an individual visits the doctor at all and, if they do, how often they visit.
By fitting and interpreting the hurdle model in R, we’ve demonstrated how to address complex zero-inflated count data scenarios, providing a clearer understanding of the relationships between predictors and outcomes. This method can be extended to other real-world problems where excess zeros and over-dispersion are prevalent.