6. Poisson Regression Models

6.1. Standard Poisson Regression (Count data)

6.2. Poisson Regression Model with Offset (Rate data)

6.3. Poisson Regression Models for Overdispersed Data

6.4. Zero-Inflated Models

6.5. Hurdle Model

Introduction to Poisson Regression

Poisson regression is a statistical method that is utilized for modeling count data. It is a powerful technique particularly useful when dealing with data that follows a Poisson distribution, which is a probability distribution that describes the occurrence of events over a fixed time or space interval.

This method is especially useful when the dependent variable represents the total number of occurrences of a specific event within a fixed time or space period. For instance, it can be used to model the number of defects in a manufacturing process, the number of accidents in a factory, or the number of calls received by a customer service center.

Poisson regression is based on the assumption that the number of occurrences of an event is independent of the occurrence of other events and that the rate of occurrence is constant over time or space. It is a valuable tool for identifying factors that influence the frequency of an event and predicting future occurrences of the event.

In Poisson regression, the relationship between the predictor variables (also known as independent variables or features) and the outcome variable (count data) is modeled using the Poisson distribution. The goal is to estimate the parameters of the model that best describe how changes in the predictor variables influence the rate of occurrence of the event.

The Poisson regression model assumes that the mean and variance of the count variable are equal, which is a characteristic of the Poisson distribution. The model is typically expressed as:

\[ \log(\mu) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p \]

Where:

  • \(\mu\) is the mean of the Poisson distribution, which represents the expected count of the event.

  • \(\beta_0, \beta_1, \ldots, \beta_p\) are the coefficients of the predictor variables \(x_1, x_2, \ldots, x_p\) respectively.

  • \(x_1, x_2, \ldots, x_p\) are the predictor variables.

  • \(p\) is the number of predictor variables. The link function used in Poisson regression is the natural logarithm (log) function. This link function ensures that the predicted values are positive, which is suitable for count data.

There are two types of Poisson regression: one that models the counts of events directly, and one that models rates data with an offset/exposure variable.

When directly modeling counts data, Poisson regression considers only the number of events and does not take into account any additional exposure or observation time. On the other hand, when modeling rates data, Poisson regression considers an offset variable, which represents the time, area, population or other measure over which the events are observed. This offset variable adjusts for the differing amounts of exposure in different observations. In epidemiological studies, the offset variable might represent the natural logarithm of the person-time at risk for each observation. This allows the model to estimate the rate of occurrence of events (e.g., disease cases) per unit of person-time, rather than simply the count of events.

The difference between Poisson regression with counts and rates data:

Poisson Regression with Counts Data:

  • In this scenario, the outcome variable represents the raw counts of events occurring within a fixed unit of observation (e.g., number of accidents per day).

  • The predictor variables are typically factors that influence the count of events (e.g., time of day, weather conditions, location).

  • The interpretation of coefficients in this case relates to the effect of predictors on the count of events directly.

Poisson Regression with Rates :

  • Here, the outcome variable represents the count of events normalized by some measure of exposure, such as time, area, or population size (e.g., number of accidents per 1,000 miles driven, number of infections per 100,000 population).

  • The predictor variables still represent factors that influence the count of events, but now the outcome is adjusted for the measure of exposure.

  • Coefficients in this case are interpreted as the effect of predictors on the rate of events (i.e., the number of events per unit of exposure).

Poisson regression can be extended to handle overdispersion (when the variance of the count data is greater than the mean), through methods such as quasi-Poisson regression or negative binomial regression. These extensions allow for more flexibility in modeling count data when the assumption of equidispersion (equal mean and variance) is violated.

While standard Poisson regression is the most basic form, several variations have been developed to handle specific data characteristics, such as overdispersion and excess zeros.

Types of Poisson regression models

Poisson regression models are commonly used for count data, where the response variable represents counts of events. Here’s an overview of different types of Poisson regression models:

1. Standard Poisson Regression

  • Purpose: To model count data as a function of predictor variables.
  • Assumption: Assumes the response variable follows a Poisson distribution, with the mean equal to the variance. The mean is modeled as a log-linear function of the predictors.
  • Use Case: When counts are non-negative integers (e.g., number of calls to a call center, number of hospital admissions) and variance is approximately equal to the mean.

2. Poisson Regression with Offset

  • Purpose: When directly modeling counts data, Poisson regression considers only the number of events and does not take into account any additional exposure or observation time. On the other hand, when modeling rates data, Poisson regression considers an offset variable, which represents the time, area, population or other measure over which the events are observed. T
  • Description: Used when the response variable is a rate (e.g., the number of events per unit of time or space). An offset term is added to adjust for the exposure variable.
  • Use case: If you are comparing the number of accidents across different cities, you may want to adjust for the population size of each city as a measure of exposure. This will ensure that the comparison is fair and that you are not simply comparing the raw numbers of accidents, which could be skewed by differences in the size of each city.

3. Quasi-Poisson Regression

  • Purpose: Another approach to handling overdispersion in Poisson models.
  • Model Form: Similar to standard Poisson regression but estimates an extra dispersion parameter.
  • Use Case: When overdispersion is present, and the goal is to adjust standard errors without changing the model structure.

4. Negative Binomial Regression

  • Purpose: A more robust alternative to Poisson regression that explicitly models overdispersion.
  • Model Form: Extends Poisson regression by introducing an additional parameter to account for overdispersion.
  • Use Case: When overdispersion is significant, as it’s often more reliable than a quasi-Poisson model. Commonly used in areas like ecology and epidemiology, where count data frequently exhibit overdispersion.

5. Zero-Inflated Poisson (ZIP) Regression

  • Purpose: To handle excess zeros in count data, which can lead to poor performance for standard Poisson models.
  • Model Form: Combines a Poisson model with a separate process that generates zero counts. Uses a mixture of two processes: (1) a count process (Poisson) and (2) a point mass at zero.
  • Use Case: When there are more zero counts than expected. For example, data on the number of insurance claims might include a high number of people with zero claims.

6. Zero-Inflated Negative Binomial (ZINB) Regression

  • Purpose: A variant of the zero-inflated model that uses a Negative Binomial distribution to handle both overdispersion and excess zeros.
  • Model Form: Similar to ZIP but allows for overdispersion in addition to excess zeros.
  • Use Case: Ideal for highly overdispersed data with many zeros, such as customer purchase counts where many individuals make no purchases at all.

7. Hurdle Poisson Model

  • Purpose: Similar to zero-inflated models, but assumes a two-part process: one for zeros and another for positive counts.
  • Model Form: Separates the zero and positive counts into two models. The first model (binary) predicts the probability of zero vs. non-zero counts. The second model (truncated Poisson or Negative Binomial) models positive counts only.
  • Use Case: When zeros and positive counts need to be modeled separately, such as in consumer behavior where some people never buy a product, while others buy varying amounts.

Each of these models has specific applications based on the properties of the count data, particularly concerning dispersion and zero counts. Standard Poisson regression is useful for basic count data, while other models address overdispersion and zero inflation, which are common in many practical settings.

Summary of Poisson Regression Variants:

Model Handles Overdispersion Handles Excess Zeros Used for
Standard Poisson No No Basic count data
Poisson with Offset No No Rates (counts per exposure)
Zero-Inflated Poisson (ZIP) No Yes Count data with excess zeros
Negative Binomial Yes No Overdispersed count data
Hurdle Model Yes Yes Count data with excess zeros and overdispersion
Quasi-Poisson Yes No Overdispersed count data (simpler than NB)
Generalized Poisson Yes (and underdispersion) No Both overdispersed and underdispersed data

These models help address the limitations of standard Poisson regression in various practical situations, providing more accurate results when data deviate from the assumptions of the basic Poisson model.

References

  1. Chapter 4 Poisson Regression

  2. Poisson Regression

  3. Poisson Regression / Regression of Counts: Definition

Books

  1. “Modeling Count Data”Joseph M. Hilbe
    • A focused discussion on modeling count data, including Poisson regression.
  2. “Zero-Inflated Models and Generalized Linear Mixed Models with R”Alain F. Zuur and Elena N. Ieno
    • Covers Zero-Inflated Poisson models and generalized mixed models.
  3. “Count Data Models with R”John M. Hilbe
    • A detailed guide to count data models, including Hurdle and Zero-Inflated models.
  4. “Statistical Methods for Rates and Proportions”Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik
    • Includes discussions on Poisson and binomial regression methods.
  5. “Introduction to Probability Models”Sheldon M. Ross
    • Covers the Poisson process and other probability models.