6.3. Poisson Regression Models for Overdispersed Data

When analyzing count data, you may encounter overdispersion, which occurs when the observed variability in the data is greater than what a standard Poisson model predicts. Overdispersion often arises in real-world datasets because natural processes introduce more variability than a simple Poisson model can accommodate. Ignoring overdispersion can lead to underestimated standard errors, unreliable confidence intervals, and misleading conclusions.

In this tutorial, we will explore methods for modeling overdispersed count data in R, focusing on approaches such as quasi-Poisson and negative binomial models, which can manage the extra variability in the data. We will begin by discussing the implications of overdispersion, then move on to fitting models from scratch and using R’s glm() function. After that, we’ll examine model diagnostics to detect overdispersion, evaluate the models to ensure a proper fit, and interpret incidence rate ratios (IRRs).

Overview

Poisson regression is used to model count data where the response variable represents the number of occurrences of an event. A key assumption of the Poisson model is that the mean and the variance of the response variable are equal:

\[ \text{E}(Y) = \text{Var}(Y) = \mu \]

Where \(Y\) is the count response variable and \(\mu\) is the mean (expected value) of the counts. In many real-world datasets, however, this assumption is violated. When the variance is greater than the mean, the data exhibit overdispersion. Overdispersion can lead to underestimated standard errors and, as a result, inflated Type I error rates (i.e., falsely concluding that a predictor is significant).
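
As a quick illustration of the equidispersion assumption, we can simulate draws from a Poisson distribution and confirm that the sample mean and variance are nearly equal (a minimal sketch; the rate \(\lambda = 4\) is arbitrary):

Code
# Simulate Poisson counts and compare the sample mean and variance
set.seed(123)
y_sim <- rpois(100000, lambda = 4)
mean(y_sim)  # approximately 4
var(y_sim)   # also approximately 4: equidispersion holds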

Causes of Overdispersion

Overdispersion in count data can occur for several reasons, including:

  1. Unobserved heterogeneity: There might be additional variables or factors influencing the count data that are not included in the model.

  2. Zero-inflation: There might be an excess of zeros in the data compared to what the Poisson model predicts.

  3. Clustering: Observations within clusters may be more similar than those across clusters, leading to greater variability than assumed.

Identifying Overdispersion

  • Mean-variance comparison: if the sample variance of the counts is much greater than the sample mean, overdispersion is likely.

  • Deviance: If the residual deviance is much larger than its degrees of freedom, it suggests overdispersion. We can estimate the dispersion parameter \(\phi\) by dividing the model deviance by its corresponding degrees of freedom; i.e.,

\[ \hat{\phi} = \frac{\text{Deviance}}{n-p} \]

where \(n\) is the number of observations and \(p\) is the number of estimated parameters.

  • If this statistic is significantly greater than 1, it suggests the presence of overdispersion.
  • You can also examine the Pearson Chi-Square statistic divided by the degrees of freedom as another indicator of overdispersion.
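
Both indicators are easy to compute from a fitted glm object in R. A minimal sketch, assuming a Poisson model fitted with glm() and stored in an object named pois_fit (a hypothetical name at this point):

Code
# Deviance-based estimate of the dispersion parameter
phi_dev <- deviance(pois_fit) / df.residual(pois_fit)

# Pearson-based estimate: sum of squared Pearson residuals over residual df
phi_pearson <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

c(deviance = phi_dev, pearson = phi_pearson)  # values well above 1 suggest overdispersion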

Handling Overdispersion in Poisson Regression

When overdispersion is present, Poisson regression is not appropriate, and alternative models are needed. There are several approaches to address overdispersion:

  1. Quasi-Poisson Regression: A simple extension where the Poisson model is adjusted to allow for overdispersion by adding a dispersion parameter.
  2. Negative Binomial Regression: A more flexible model that explicitly accounts for overdispersion by allowing the variance to differ from the mean.
  3. Zero-Inflated Models: Useful when the overdispersion is caused by excess zeros in the data (see the brief sketch below).
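
The first two approaches are developed in detail later in this tutorial. Zero-inflated models are not covered further, but for reference, here is a hedged sketch using the zeroinfl() function of the {pscl} package; the part of the formula before the bar models the counts and the part after it models the excess zeros, and the data frame and variable names here are hypothetical:

Code
# Sketch only: zero-inflated Poisson (install.packages("pscl") if needed)
library(pscl)

# Count component: y ~ x1 + x2; zero-inflation component: ~ x1
fit.zip <- zeroinfl(y ~ x1 + x2 | x1, data = my_data, dist = "poisson")
summary(fit.zip)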

Quasi-Poisson model

Quasi-Poisson regression addresses overdispersion by allowing the variance to be a multiple of the mean, rather than strictly equal to it as in traditional Poisson regression. This multiple, known as the dispersion parameter, is estimated from the data. By relaxing the assumption of equal mean and variance, quasi-Poisson regression provides a more flexible model that better fits overdispersed count data.

The quasi-Poisson model is a generalized linear model for count data in which the variance of the response variable is assumed to be proportional to its mean.

Let’s denote the response variable as \(Y\), the predictor variables as \(X_1, X_2, ..., X_p\), and the parameters of the model as \(\beta_0, \beta_1, ..., \beta_p\). The model can be represented as:

\[ Y \sim \text{Poisson}(\lambda) \]

where \(\lambda\) is the mean parameter of the Poisson distribution, given by:

\[ \lambda = \exp(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p) \]

In the quasi-Poisson model, the variance of \(Y\) is allowed to be a multiple \(\phi\) of the mean, so the variance function is given by:

\[ \text{Var}(Y) = \phi \lambda \]

where \(\phi\) is the dispersion parameter. When \(\phi = 1\), the quasi-Poisson model reduces to a standard Poisson model. When \(\phi > 1\), it indicates overdispersion, and when \(\phi < 1\), it indicates underdispersion.

The estimation of the parameters \(\beta_0, \beta_1, ..., \beta_p\) in a quasi-Poisson model is typically done by maximizing the Poisson likelihood; the resulting point estimates coincide with the Poisson maximum likelihood estimates (MLEs). Maximum likelihood estimation chooses the parameter values that maximize the likelihood function, i.e., the probability of observing the given data under the model. The dispersion parameter \(\phi\) is then estimated separately from the data.

Quasi-Poisson regression is particularly useful for count data where unobserved or unaccounted-for sources of variation lead to greater variability than expected under a strict Poisson distribution. It is commonly used in fields such as epidemiology, ecology, and the social sciences, where count data are prevalent.

Below is a summary of the comparison between Poisson inference, which assumes no overdispersion, and quasi-Poisson inference, which accounts for overdispersion, in terms of tests and confidence intervals.

Comparison of Poisson and quasi-Poisson inference. (Source: Paul Roback and Julie Legler, 2021, Chapter 4 Poisson Regression. Beyond Multiple Linear Regression).
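
In practice, the contrast summarized in that table comes down to this: both fits give identical coefficient estimates, but the quasi-Poisson fit scales each standard error by \(\sqrt{\hat{\phi}}\) and bases tests and intervals on the \(t\) distribution rather than the normal. A minimal sketch, assuming a count response y and a predictor x1 in a data frame dat (hypothetical names):

Code
# Same coefficients, different inference
fit_p  <- glm(y ~ x1, family = poisson(link = "log"), data = dat)
fit_qp <- glm(y ~ x1, family = quasipoisson(link = "log"), data = dat)

coef(fit_p) - coef(fit_qp)    # identical point estimates (differences ~ 0)
summary(fit_p)$coefficients   # z tests, unscaled standard errors
summary(fit_qp)$coefficients  # t tests, standard errors scaled by sqrt(phi-hat)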

To fit a Quasi-Poisson model manually without using the glm() function in R, you can follow the steps below:

  • Create a sample dataset with overdispersion
  • Fit a Poisson model
  • Estimate overdispersion parameter
  • Adjusting Standard Errors, t-tests and Confidence Intervals

Generate Sample Data

To create overdispersed count data for a Poisson regression model with 4 covariates and an offset, we can start with a basic Poisson model, then introduce extra variability by adding random noise from a Gamma distribution (often used to simulate overdispersion).

Code
# set seed
set.seed(42)  # for reproducibility

# Number of observations
n <- 500

# Generate covariates
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)

# Generate offset variable (positive values, often log-transformed)
offset_var <- log(1 + abs(rnorm(n)))

# Define the true coefficients for each covariate and the intercept
beta <- c(0.5, -0.3, 0.2, 0.1, 0.7)  # intercept and 4 coefficients

# Calculate the linear predictor (eta) and expected Poisson mean
eta <- beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3 + beta[5] * x4 + offset_var
mu <- exp(eta)  # Poisson mean

# Add overdispersion by using a Gamma distribution to introduce extra variation
# Gamma mean = 1, shape parameter controls the level of overdispersion
# Smaller shape (e.g., < 1) introduces more overdispersion
shape_param <- 0.5
overdispersion_noise <- rgamma(n, shape = shape_param, scale = 1 / shape_param)

# Multiply the Poisson mean by the overdispersion noise to get the final mean
y_overdispersed <- rpois(n, lambda = mu * overdispersion_noise)

# Compile into a data frame
data <- data.frame(y = y_overdispersed, x1 = x1, x2 = x2, x3 = x3, x4 = x4, offset_var = offset_var)

# View the structure and first few rows of the data to check
head(data)
  y         x1           x2         x3         x4 offset_var
1 0  1.3709584  1.029140719  2.3250585 -0.6013830  0.2236059
2 3 -0.5646982  0.914774868  0.5241222 -0.1358161  0.2452369
3 0  0.3631284 -0.002456267  0.9707334 -0.9872728  1.0023714
4 0  0.6328626  0.136009552  0.3769734  0.8319250  1.1008448
5 0  0.4042683 -0.720153545 -0.9959334 -0.7950595  0.8293412
6 0 -0.1061245 -0.198124330 -0.5974829  0.3404646  0.3117683

Fit a Poisson Model with Offset

Define the Poisson log-likelihood function and maximize it with optim() by minimizing the negative log-likelihood:

Code
# Log-likelihood function for Poisson regression with offset
poisson_log_likelihood_offset <- function(params) {
  beta0 <- params[1]
  beta1 <- params[2]
  beta2 <- params[3]
  beta3 <- params[4]
  beta4 <- params[5]
  
  # Compute lambda (expected counts)
  lambda <- exp(beta0 + beta1 * data$x1 + beta2 * data$x2 + beta3 * data$x3 + beta4 * data$x4 + data$offset_var)
  
  # Log-likelihood for Poisson with offset
  log_likelihood <- sum(dpois(data$y, lambda, log = TRUE))
  
  return(-log_likelihood)  # Negative for minimization
}

# Initial guesses for beta coefficients
initial_params <- rep(0, 5)

# Use optim to find MLEs for the coefficients
fit <- optim(par = initial_params, fn = poisson_log_likelihood_offset, hessian = TRUE)

# Extract parameter estimates and standard errors
coefficients <- fit$par
std_errors <- sqrt(diag(solve(fit$hessian)))

# Calculate Z-scores and p-values
z_scores <- coefficients / std_errors
p_values <- 2 * (1 - pnorm(abs(z_scores)))

# Create summary table
summary_table <- data.frame(
  Coefficient = coefficients,
  Std_Error = std_errors,
  Z_value = z_scores,
  P_value = p_values
)

# Assign row names for clarity
row.names(summary_table) <- c("Intercept", "x1", "x2", "x3", "x4")

# Display summary table
print(summary_table)
          Coefficient  Std_Error    Z_value      P_value
Intercept   0.4648224 0.02897226  16.043705 0.000000e+00
x1         -0.3138090 0.02326024 -13.491220 0.000000e+00
x2          0.2589814 0.02190686  11.821935 0.000000e+00
x3          0.1763499 0.02448229   7.203165 5.884182e-13
x4          0.6133713 0.02316479  26.478603 0.000000e+00

Estimate Overdispersion Parameter

In a Quasi-Poisson model, overdispersion can be accounted for by using the estimated dispersion parameter to adjust model fit statistics, including the standard errors of the coefficient estimates, hypothesis tests, and other inferential measures. Here’s a detailed explanation of how this works:

Code
# parameter estimates
estimates <- fit$par
names(estimates) <- c("beta0", "beta1", "beta2", "beta3", "beta4")

# Calculate fitted values
lambda_hat <- exp(estimates[1] + estimates[2] * data$x1 + estimates[3] * data$x2 +
                  estimates[4] * data$x3 + estimates[5] * data$x4 + data$offset_var)


# Calculate Pearson residuals
pearson_residuals <- (data$y - lambda_hat) / sqrt(lambda_hat)

# Calculate dispersion statistic
dispersion_statistic <- sum(pearson_residuals^2) / (n - length(estimates))

# Display dispersion statistic
cat("Dispersion Statistic:", dispersion_statistic, "\n")
Dispersion Statistic: 8.619667 
Code
# If the dispersion statistic is much greater than 1, overdispersion is present
if (dispersion_statistic > 1.2) {
  cat("Warning: Evidence of overdispersion\n")
} else {
  cat("No evidence of overdispersion\n")
}
Warning: Evidence of overdispersion

Adjusting Standard Errors, t-tests and Confidence Intervals

When overdispersion is present, the standard errors calculated in a Poisson model underestimate the variability in the data. In the Quasi-Poisson approach, the standard errors are adjusted by multiplying by the square root of the dispersion parameter, \(\phi\) (also called the quasi-likelihood variance or scale parameter).

If the dispersion parameter, \(\phi\), is estimated as:

\[ \phi = \frac{\sum (y_i - \hat{y}_i)^2 / \hat{y}_i}{n - p} \]

where:

  • \(y_i\) is the observed response,
  • \(\hat{y}_i\) is the fitted response,
  • \(n\) is the number of observations, and
  • \(p\) is the number of estimated parameters in the model.

Then, the adjusted standard error of each coefficient, \(SE_{\text{adjusted}}\), is calculated as:

\[ SE_{\text{adjusted}} = SE_{\text{Poisson}} \times \sqrt{\phi}\]

where \(SE_{\text{Poisson}}\) is the standard error from the original Poisson model.

With the adjusted standard errors, hypothesis tests for the coefficients (using \(t\)-tests) and confidence intervals are also adjusted. For each coefficient \(\beta_j\):

  • Adjusted \(t\)-statistic: The \(t\)-statistic is calculated as \(t_j = \frac{\hat{\beta_j}}{SE_{\text{adjusted}}}\).

  • Confidence Intervals: Adjusted confidence intervals are calculated as:

    \[ \text{CI}_{j} = \hat{\beta_j} \pm t_{\alpha/2, \, \text{df}} \times SE_{\text{adjusted}} \]

where \(t_{\alpha/2, \, \text{df}}\) is the critical \(t\)-value for the desired confidence level and degrees of freedom.

Adjusting Model Fit Statistics (e.g., Deviance and Residual Deviance)

For model comparison, such as comparing deviances between models or calculating residual deviance:

  • The Quasi-Poisson deviance can be approximated as the residual deviance from the Poisson model divided by the estimated dispersion, \(\phi\):

    \[ \text{Quasi-Poisson Deviance} = \frac{\text{Poisson Residual Deviance}}{\phi} \]
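
A minimal sketch of this scaling, reusing the fitted means lambda_hat and the dispersion statistic dispersion_statistic computed earlier (both must already exist in the session):

Code
# Poisson residual deviance from the manual fit
# (convention: the y * log(y / mu) term is 0 when y = 0)
dev_terms <- ifelse(data$y == 0, 0, data$y * log(data$y / lambda_hat)) - (data$y - lambda_hat)
poisson_deviance <- 2 * sum(dev_terms)

# Scale by the estimated dispersion to approximate the quasi-Poisson deviance
quasi_deviance <- poisson_deviance / dispersion_statistic
cat("Poisson deviance:", poisson_deviance, "\nQuasi-Poisson deviance:", quasi_deviance, "\n")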

We can now manually adjust the standard errors, hypothesis tests, and confidence intervals.

Code
# Poisson model coefficients and standard errors
poisson_coef <- fit$par
poisson_se <- sqrt(diag(solve(fit$hessian)))

# Adjust standard errors by the square root of dispersion
adjusted_se <- poisson_se * sqrt(dispersion_statistic)

# Adjusted t-values
t_values <- poisson_coef / adjusted_se

# Confidence intervals
alpha <- 0.05
critical_value <- qt(1 - alpha / 2, df = n - length(poisson_coef))
conf_intervals <- cbind(poisson_coef - critical_value * adjusted_se, 
                        poisson_coef + critical_value * adjusted_se)

# Output results
data.frame(Coefficients = poisson_coef, 
           `Adjusted SE` = adjusted_se, 
           `t-value` = t_values,
           `CI Lower` = conf_intervals[,1],
           `CI Upper` = conf_intervals[,2])
  Coefficients Adjusted.SE   t.value    CI.Lower   CI.Upper
1    0.4648224  0.08506044  5.464613  0.29769836  0.6319464
2   -0.3138090  0.06829036 -4.595217 -0.44798370 -0.1796343
3    0.2589814  0.06431694  4.026645  0.13261359  0.3853493
4    0.1763499  0.07187820  2.453455  0.03512594  0.3175739
5    0.6133713  0.06801014  9.018822  0.47974716  0.7469954

This table provides adjusted standard errors, \(t\)-values, and confidence intervals for the coefficients, accounting for overdispersion in the Quasi-Poisson model.

Negative Binomial Regression model

When dealing with overdispersion, one alternative is to replace the Poisson distribution with a negative binomial model. This approach introduces a second parameter in addition to the mean parameter \(\lambda\), which gives the model the flexibility to handle a wider range of count data. In particular, the negative binomial model can accommodate situations where the variance is greater than the mean, which is often the case with count data.

In contrast to the quasi-Poisson model, the negative binomial model has a fully specified likelihood function that explicitly models the extra variation in the data. Negative binomial random variables are discrete and take only non-negative integer values, which makes them natural for count data. The model posits that a \(\lambda\) value is drawn for each observation, which is then used to generate a count from a Poisson random variable with that \(\lambda\).

It relaxes the assumption that the variance equals the mean, allowing the variance to exceed the mean by a function of a dispersion parameter \(\theta\):

\[ \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} \]

Where:

  • \(\theta\) is the dispersion parameter, estimated from the data.
  • As \(\theta \to \infty\), the model reduces to the standard Poisson regression.

The advantage of the negative binomial model is that it handles overdispersed data more effectively than a Poisson model alone: it accommodates a wider range of variance-to-mean ratios and can therefore fit such data more accurately. Under this mixture construction, the counts are more dispersed than would be expected from a single Poisson variable with rate \(\lambda\).
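
This Poisson-Gamma mixture interpretation can be checked by simulation: drawing a Gamma-distributed rate for each observation and then a Poisson count given that rate reproduces the negative binomial distribution generated directly by rnbinom() with size = \(\theta\). A minimal sketch (the values \(\theta = 2\) and \(\mu = 5\) are arbitrary):

Code
set.seed(1)
n_sim <- 100000; theta <- 2; mu <- 5

# Route 1: Poisson counts with a Gamma-distributed rate (mean mu, shape theta)
lambda_i <- rgamma(n_sim, shape = theta, scale = mu / theta)
y_mix <- rpois(n_sim, lambda = lambda_i)

# Route 2: direct negative binomial draws
y_nb <- rnbinom(n_sim, size = theta, mu = mu)

# Both variances should be near mu + mu^2 / theta = 5 + 12.5 = 17.5
c(var(y_mix), var(y_nb), mu + mu^2 / theta)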

For the Negative Binomial (NB) model, we need to estimate both the regression coefficients and the dispersion parameter \(\theta\) simultaneously, so we define the log-likelihood function accordingly. The R code below fits an NB model without any additional R package:

Code
# 1. Define the Negative Binomial Log-Likelihood Function
neg_binom_loglik <- function(params) {
  intercept <- params[1]
  beta <- params[2:5]
  theta <- exp(params[6])  # Dispersion parameter (exp to ensure positivity)
  
  linear_predictor <- intercept + beta[1] * data$x1 + beta[2] * data$x2 + beta[3] * data$x3 + beta[4] * data$x4 + data$offset_var
  mu <- exp(linear_predictor)  # Expected count
  
  # Negative Binomial log-likelihood
  ll <- lgamma(data$y + theta) - lgamma(theta) - lgamma(data$y + 1) +
        theta * log(theta) + data$y * log(mu) - (data$y + theta) * log(mu + theta)
  
  -sum(ll)  # Return the negative log-likelihood for minimization
}

# 2. Initial Parameter Estimates
init_params <- c(rep(0, 5), log(1))  # Starting values for intercept, betas, and log(theta)

# 3. Optimize the Negative Binomial Log-Likelihood
fit <- optim(par = init_params, fn = neg_binom_loglik, method = "BFGS", hessian = TRUE)
coef_estimates <- fit$par[1:5]  # Coefficients
theta_estimate <- exp(fit$par[6])  # Dispersion parameter

# Display results
cat("Coefficient Estimates (Intercept, x1, x2, x3, x4):\n", coef_estimates, "\n")
Coefficient Estimates (Intercept, x1, x2, x3, x4):
 0.4484757 -0.3681213 0.2171373 0.2306977 0.5874186 
Code
cat("Dispersion Parameter (Theta):\n", theta_estimate, "\n")
Dispersion Parameter (Theta):
 0.4156591 
Code
# 4. Calculate Standard Errors
# The standard errors are obtained from the Hessian (inverse of the matrix of second derivatives)
hessian_matrix <- fit$hessian
cov_matrix <- solve(hessian_matrix)  # Variance-covariance matrix
standard_errors <- sqrt(diag(cov_matrix))

# Display summary statistics
summary_stats <- data.frame(
  Coefficients = coef_estimates,
  `Std. Error` = standard_errors[1:5],
  `z value` = coef_estimates / standard_errors[1:5],
  `Pr(>|z|)` = 2 * pnorm(-abs(coef_estimates / standard_errors[1:5]))
)
theta_se <- standard_errors[6]  # Standard error for theta

cat("Summary Statistics for Negative Binomial Model:\n")
Summary Statistics for Negative Binomial Model:
Code
print(summary_stats)
  Coefficients Std..Error   z.value     Pr...z..
1    0.4484757 0.07638828  5.871002 4.331698e-09
2   -0.3681213 0.08596749 -4.282099 1.851387e-05
3    0.2171373 0.07345719  2.955970 3.116874e-03
4    0.2306977 0.07856487  2.936398 3.320479e-03
5    0.5874186 0.07566569  7.763341 8.272094e-15
Code
cat("Dispersion Parameter (Theta) Standard Error:\n", theta_se, "\n")
Dispersion Parameter (Theta) Standard Error:
 0.08914934 

Poisson Model With Overdispersion in R

In this section, we will fit a Poisson regression model to count data with overdispersion in R. We will use the glm() function to fit the model and then check for overdispersion using diagnostic tests. If overdispersion is present, we will explore alternative models such as the quasi-Poisson and negative binomial models to handle the extra variability in the data.

Install Required R Packages

The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:

Code
packages <- c('tidyverse',
              'plyr',
              'DataExplorer',
              'dlookr',
              'rstatix',
              'gtsummary',
              'performance',
              'jtools',
              'margins',
              'marginaleffects',
              'ggeffects',
              'patchwork',
              'Metrics',
              'ggpmisc',
              'epiDisplay',
              'sandwich',
              'AER')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R-packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:AER"             "package:lmtest"         
 [3] "package:zoo"             "package:car"            
 [5] "package:carData"         "package:sandwich"       
 [7] "package:epiDisplay"      "package:nnet"           
 [9] "package:MASS"            "package:survival"       
[11] "package:foreign"         "package:ggpmisc"        
[13] "package:ggpp"            "package:Metrics"        
[15] "package:patchwork"       "package:ggeffects"      
[17] "package:marginaleffects" "package:margins"        
[19] "package:jtools"          "package:performance"    
[21] "package:gtsummary"       "package:rstatix"        
[23] "package:dlookr"          "package:DataExplorer"   
[25] "package:plyr"            "package:lubridate"      
[27] "package:forcats"         "package:stringr"        
[29] "package:dplyr"           "package:purrr"          
[31] "package:readr"           "package:tidyr"          
[33] "package:tibble"          "package:ggplot2"        
[35] "package:tidyverse"       "package:stats"          
[37] "package:graphics"        "package:grDevices"      
[39] "package:utils"           "package:datasets"       
[41] "package:methods"         "package:base"           

The county-level age-adjusted number and rate of diabetes patients, along with the prevalence of obesity, physical inactivity, and the food environment index for the years 2016-2020, were obtained from the United States Diabetes Surveillance System (USDSS).

We will use the read_csv() function of the {readr} package to import the data as a tibble.

Code
# load data
mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/county_data_2016_2020.csv") 
Rows: 3107 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): State, County, Urban_Rural
dbl (12): FIPS, X, Y, POP_Total, Diabetes_count, Diabetes_per, Obesity, Acce...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# select variables
df<-mf |> 
  dplyr::select(Diabetes_per,
                POP_Total,
                Obesity,
                Physical_Inactivity, 
                Access_Excercise,
                Food_Env_Index,
                SVI,
                Urban_Rural 
                ) |> 
  glimpse()
Rows: 3,107
Columns: 8
$ Diabetes_per        <dbl> 9.24, 8.48, 11.72, 10.08, 10.26, 9.06, 11.80, 13.2…
$ POP_Total           <dbl> 55707.0, 218346.8, 25078.2, 22448.2, 57852.4, 1017…
$ Obesity             <dbl> 29.22, 28.94, 29.34, 29.44, 30.10, 19.86, 30.38, 3…
$ Physical_Inactivity <dbl> 26.42, 22.86, 23.72, 25.38, 24.76, 18.58, 28.66, 2…
$ Access_Excercise    <dbl> 70.8, 72.2, 49.8, 30.6, 24.6, 19.6, 48.0, 51.4, 62…
$ Food_Env_Index      <dbl> 6.9, 7.7, 5.5, 7.6, 8.1, 4.3, 6.5, 6.3, 6.4, 7.7, …
$ SVI                 <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.8310, 0.…
$ Urban_Rural         <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Rura…
Code
# data processing
df$Diabetes_per <- as.integer(df$Diabetes_per)
df$Urban_Rural <- as.factor(df$Urban_Rural)
df$log_POP_Total <- log(df$POP_Total)

# split data into training (70%) and test (30%) sets, stratified by Urban_Rural
tr_prop <- 0.70
train <- ddply(df, .(Urban_Rural),
               function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test <- ddply(df, .(Urban_Rural),
              function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)

Check Mean and Variance

To check for overdispersion or underdispersion in count data, you can compare the variance and mean of the response variable (i.e., the count data). Here’s how to check for overdispersion and underdispersion:

  • Poisson Distribution Assumption: In a standard Poisson model, the mean and variance of the count variable are equal. That is:

    \[ E(Y) = \lambda, \quad \text{Var}(Y) = \lambda \]

  • Overdispersion: Occurs when the variance is greater than the mean: \[ \text{Var}(Y) > E(Y) \]

  • Underdispersion: Occurs when the variance is smaller than the mean: \[ \text{Var}(Y) < E(Y) \]

Code
# Check the mean and variance
mean_count <- mean(mf$Diabetes_per)
var_count <- var(mf$Diabetes_per)

cat("Mean:", mean_count, "\n")
Mean: 8.596611 
Code
cat("Variance:", var_count, "\n")
Variance: 2.456342 
Code
# Compare mean and variance
if (var_count > mean_count) {
  cat("Overdispersion is present (Variance > Mean).\n")
} else if (var_count < mean_count) {
  cat("Underdispersion is present (Variance < Mean).\n")
} else {
  cat("No dispersion (Variance ≈ Mean).\n")
}
Underdispersion is present (Variance < Mean).

Fit a Poisson Model and Check Overdispersion

First we will fit a Poisson model with an offset and then check for overdispersion. Note that the simple mean-variance comparison above is only a rough marginal screen; the formal tests below assess dispersion conditional on the covariates and the population offset in the fitted model.

Code
fit.pois <- glm(
            Diabetes_per  ~ 
                Obesity +
                Physical_Inactivity +
                Access_Excercise +
                Food_Env_Index +
                SVI +
                Urban_Rural+
                offset(log_POP_Total),
                family = poisson(link = "log"),
                data = train)

Check overdispersion with the dispersiontest() function of the {AER} package:

Code
# Perform the dispersion test
dispersiontest(fit.pois)

    Overdispersion test

data:  fit.pois
z = 20.558, p-value < 2.2e-16
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  8.519496 

Check for overdispersion using the {performance} package:

Code
performance::check_overdispersion(fit.pois)
# Overdispersion test

       dispersion ratio =     9.842
  Pearson's Chi-Squared = 21318.216
                p-value =   < 0.001
Overdispersion detected.

Fit a Quasi-Poisson Model

We will fit a quasi-Poisson regression model using the glm() function. In the family argument, we specify quasi with the link function "log" and the variance function "mu", indicating a logarithmic link and a variance proportional to the mean (quasi-Poisson).

Code
fit.qpois <- glm(
            Diabetes_per  ~ 
                Obesity +
                Physical_Inactivity +
                Access_Excercise +
                Food_Env_Index +
                SVI +
                Urban_Rural+
                offset(log_POP_Total),
                family = quasi(link = "log", variance = "mu"),
                data = train)

Model Summary

Code
summary(fit.qpois)

Call:
glm(formula = Diabetes_per ~ Obesity + Physical_Inactivity + 
    Access_Excercise + Food_Env_Index + SVI + Urban_Rural + offset(log_POP_Total), 
    family = quasi(link = "log", variance = "mu"), data = train)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -2.113798   0.276970  -7.632 3.44e-14 ***
Obesity             -0.024589   0.007997  -3.075  0.00213 ** 
Physical_Inactivity  0.023153   0.009059   2.556  0.01067 *  
Access_Excercise    -0.032914   0.001334 -24.681  < 2e-16 ***
Food_Env_Index      -0.383530   0.029474 -13.012  < 2e-16 ***
SVI                 -1.692113   0.109572 -15.443  < 2e-16 ***
Urban_RuralUrban    -1.063524   0.062451 -17.030  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasi family taken to be 9.842974)

    Null deviance: 42667  on 2172  degrees of freedom
Residual deviance: 13198  on 2166  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
Code
jtools::summ(fit.qpois)
Observations: 2173
Dependent variable: Diabetes_per
Type: Generalized linear model; Family: quasi; Link: log
𝛘²(6) = 29468.92, p = 0.00
Pseudo-R² (Cragg-Uhler) = NA; Pseudo-R² (McFadden) = NA
AIC = NA; BIC = NA

                      Est.  S.E. t val.    p
(Intercept)          -2.11  0.28  -7.63 0.00
Obesity              -0.02  0.01  -3.07 0.00
Physical_Inactivity   0.02  0.01   2.56 0.01
Access_Excercise     -0.03  0.00 -24.68 0.00
Food_Env_Index       -0.38  0.03 -13.01 0.00
SVI                  -1.69  0.11 -15.44 0.00
Urban_RuralUrban     -1.06  0.06 -17.03 0.00

Standard errors: MLE

Model Performance

Code
performance::performance(fit.qpois)
# Indices of model performance

Nagelkerke's R2 |  RMSE | Sigma
-------------------------------
1.000           | 8.610 | 3.137

Incidence Rate Ratio (IRR)

Code
tidy(fit.qpois, exponentiate = TRUE, 
                       conf.int = TRUE)
# A tibble: 7 × 7
  term                estimate std.error statistic   p.value conf.low conf.high
  <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)            0.121   0.277       -7.63 3.44e- 14   0.0698     0.207
2 Obesity                0.976   0.00800     -3.07 2.13e-  3   0.961      0.991
3 Physical_Inactivity    1.02    0.00906      2.56 1.07e-  2   1.01       1.04 
4 Access_Excercise       0.968   0.00133    -24.7  1.01e-118   0.965      0.970
5 Food_Env_Index         0.681   0.0295     -13.0  2.52e- 37   0.644      0.722
6 SVI                    0.184   0.110      -15.4  4.03e- 51   0.149      0.228
7 Urban_RuralUrban       0.345   0.0625     -17.0  3.93e- 61   0.305      0.390
Code
gtsummary::tbl_regression(fit.qpois, exponentiate = TRUE)
Characteristic          exp(Beta)  95% CI      p-value
Obesity                      0.98  0.96, 0.99    0.002
Physical_Inactivity          1.02  1.01, 1.04    0.011
Access_Excercise             0.97  0.97, 0.97   <0.001
Food_Env_Index               0.68  0.64, 0.72   <0.001
SVI                          0.18  0.15, 0.23   <0.001
Urban_Rural
    Rural (reference)           —           —        —
    Urban                    0.35  0.31, 0.39   <0.001
Abbreviation: CI = Confidence Interval

Marginal Effects and Adjusted Predictions

{ggeffects} supports labelled data, and its plot() method automatically sets titles, axis labels, and legend labels based on the value and variable labels of the data.

Code
plot(predict_response(fit.qpois, terms = c("Obesity", "Urban_Rural"), condition = "log_POP_Total"), facets=TRUE)

The effect_plot() function of the {jtools} package plots simple effects in Poisson regression models:

Code
p1 <- jtools::effect_plot(fit.qpois, 
                    main.title = "Obesity", 
                    pred = Obesity, 
                    interval = TRUE,  
                    outcome.scale = "link",
                    partial.residuals = FALSE)
p2 <- jtools::effect_plot(fit.qpois, 
                    main.title = "Physically inactive adults", 
                    pred = Physical_Inactivity, 
                    interval = TRUE,  
                    outcome.scale = "link",
                    partial.residuals = FALSE)
p3 <- jtools::effect_plot(fit.qpois, 
                    main.title = "Access to Exercise", 
                    pred = Access_Excercise, 
                    interval = TRUE,  
                    outcome.scale = "link",
                    partial.residuals = FALSE)
p4 <- jtools::effect_plot(fit.qpois,
                    main.title = "Food Env Index", 
                    pred = Food_Env_Index, 
                    interval = TRUE,  
                    outcome.scale = "link",
                    partial.residuals = FALSE)
library(patchwork)
(p1 + p2) / (p3 + p4)

Fit a Negative Binomial Regression Model

We will fit the negative binomial regression model using glm.nb() function of {MASS} package.

Code
fit.nb <- glm.nb(
       Diabetes_per  ~ 
                Obesity +
                Physical_Inactivity +
                Access_Excercise +
                Food_Env_Index +
                SVI +
                Urban_Rural+
                offset(log_POP_Total),
                data = train)

Model Summary

Code
summary(fit.nb)

Call:
glm.nb(formula = Diabetes_per ~ Obesity + Physical_Inactivity + 
    Access_Excercise + Food_Env_Index + SVI + Urban_Rural + offset(log_POP_Total), 
    data = train, init.theta = 2.033905175, link = log)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.0580592  0.1928039 -10.674  < 2e-16 ***
Obesity             -0.0623632  0.0060850 -10.249  < 2e-16 ***
Physical_Inactivity  0.0270431  0.0069035   3.917 8.95e-05 ***
Access_Excercise    -0.0212910  0.0009096 -23.406  < 2e-16 ***
Food_Env_Index      -0.3049388  0.0211116 -14.444  < 2e-16 ***
SVI                 -1.2840924  0.0758735 -16.924  < 2e-16 ***
Urban_RuralUrban    -1.0691095  0.0388256 -27.536  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.0339) family taken to be 1)

    Null deviance: 5611.1  on 2172  degrees of freedom
Residual deviance: 2418.8  on 2166  degrees of freedom
AIC: 14582

Number of Fisher Scoring iterations: 1

              Theta:  2.0339 
          Std. Err.:  0.0691 

 2 x log-likelihood:  -14566.4880 

Model Performance

Code
performance::performance(fit.nb)
# Indices of model performance

AIC       |      AICc |       BIC | Nagelkerke's R2 |   RMSE | Sigma
--------------------------------------------------------------------
14582.488 | 14582.554 | 14627.959 |           0.833 | 25.062 | 1.000

AIC       | Score_log | Score_spherical
---------------------------------------
14582.488 |    -3.432 |           0.019

Incidence Rate Ratio (IRR)

Code
tidy(fit.nb, exponentiate = TRUE, 
                       conf.int = TRUE)
# A tibble: 7 × 7
  term                estimate std.error statistic   p.value conf.low conf.high
  <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)            0.128  0.193       -10.7  1.34e- 26   0.0877     0.186
2 Obesity                0.940  0.00608     -10.2  1.20e- 24   0.928      0.952
3 Physical_Inactivity    1.03   0.00690       3.92 8.95e-  5   1.01       1.04 
4 Access_Excercise       0.979  0.000910    -23.4  3.68e-121   0.977      0.981
5 Food_Env_Index         0.737  0.0211      -14.4  2.73e- 47   0.707      0.768
6 SVI                    0.277  0.0759      -16.9  2.99e- 64   0.238      0.322
7 Urban_RuralUrban       0.343  0.0388      -27.5  6.48e-167   0.318      0.370
Code
gtsummary::tbl_regression(fit.nb, exponentiate = TRUE)
Characteristic          IRR   95% CI      p-value
Obesity                 0.94  0.93, 0.95   <0.001
Physical_Inactivity     1.03  1.01, 1.04   <0.001
Access_Excercise        0.98  0.98, 0.98   <0.001
Food_Env_Index          0.74  0.71, 0.77   <0.001
SVI                     0.28  0.24, 0.32   <0.001
Urban_Rural
    Rural (reference)      —           —        —
    Urban               0.34  0.32, 0.37   <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Model Comparison

Code
# extract coefficients from the Poisson model using 'coef()'
coef1 = coef(fit.pois)
# extract coefficients from the quasi-Poisson model
coef2 = coef(fit.qpois)
# extract coefficients from the negative binomial model
coef3 = coef(fit.nb)

# extract standard errors from each model
se.coef1 = summary(fit.pois)$coefficients[, 2]
se.coef2 = summary(fit.qpois)$coefficients[, 2]
se.coef3 = summary(fit.nb)$coefficients[, 2]

# use 'cbind()' to combine values into one dataframe
model.all <- cbind(coef1, se.coef1, coef2, se.coef2, coef3, se.coef3 )
model.all
                          coef1     se.coef1       coef2    se.coef2
(Intercept)         -2.11379767 0.0882815165 -2.11379767 0.276970150
Obesity             -0.02458873 0.0025488700 -0.02458873 0.007996701
Physical_Inactivity  0.02315297 0.0028876050  0.02315297 0.009059432
Access_Excercise    -0.03291368 0.0004250686 -0.03291368 0.001333590
Food_Env_Index      -0.38352972 0.0093947071 -0.38352972 0.029474499
SVI                 -1.69211275 0.0349250360 -1.69211275 0.109572115
Urban_RuralUrban    -1.06352403 0.0199055182 -1.06352403 0.062450607
                          coef3    se.coef3
(Intercept)         -2.05805920 0.192803901
Obesity             -0.06236315 0.006084979
Physical_Inactivity  0.02704309 0.006903477
Access_Excercise    -0.02129096 0.000909622
Food_Env_Index      -0.30493883 0.021111551
SVI                 -1.28409241 0.075873523
Urban_RuralUrban    -1.06910948 0.038825613
Code
p1.irr<-jtools::plot_coefs(fit.pois, scale = TRUE, exp = TRUE) +
    ggtitle("Poisson")
p2.irr<-jtools::plot_coefs(fit.qpois, scale = TRUE, exp = TRUE) +
    ggtitle("Quasi-Poisson")
p3.irr<-jtools::plot_coefs(fit.nb, scale = TRUE, exp = TRUE) +
    ggtitle("Negative-Binomial")
p1.irr+p2.irr+p3.irr

The output figure illustrates that the coefficients in the Poisson and Quasi-Poisson models are identical; only the standard errors differ. Specifically, the Quasi-Poisson fit yields larger standard errors (for example, for Food_Env_Index and Urban_Rural) than the Poisson model, because each Poisson standard error is inflated by the square root of the estimated dispersion parameter. The Negative Binomial model likewise accounts for overdispersion, through its extra dispersion parameter \(\theta\), and so also produces larger standard errors than the naive Poisson fit. As a result, the confidence intervals for the coefficients in the Quasi-Poisson and Negative Binomial models are wider, reflecting the lower precision that remains once overdispersion is acknowledged.

AIC (Akaike Information Criterion) / BIC (Bayesian Information Criterion):

  • Both the AIC and BIC are used to compare models based on how well they fit the data, while penalizing for the number of parameters to avoid overfitting.

  • The model with the lower AIC/BIC value is preferred.

  • Because the quasi-Poisson model is not fitted by maximizing a full likelihood, its AIC/BIC are undefined (note the AIC: NA in its summary above); we therefore compare the Poisson and NB fits.

Code
AIC(fit.pois, fit.nb)
         df      AIC
fit.pois  7 21756.41
fit.nb    8 14582.49
Code
BIC(fit.pois,  fit.nb)
         df      BIC
fit.pois  7 21796.20
fit.nb    8 14627.96

Likelihood Ratio Test (LRT):

The Likelihood Ratio Test (LRT) in R is used to compare two nested models, where one model is a simpler version (restricted) of another, more complex model. In the context of count data models, this can be used to compare a simpler Poisson regression model to a more complex model like a Negative Binomial regression model, provided that they are nested.

The LRT tests the null hypothesis that the simpler model fits the data just as well as the more complex one. The test statistic follows a chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models.

A likelihood ratio test can be used to compare models that are nested (i.e., one model is a special case of another). The quasi-Poisson model, however, has no full likelihood, so the LRT cannot be used to compare it with the Poisson model. The test can be applied to likelihood-based models such as Poisson vs. NB, since the Poisson model is the limiting case of the NB model as \(\theta \to \infty\).

Code
# 1. Extract log-likelihoods
logLik.pois <- logLik(fit.pois)
logLik.nb <- logLik(fit.nb)
# 2. Compute the LRT statistic
lrt_stat <- 2 * (logLik.nb - logLik.pois)
# 3. Degrees of freedom difference (df)
df_diff <- attr(logLik.nb, "df") - attr(logLik.pois, "df")
# 4. Compute p-value (chi-squared distribution)
p_value <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
# Output LRT statistic and p-value
cat("LRT Statistic:", lrt_stat, "\n")
LRT Statistic: 7175.922 
Code
cat("Degrees of Freedom Difference:", df_diff, "\n")
Degrees of Freedom Difference: 1 
Code
cat("p-value:", p_value, "\n")
p-value: 0 

Since the p-value is very small (far below 0.05), we reject the null hypothesis that the simpler Poisson model is sufficient. This suggests that the more complex negative binomial model fits the data better.

Prediction Performance

The predict() function will be used to predict the number of diabetes patients in the test counties. This will help validate the predictive accuracy of these regression models.

Code
# Prediction
test$Pred.pois<-predict(fit.pois, test, type = "response")
test$Pred.qpois<-predict(fit.qpois, test, type = "response")
test$Pred.nb<-predict(fit.nb, test, type = "response")
# RMSE
rmse.pois<-Metrics::rmse(test$Diabetes_per, test$Pred.pois)
rmse.qpois<-Metrics::rmse(test$Diabetes_per, test$Pred.qpois)
rmse.nb<-Metrics::rmse(test$Diabetes_per, test$Pred.nb)

cat("RMSE Poisson:", rmse.pois, "\n")
RMSE Poisson: 13.34734 
Code
cat("RMSE Quasi-Poisson:", rmse.qpois, "\n")
RMSE Quasi-Poisson: 13.34734 
Code
cat("RMSE Negative Binomial:", rmse.nb)
RMSE Negative Binomial: 23.25059

Summary and Conclusion

In this R tutorial, we explored the concepts and applications of Poisson, Quasi-Poisson, and Negative Binomial regression models. These models are essential for analyzing count data, where the response variable represents the number of events occurring in a fixed unit of observation.

Initially, we introduced the Poisson regression model, which assumes that the response variable’s mean and variance are equal. However, this assumption may not always be valid in real-world scenarios due to overdispersion, where the variance exceeds the mean. To address this issue, we explored the Quasi-Poisson regression model, which allows for overdispersion by introducing a dispersion parameter.

Furthermore, we discussed the Negative Binomial regression model, which is particularly useful when there is overdispersion and the variance exceeds the mean. This model relaxes the restrictive assumption of equal mean and variance in Poisson regression by incorporating an additional parameter to model the dispersion.

Throughout the tutorial, we provided examples of R code to illustrate the implementation of these regression models using functions such as glm() from the {stats} package and glm.nb() from the {MASS} package. We demonstrated how to fit the models, interpret the coefficients, assess model fit using goodness-of-fit tests, and make predictions.

In conclusion, Poisson, Quasi-Poisson, and Negative Binomial regression models offer valuable tools for analyzing count data in various fields such as epidemiology, ecology, and social sciences. Understanding the differences between these models and when to use each one is crucial for accurate and reliable statistical analysis. While Poisson regression assumes equal mean and variance and is suitable for modeling count data with low dispersion, Quasi-Poisson and Negative Binomial regression models provide flexibility to accommodate overdispersion, making them more robust for real-world data. By mastering these regression techniques and leveraging R’s powerful statistical capabilities, researchers and practitioners can gain deeper insights into count data and make informed decisions based on sound statistical analysis.

References

  1. Introduction: what is overdispersion?

  2. Tutorial: Poisson Regression in R

  3. Roback, P. and Legler, J. (2021). Chapter 4: Poisson Regression. In Beyond Multiple Linear Regression.

  4. Chapter 10 Poisson regression

Session Info

Code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AER_1.2-14             lmtest_0.9-40          zoo_1.8-13            
 [4] car_3.1-3              carData_3.0-5          sandwich_3.1-1        
 [7] epiDisplay_3.5.0.2     nnet_7.3-20            MASS_7.3-64           
[10] survival_3.8-3         foreign_0.8-88         ggpmisc_0.6.1         
[13] ggpp_0.5.8-1           Metrics_0.1.4          patchwork_1.3.0       
[16] ggeffects_2.2.0        marginaleffects_0.25.0 margins_0.3.28        
[19] jtools_2.3.0           performance_0.13.0     gtsummary_2.1.0       
[22] rstatix_0.7.2          dlookr_0.6.3           DataExplorer_0.8.3    
[25] plyr_1.8.9             lubridate_1.9.4        forcats_1.0.0         
[28] stringr_1.5.1          dplyr_1.1.4            purrr_1.0.4           
[31] readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[34] ggplot2_3.5.1          tidyverse_2.0.0       

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1       jsonlite_1.9.0          datawizard_1.0.2       
  [4] magrittr_2.0.3          farver_2.1.2            rmarkdown_2.29         
  [7] vctrs_0.6.5             base64enc_0.1-3         htmltools_0.5.8.1      
 [10] polynom_1.4-1           haven_2.5.4             curl_6.2.1             
 [13] broom_1.0.7             Formula_1.2-5           sass_0.4.9             
 [16] parallelly_1.42.0       htmlwidgets_1.6.4       gt_0.11.1              
 [19] commonmark_1.9.2        networkD3_0.4           igraph_2.1.4           
 [22] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
 [25] sjlabelled_1.2.0        Matrix_1.7-1            R6_2.6.1               
 [28] fastmap_1.2.0           future_1.34.0           shiny_1.10.0           
 [31] digest_0.6.37           showtext_0.9-7          colorspace_2.1-1       
 [34] furrr_0.3.1             labeling_0.4.3          timechange_0.3.0       
 [37] abind_1.4-8             compiler_4.4.2          bit64_4.6.0-1          
 [40] fontquiver_0.2.1        withr_3.0.2             pander_0.6.5           
 [43] backports_1.5.0         prediction_0.3.18       Rttf2pt1_1.3.12        
 [46] broom.mixed_0.2.9.6     quantreg_6.1            tools_4.4.2            
 [49] reactable_0.4.4         httpuv_1.6.15           extrafontdb_1.0        
 [52] glue_1.8.0              nlme_3.1-166            promises_1.3.2         
 [55] grid_4.4.2              generics_0.1.3          gtable_0.3.6           
 [58] labelled_2.14.0         tzdb_0.4.0              data.table_1.17.0      
 [61] hms_1.1.3               utf8_1.2.4              xml2_1.3.6             
 [64] markdown_1.13           pillar_1.10.1           vroom_1.6.5            
 [67] later_1.4.1             splines_4.4.2           lattice_0.22-5         
 [70] showtextdb_3.0          bit_4.5.0.1             SparseM_1.84-2         
 [73] tidyselect_1.2.1        pagedown_0.22           fontLiberation_0.1.0   
 [76] knitr_1.49              fontBitstreamVera_0.1.1 gridExtra_2.3          
 [79] svglite_2.1.3           xfun_0.51               stringi_1.8.4          
 [82] yaml_2.3.10             kableExtra_1.4.0        evaluate_1.0.3         
 [85] codetools_0.2-20        extrafont_0.19          gdtools_0.4.1          
 [88] cli_3.6.4               xtable_1.8-4            systemfonts_1.2.1      
 [91] munsell_0.5.1           Rcpp_1.0.14             globals_0.16.3         
 [94] parallel_4.4.2          MatrixModels_0.5-3      listenv_0.9.1          
 [97] broom.helpers_1.20.0    viridisLite_0.4.2       hrbrthemes_0.8.7       
[100] scales_1.3.0            sysfonts_0.8.9          crayon_1.5.3           
[103] insight_1.1.0           rlang_1.1.5             cards_0.5.0.9000