Count data is commonly encountered in various fields, such as ecology, epidemiology, and social sciences. Poisson regression is a widely used model for analyzing this type of data, where the outcome variable represents the count of occurrences of a specific event. For instance, in epidemiology, researchers might apply Poisson regression to model the number of new disease cases that arise per unit of time within a population.
This tutorial will provide a comprehensive introduction to Poisson regression for count data in R. We will begin with an overview of the Poisson regression model and then implement it from scratch to gain a deeper understanding of its mechanics. Next, we will learn how to fit the model using R’s glm() function for practical applications. Finally, we will explore evaluation techniques, including calculating the incidence rate ratio (IRR), a crucial interpretive tool in count data models.
Overview
Poisson Regression is a type of Generalized Linear Model (GLM) used for modeling count data. It is particularly useful when the dependent variable represents the number of occurrences of an event over a fixed period of time, space, or some other exposure, and the data follow a Poisson distribution. The standard Poisson regression model considers only the number of events and does not account for any additional exposure or observation time.
Key characteristics of Poisson Regression:
Count Data: The response variable \(Y\) is a non-negative integer representing counts (0, 1, 2, …).
Poisson Distribution: The response variable \(Y\) follows a Poisson distribution with parameter \(\lambda\) where \(\lambda\) is the expected count. The Poisson distribution assumes that the mean and variance of the response variable are equal.
\[ Y_i \sim \text{Poisson}(\lambda_i) \]
Log Link Function: The model uses a logarithmic link function to relate the linear predictor to the Poisson mean, \(\lambda\), which ensures that the expected value \(\lambda\) is positive.
\[ \log(\lambda_i) = X_i \beta \]
Where:
- \(\lambda_i = E(Y_i \mid X_i)\) is the expected count for the \(i^{th}\) observation.
- \(X_i\) is the vector of predictor variables for the \(i^{th}\) observation.
- \(\beta\) is the vector of coefficients to be estimated.
Model Formulation:
The Poisson regression model assumes that the log of the expected count (rate) of the dependent variable \(Y\) is a linear combination of the predictor variables \(X_1, X_2, \dots, X_p\). The model can be written as:

\[ \log(\lambda_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} \]

Here:
- \(\lambda_i\) is the expected count (mean) of the dependent variable for the \(i^{th}\) observation.
- \(\beta_0\) is the intercept.
- \(\beta_1, \dots, \beta_p\) are the regression coefficients that quantify the effect of the predictor variables on the expected count.
Interpretation of Coefficients:
Intercept \(\beta_0\): The log of the expected count when all predictors are zero.
Predictor Coefficients \(\beta_1, \beta_2, \dots\): A one-unit increase in a predictor \(X_j\) is associated with a multiplicative effect on the expected count, given by \(e^{\beta_j}\). For example:
\[ e^{\beta_j} = \frac{\lambda_i(\text{new value of } X_j)}{\lambda_i(\text{old value of } X_j)} \]
This indicates the factor by which the expected count \(\lambda_i\) changes for a one-unit increase in \(X_j\), holding other variables constant.
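For instance, if a hypothetical coefficient is \(\beta_j = 0.3\), the multiplicative effect can be computed directly in R:

```r
# Multiplicative effect of a one-unit increase in X_j
beta_j <- 0.3   # hypothetical coefficient value
exp(beta_j)     # approximately 1.35, i.e. a 35% increase in the expected count
```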
Poisson Probability Mass Function (PMF):
The probability mass function of the Poisson distribution is given by:

\[ P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots \]

Where \(\lambda\) is the mean count (expected value) and \(y\) is the observed count.
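In R, this PMF is implemented by dpois(); a quick check of the formula against the built-in function:

```r
# P(Y = 3) when lambda = 2, computed from the PMF formula and with dpois()
lambda <- 2
y <- 3
manual <- lambda^y * exp(-lambda) / factorial(y)
c(manual = manual, dpois = dpois(y, lambda))  # both approximately 0.180
```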
Interpretation of Results:
The coefficients from the model will indicate how a one-unit change in the predictor (e.g., age or income) affects the log of the expected count of accidents.
The exponentiated coefficients (exp(coef)) can be interpreted as the multiplicative change in the expected number of accidents for a one-unit increase in the predictor.
Model Assumptions:
Mean-Variance Equality: The Poisson model assumes that the mean and variance of the response variable are equal. If the data exhibit overdispersion (variance greater than the mean), the standard Poisson model may not be appropriate, and you may need to use an alternative like the Negative Binomial regression.
Independence of Observations: The counts for each observation are assumed to be independent.
Linearity on Log Scale: The log of the expected count is assumed to be a linear function of the predictor variables.
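The mean-variance equality assumption is easy to illustrate with simulated draws:

```r
# For Poisson draws, the sample mean and sample variance should be close,
# illustrating the mean-variance equality assumption
set.seed(1)
y_sim <- rpois(10000, lambda = 4)
mean(y_sim)  # close to 4
var(y_sim)   # also close to 4
```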
This model is widely used for count data where the occurrence of events is rare and spread over time or space, such as disease incidence, traffic accidents, or insurance claims.
Standard Poisson Regression Model
Here’s how we can fit a standard Poisson regression model manually in R using count data. Let’s assume your dataset contains a response variable y (count data) and four predictor variables \(X_1, X_2, \dots, X_4\).
Create Dataset
Code
# Set seed for reproducibility
set.seed(123)

# Generate sample data with 4 predictors
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)

# True coefficients
beta0_true <- 0.5
beta1_true <- 0.3
beta2_true <- -0.2
beta3_true <- 0.4
beta4_true <- 0.1

# Calculate lambda (mean of the Poisson) for each observation
lambda <- exp(beta0_true + beta1_true * x1 + beta2_true * x2 +
              beta3_true * x3 + beta4_true * x4)

# Generate response variable y as count data
y <- rpois(n, lambda)

# Combine into a data frame
data <- data.frame(y, x1, x2, x3, x4)
head(data)
Use the optim() function to maximize the log-likelihood and obtain the parameter estimates. optim() is a general-purpose optimizer based on the Nelder–Mead, quasi-Newton, and conjugate-gradient algorithms; it also includes options for box-constrained optimization and simulated annealing.
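The objective function poisson_log_likelihood passed to optim() below is not shown in this chunk; here is a minimal sketch, assuming the simulated data frame from above (since optim() minimizes by default, the function returns the negative log-likelihood):

```r
# Negative Poisson log-likelihood as a function of the parameter vector
# (intercept plus four slopes); uses the simulated `data` frame from above
poisson_log_likelihood <- function(params) {
  X <- as.matrix(data[, c("x1", "x2", "x3", "x4")])
  lambda <- exp(params[1] + X %*% params[2:5])
  # dpois(..., log = TRUE) returns y * log(lambda) - lambda - log(y!)
  -sum(dpois(data$y, lambda, log = TRUE))
}
```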
Code
# Initial guesses for beta0 to beta4
initial_params <- c(0, 0, 0, 0, 0)

# Use optim() to find the MLE for the parameters
fit <- optim(
  par = initial_params,         # Initial values for the parameters to be optimized over
  fn = poisson_log_likelihood,  # Objective function
  hessian = TRUE                # Return a numerically differentiated Hessian matrix
)

# Extract parameter estimates
params_hat <- fit$par
cat("Estimated coefficients:", params_hat, "\n")
After fitting the Poisson regression model manually with optim(), we can calculate the standard errors, Z-scores, and P-values of the estimated coefficients. Here’s how:
Calculate the Hessian matrix: The Hessian matrix is the matrix of second derivatives of the log-likelihood function with respect to the parameters. The inverse of the Hessian provides an estimate of the variance-covariance matrix of the parameter estimates.
Extract standard errors from the variance-covariance matrix.
Calculate Z-scores and P-values based on standard errors.
Let’s implement these steps in R.
Code
# Invert the Hessian matrix to get the covariance matrix
cov_matrix <- solve(fit$hessian)

# Standard errors
std_errors <- sqrt(diag(cov_matrix))

# Z-scores
z_scores <- params_hat / std_errors

# P-values (2-sided)
p_values <- 2 * (1 - pnorm(abs(z_scores)))

# Create summary statistics table
summary_table <- data.frame(
  Estimate  = params_hat,
  Std.Error = std_errors,
  Z.value   = z_scores,
  P.value   = p_values
)

# Set row names as parameter names
rownames(summary_table) <- c("Intercept", "x1", "x2", "x3", "x4")

# Display summary table
print(summary_table)
This manual approach should yield parameter estimates that are very close to those obtained by glm(), as both methods maximize the same likelihood function.
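As a quick sanity check (assuming the simulated data frame from above), the same model fitted with glm() should produce nearly identical coefficients:

```r
# Fit the same model with glm() and compare against the optim() estimates
fit.glm <- glm(y ~ x1 + x2 + x3 + x4, data = data,
               family = poisson(link = "log"))
coef(fit.glm)  # should closely match params_hat
```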
Standard Poisson Model with R
In this exercise we will develop a standard Poisson model in R with the built-in function glm() to explain the variability of the total diagnosed diabetes count per county (count data) in the USA.
Check and Install Required R packages
The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:
Physical_Inactivity: % adult access to exercise opportunities (County Health Ranking)
SVI: Level of social vulnerability in the county relative to other counties in the nation or within the state. Social vulnerability refers to the potential negative effects on communities caused by external stresses on human health. The CDC/ATSDR Social Vulnerability Index (SVI) ranks all US counties on 15 social factors, including poverty, lack of vehicle access, and crowded housing, and groups them into four related themes. (CDC/ATSDR Social Vulnerability Index (SVI))
Food_Env_Index: Measure of access to healthy food. The Food Environment Index ranges from 0 (worst) to 10 (best) and equally weights two indicators: 1) limited access to healthy foods, based on the distance an individual lives from a grocery store, supermarket, or other location for healthy food purchases; and 2) food insecurity, defined as the inability to access healthy food because of cost barriers. (County Health Ranking)
We will use the read_csv() function of the {readr} package to import the data as a tibble.
Rows: 3107 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): State, County, Urban_Rural
dbl (12): FIPS, X, Y, POP_Total, Diabetes_count, Diabetes_per, Obesity, Acce...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# data processing
df$Diabetes_count <- as.integer(df$Diabetes_count)
df$Urban_Rural <- as.factor(df$Urban_Rural)
Data Description
The {epiDisplay} package can provide both numerical and categorical statistics simultaneously with the codebook() function. It’s a great tool for descriptive statistics.
Code
epiDisplay::codebook(df[,1:7])
Diabetes_count :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Diabetes_count 3107 7733.95 2078 23560.52 17 703508
==================
Obesity :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Obesity 3107 27.59 27.86 4.8 12.06 42.12
==================
Physical_Inactivity :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Physical_Inactivity 3107 21.16 20.84 4.3 9.64 36.72
==================
Access_Excercise :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Access_Excercise 3107 61.98 64 21.78 0 100
==================
Food_Env_Index :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Food_Env_Index 3107 7.32 7.5 1.09 1.6 10
==================
SVI :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 SVI 3107 0.5 0.5 0.29 0 1
==================
Urban_Rural :
No. of observations = 3107
Var. name obs. mean median s.d. min. max.
1 Urban_Rural
==================
Density Plot
Code
ggplot(df, aes(Diabetes_count)) +
  geom_density() +
  # x-axis title
  xlab("Diabetes count per county") +
  # y-axis title
  ylab("Density") +
  # plot title
  ggtitle("Kernel Density of Diabetes") +
  theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5)
  )
Code
# Create a table
flextable::flextable(summarise_diabetes, theme_fun = theme_booktabs)
Urban_Rural        Mean    Median   Min       Max          SD        SE
Rural          1,169.59       875    17     7,367    1,037.88    28.676
Urban         12,519.33     4,438    24   703,508   30,080.86   709.604
Barplot - Urban vs Rural
Code
ggplot(summarise_diabetes, aes(x = Urban_Rural, y = Mean)) +
  geom_bar(stat = "identity", position = position_dodge(),
           width = 0.5, fill = "gray") +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = .2,
                position = position_dodge(.9)) +
  # add y-axis title and leave x-axis title blank
  labs(y = "Diabetes count per county", x = "") +
  # add plot title
  ggtitle("Mean ± SE of Diabetes") +
  coord_flip() +
  # customize plot themes
  theme(
    axis.line = element_line(colour = "gray"),
    # plot title position at center
    plot.title = element_text(hjust = 0.5),
    # axis title font size
    axis.title.x = element_text(size = 14),
    # x- and y-axis font size
    axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5, colour = "black"),
    axis.text.x = element_text(size = 12)
  )
Correlation
The plot.correlate() function of the {dlookr} package visualizes the correlation matrix of a data frame:
We will use the ddply() function of the {plyr} package to split the diabetes data into homogeneous subgroups using stratified random sampling. This method involves dividing the population into strata and taking random samples from each stratum to ensure that each subgroup is proportionally represented in the sample. The goal is to obtain a representative sample of the population by adequately representing each stratum.
# Density plot: all, train and test data
ggplot() +
  geom_density(data = df, aes(Diabetes_count)) +
  geom_density(data = train, aes(Diabetes_count), color = "green") +
  geom_density(data = test, aes(Diabetes_count), color = "red") +
  xlab("Total Diabetes") +
  ylab("Density")
Fit a standard Poisson Model
We will fit a Poisson regression model using the glm() function in R. We specify family = poisson(link = "log") to indicate that we want to fit a Poisson regression model. Here we model the diabetes rate per county, The offset variable, here is log population per county need to be defined in model. This offset variable adjusts for the differing number of diabetes patients in different population levels per county.
The summary() function produces result summaries of the output of model fitting functions.
Code
summary(fit.pois)
Call:
glm(formula = Diabetes_count ~ ., family = poisson(link = "log"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.518e+00 3.944e-03 -384.8 <2e-16 ***
Obesity 1.073e-02 7.306e-05 146.9 <2e-16 ***
Physical_Inactivity 2.464e-02 8.873e-05 277.7 <2e-16 ***
Access_Excercise 5.281e-02 1.933e-05 2731.9 <2e-16 ***
Food_Env_Index 4.435e-01 3.878e-04 1143.8 <2e-16 ***
SVI 2.274e+00 1.272e-03 1787.5 <2e-16 ***
Urban_RuralUrban 1.415e+00 1.039e-03 1361.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 44477976 on 2172 degrees of freedom
Residual deviance: 15615853 on 2166 degrees of freedom
AIC: 15636634
Number of Fisher Scoring iterations: 6
The adequacy of a model is typically judged by the difference between the null deviance and the residual deviance, with a larger drop indicating a better fit. The null deviance is the value obtained when the model contains only the intercept and no predictors, while the residual deviance is the value obtained when all predictors are included. The model can be deemed an appropriate fit when the difference between the two values is sufficiently large.
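This comparison can be made formal with a chi-squared test on the drop in deviance; a short illustration using the values reported in the summary above:

```r
# Null deviance minus residual deviance, compared to a chi-squared
# distribution with df equal to the number of estimated predictor terms
null_dev  <- 44477976
resid_dev <- 15615853
df_diff   <- 2172 - 2166   # difference in residual degrees of freedom
pchisq(null_dev - resid_dev, df = df_diff, lower.tail = FALSE)
# effectively zero: the predictors improve the fit substantially
```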
To obtain the AIC (Akaike Information Criterion) values for a GLM model in R, you can use the AIC() function applied to the fitted model. The lower the AIC value, the better the model fits the data while penalizing for the number of parameters. You can compare AIC values between different models to assess their relative goodness-of-fit.
The report() function of the {report} package generates a brief report of the fitted model:
Code
report::report(fit.pois)
We fitted a poisson model (estimated using ML) to predict Diabetes_count with
Obesity, Physical_Inactivity, Access_Excercise, Food_Env_Index, SVI and
Urban_Rural (formula: Diabetes_count ~ Obesity + Physical_Inactivity +
Access_Excercise + Food_Env_Index + SVI + Urban_Rural). The model's explanatory
power is substantial (Nagelkerke's R2 = 1.00). The model's intercept,
corresponding to Obesity = 0, Physical_Inactivity = 0, Access_Excercise = 0,
Food_Env_Index = 0, SVI = 0 and Urban_Rural = Rural, is at -1.52 (95% CI
[-1.53, -1.51], p < .001). Within this model:
- The effect of Obesity is statistically significant and positive (beta = 0.01,
95% CI [0.01, 0.01], p < .001; Std. beta = 0.05, 95% CI [0.05, 0.05])
- The effect of Physical Inactivity is statistically significant and positive
(beta = 0.02, 95% CI [0.02, 0.02], p < .001; Std. beta = 0.11, 95% CI [0.11,
0.11])
- The effect of Access Excercise is statistically significant and positive
(beta = 0.05, 95% CI [0.05, 0.05], p < .001; Std. beta = 1.16, 95% CI [1.15,
1.16])
- The effect of Food Env Index is statistically significant and positive (beta
= 0.44, 95% CI [0.44, 0.44], p < .001; Std. beta = 0.48, 95% CI [0.48, 0.48])
- The effect of SVI is statistically significant and positive (beta = 2.27, 95%
CI [2.27, 2.28], p < .001; Std. beta = 0.66, 95% CI [0.66, 0.66])
- The effect of Urban Rural [Urban] is statistically significant and positive
(beta = 1.41, 95% CI [1.41, 1.42], p < .001; Std. beta = 1.41, 95% CI [1.41,
1.42])
Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald z-distribution approximation.
The {jtools} package provides a series of functions to create a summary of the Poisson regression model:
Code
jtools::summ(fit.pois)
Observations: 2173
Dependent variable: Diabetes_count
Type: Generalized linear model
Family: poisson
Link: log

χ²(6): 28862122.79
p: 0.00
Pseudo-R² (Cragg-Uhler): 1.00
Pseudo-R² (McFadden): 0.65
AIC: 15636634.19
BIC: 15636673.98

                       Est.   S.E.    z val.      p
(Intercept)           -1.52   0.00   -384.85   0.00
Obesity                0.01   0.00    146.91   0.00
Physical_Inactivity    0.02   0.00    277.74   0.00
Access_Excercise       0.05   0.00   2731.86   0.00
Food_Env_Index         0.44   0.00   1143.79   0.00
SVI                    2.27   0.00   1787.55   0.00
Urban_RuralUrban       1.41   0.00   1361.17   0.00

Standard errors: MLE
We utilized the R package {sandwich} to obtain the robust standard errors and subsequently calculated the p-values. Additionally, we computed the 95% confidence interval using the parameter estimates and their robust standard errors.
The goodness of fit of the Poisson model can be assessed with the poisgof() function of the {epiDisplay} package:
Code
epiDisplay::poisgof(fit.pois)
$results
[1] "Goodness-of-fit test for Poisson assumption"
$chisq
[1] 15615853
$df
[1] 2166
$p.value
[1] 0
Model Performance
The performance() function of the {performance} package computes indices of model performance for the Poisson model.
Nagelkerke’s \(R^2\) (the Nagelkerke pseudo-\(R^2\)) is a measure of the proportion of variance explained by a likelihood-based model; it is most often described for logistic regression but applies to other GLMs, including the Poisson model fitted here. It adapts Cox and Snell’s \(R^2\) to overcome the limitation that the latter has a maximum value less than 1. Nagelkerke’s \(R^2\) ranges from 0 to 1 and provides a measure of the overall fit of the model:

\[ R^2_{CS} = 1 - \exp\left\{\frac{2}{n}\left(\ell_{\text{null}} - \ell_{\text{model}}\right)\right\}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - \exp\left\{\frac{2}{n}\,\ell_{\text{null}}\right\}} \]

Where:
- \(\ell_{\text{model}}\) is the log-likelihood of the fitted model.
- \(\ell_{\text{null}}\) is the log-likelihood of the null model (a model with only the intercept term).
- \(n\) is the total number of observations in the dataset.

Nagelkerke’s \(R^2\) provides a useful measure to evaluate goodness of fit, but it should be interpreted with caution, especially when the model has categorical predictors or interactions. Additionally, like other \(R^2\) measures, it does not indicate the quality of predictions made by the model.
Model Diagnostics
The package {performance} provides many functions to check model assumptions, like check_overdispersion(), check_zeroinflation().
Check for Overdispersion
Overdispersion occurs when the observed variance in the data is higher than the expected variance from the model assumption (for Poisson, variance roughly equals the mean of an outcome). check_overdispersion() checks if a count model (including mixed models) is overdispersed or not.
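The dispersion ratio reported by this check can also be computed by hand; a sketch, assuming the fitted fit.pois from above:

```r
# Manual overdispersion check: Pearson chi-squared over residual df;
# a ratio far above 1 indicates overdispersion
pearson_chisq <- sum(residuals(fit.pois, type = "pearson")^2)
pearson_chisq / df.residual(fit.pois)
```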
Code
performance::check_overdispersion(fit.pois)
# Overdispersion test
dispersion ratio = 11872.168
Pearson's Chi-Squared = 25715115.315
p-value = < 0.001
Overdispersion detected.
Overdispersion can be addressed either by modelling the dispersion parameter (not possible with all packages) or by choosing a different distributional family, such as quasi-Poisson or negative binomial (see Gelman and Hill 2007).
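As a sketch (assuming the same train data), the two common remedies look like this; glm.nb() comes from the {MASS} package:

```r
# Quasi-Poisson: same mean structure, but a dispersion parameter is estimated
fit.qpois <- glm(Diabetes_count ~ ., data = train,
                 family = quasipoisson(link = "log"))

# Negative binomial: adds a separate variance parameter (theta)
library(MASS)
fit.nb <- glm.nb(Diabetes_count ~ ., data = train)
```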
Check for Zero-inflation
Zero-inflation (in (Quasi-)Poisson models) is indicated when the amount of observed zeros is larger than the amount of predicted zeros, so the model is underfitting zeros. In such cases, it is recommended to use negative binomial or zero-inflated models.
Use check_zeroinflation() to check if zero-inflation is present in the fitted model.
Code
performance::check_zeroinflation(fit.pois)
Model has no observed zeros in the response variable.
NULL
Check for Singular Model Fits
A “singular” model fit means that some dimensions of the variance-covariance matrix have been estimated as exactly zero. This often occurs for mixed models with overly complex random effects structures.
check_singularity() checks mixed models (of class lme, merMod, glmmTMB or MixMod) for singularity, and returns TRUE if the model fit is singular.
Code
check_singularity(fit.pois)
[1] FALSE
Visualization of Model Assumptions
To get a comprehensive check and visualization, use check_model().
Code
performance::check_model(fit.pois)
Incidence Rate Ratio (IRR)
The Incidence Rate Ratio (IRR) is a measure commonly used in epidemiology and other fields to quantify the association between an exposure or predictor variable and an outcome, particularly when dealing with count data. It is often used in the context of Poisson regression models.
In Poisson regression, the exponentiated coefficients (i.e., exponentiated regression coefficients) are interpreted as Incidence Rate Ratios. Specifically, for a given predictor variable, the IRR represents the multiplicative change in the rate of the outcome for each unit change in the predictor variable.
Mathematically, if \(\beta\) is the coefficient estimate of a predictor variable in a Poisson regression model, then the corresponding IRR, denoted \(\text{IRR}\), is calculated as:
\[ \text{IRR} = e^{\beta} \]
where \(e\) is the base of the natural logarithm (approximately equal to 2.718).
Interpretation of the IRR:
If \(\text{IRR} = 1\), it implies that there is no association between the predictor variable and the outcome.
If \(\text{IRR} > 1\), it indicates that an increase in the predictor variable is associated with an increased incidence rate (or risk) of the outcome.
If \(\text{IRR} < 1\), it suggests that an increase in the predictor variable is associated with a decreased incidence rate (or risk) of the outcome.
For example, if the IRR associated with a particular exposure is 1.5, it means that the incidence rate of the outcome is 1.5 times higher in the exposed group compared to the unexposed group, all else being equal.
The IRR provides a convenient way to quantify and interpret the strength of association between predictor variables and outcomes in Poisson regression models, particularly when dealing with count data and incidence rates.
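For the fitted model, IRRs and Wald confidence intervals can be obtained by exponentiating the coefficients (a sketch, assuming fit.pois from above):

```r
# IRRs with 95% Wald confidence intervals on the response scale
irr <- exp(coef(fit.pois))
ci  <- exp(confint.default(fit.pois))  # Wald intervals computed on the log scale
cbind(IRR = irr, ci)
```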
Let’s estimate the IRR for Obesity, whose coefficient is 0.01073372. The exponent of this value is exp(0.01073372) ≈ 1.011, meaning each one-unit increase in Obesity multiplies the expected count by about 1.011.
The tbl_regression() function from the {gtsummary} package takes a regression model object as input and produces a formatted table with the model coefficients (log(IRR)) and confidence intervals.
Code
gtsummary::tbl_regression(fit.pois)
Characteristic          log(IRR)   95% CI       p-value
Obesity                   0.01     0.01, 0.01   <0.001
Physical_Inactivity       0.02     0.02, 0.02   <0.001
Access_Excercise          0.05     0.05, 0.05   <0.001
Food_Env_Index            0.44     0.44, 0.44   <0.001
SVI                       2.3      2.3, 2.3     <0.001
Urban_Rural
  Rural                   —        —
  Urban                   1.4      1.4, 1.4     <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Based on this table, we may interpret the results as follows:
Urban counties have a higher expected number of diabetes cases than rural counties: the coefficient for Urban is 1.41 on the log scale, corresponding to an IRR of exp(1.41) ≈ 4.1, while controlling for the effects of the other variables.
A one-unit increase in Obesity multiplies the expected number of diabetes cases by exp(0.011) ≈ 1.011 (about a 1.1% increase), while controlling for the effects of the other variables.
Marginal Effects and Adjusted Predictions
If we want the marginal effect of Obesity, we can use the margins() function of the {margins} package:
Code
margins::margins(fit.pois, variables ="Obesity")
Average marginal effects
glm(formula = Diabetes_count ~ ., family = poisson(link = "log"), data = train)
Obesity
83.35
The predict_response() function of {ggeffects} calculates predicted counts, and its plot() method automatically sets the title, axis labels, and legend labels based on the value and variable labels of the data.
Code
res <- predict_response(fit.pois, terms = c("Obesity", "Urban_Rural"))
res
The predict() function will be used to predict the number of diabetes patients in the test counties. This will help validate the predictive accuracy of the regression model.
Code
test$Pred.diabetes <- predict(fit.pois, test, type = "response")
Metrics::rmse(test$Diabetes_count, test$Pred.diabetes)
Poisson regression is a useful tool for modeling count data, where the response variable represents the number of occurrences of an event. It offers a simple and effective way to analyze the relationship between predictor variables and counts, especially when the event count data follow a Poisson distribution.
In R, you can fit a Poisson regression model using the glm() function. The interpretation of model coefficients is meaningful when we are interested in understanding how predictor variables influence the log of expected counts. However, if overdispersion (where variance exceeds the mean) is detected, alternative models such as Negative Binomial regression may provide better results.
The tutorial covers both the theoretical foundation of Poisson regression and practical steps to implement, evaluate, and interpret it in R, providing users with a solid framework for working with count data in various applications.