6.2. Poisson Regression Model with Offset (Rate Data)

In many real-world scenarios, we’re interested not just in the count of occurrences of an event but also in the rate at which the event occurs over a specific exposure period (e.g., the number of road accidents per mile or cases of a disease per population over a set timeframe). This type of data is often called “rate data” and is particularly well-suited to Poisson regression with an offset. An offset variable allows us to account for the varying exposure times or population sizes in our data, making it possible to model the rate of events rather than simply the count.

This tutorial will dive into Poisson regression with an offset, which is essential when analyzing rate data. We’ll start with a model overview, discussing the purpose and assumptions of Poisson regression with an offset term. We’ll then move on to fitting the model both from scratch (to gain insights into how it works) and with R’s convenient glm() function. Following this, we’ll examine model diagnosis techniques to ensure the model fits the data well, discuss model evaluation methods, and conclude with the interpretation of incidence rate ratios (IRRs), which help quantify the predictors’ effect on the rate outcome.

Overview

Poisson regression is commonly used to model rates instead of simple counts in practical applications. For example, if you want to model the rate of events happening over time, space, or another unit of exposure (such as the number of accidents per mile driven or the number of diseases per population), you need to consider the varying levels of exposure across different observations. This is where the concept of an offset comes into play. An offset is a known quantity (the exposure) that adjusts the model to accurately represent the rate of event occurrence.

What is an Offset?

  • The offset in Poisson regression represents the logarithm of the exposure variable. The exposure variable is the denominator in the rate calculation (e.g., time, population, or space).
  • The offset allows us to model the number of events per unit of exposure, rather than just raw counts.

The standard Poisson regression model for counts is:

\[ \log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} \]

Where:

  • \(\lambda_i\) is the expected count for the \(i^{th}\) observation.
  • \(X_{1i}, X_{2i}, \dots, X_{pi}\) are the predictor variables.
  • \(\beta_0, \beta_1, \dots, \beta_p\) are the regression coefficients.

In the Poisson regression with offset, we modify the model to incorporate the exposure variable \(e_i\). The offset term is the log of the exposure, ensuring that the response variable is modeled as a rate rather than a simple count.

The model becomes:

\[ \log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \log(e_i) \]

Or equivalently:

\[ \lambda_i = e_i \exp(\beta_0 + \beta_1 X_{1i} + \dots + \beta_p X_{pi}) \]

Where:

  • \(e_i\) is the exposure (e.g., time, population size, or area).
  • \(\lambda_i\) is now the expected rate of events per unit of exposure.
  • \(\log(e_i)\) is added as a known term (the offset) and is not estimated as part of the model.

Why Use an Offset?

The offset is critical when:

  1. Exposure varies across observations: In many datasets, each observation may have a different level of exposure (e.g., different time periods, population sizes, or geographic areas). The offset adjusts for these differences.

  2. Modeling rates: Poisson regression with an offset models rates (e.g., the number of events per unit of exposure) rather than raw counts.

Without the offset, the model would incorrectly treat counts as if all observations had the same exposure, biasing the estimates whenever exposure actually varies. Adding the known offset term removes this bias and lets the standard Poisson machinery model rates directly.
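To make this concrete, here is a minimal, self-contained sketch (simulated data; all names are illustrative) comparing a Poisson fit that includes the offset against a naive fit that ignores exposure:

```r
# Simulate counts whose exposure varies across observations
set.seed(42)
n        <- 500
exposure <- runif(n, 1, 10)                 # e.g., person-years observed
x        <- rnorm(n)
lambda   <- exposure * exp(0.2 + 0.5 * x)   # true rate = exp(0.2 + 0.5 x) per unit exposure
y        <- rpois(n, lambda)

# Correct model: counts with log(exposure) as a known offset
fit_offset <- glm(y ~ x + offset(log(exposure)), family = poisson)

# Naive model: same counts, exposure ignored
fit_naive <- glm(y ~ x, family = poisson)

coef(fit_offset)  # close to the true values (0.2, 0.5)
coef(fit_naive)   # intercept absorbs the average exposure and is inflated
```

The offset model recovers the true rate coefficients because \(\log(e_i)\) enters with a fixed coefficient of 1, while the naive model folds the average exposure into its intercept.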

Poisson Model from Scratch

Let’s go through the process of fitting a Poisson model with an offset and four predictors manually in R. Here’s the breakdown:

  • Create the data: Simulate count data with four predictors and an offset.
  • Specify the Poisson Model with Offset: Write the log-likelihood function to include the offset.
  • Optimize the Parameters: Use the optim() function to maximize the log-likelihood.
  • Create a Summary Statistics Table: Calculate standard errors and create a summary table.
  • Check for Overdispersion: Calculate the dispersion statistic to check for overdispersion.

Create the Data

For this example, we’ll simulate some count data, including four predictors and an offset variable.

Code
set.seed(123)
n <- 100  # sample size

# Generate predictors
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
offset <- log(1 + rnorm(n, mean = 5, sd = 1))  # Offset term

# Generate response variable y using a true model with coefficients
beta <- c(0.5, 0.3, -0.2, 0.4, 0.1)
lambda <- exp(beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3 + beta[5] * x4 + offset)
y <- rpois(n, lambda)

# Combine into a data frame
data <- data.frame(y, x1, x2, x3, x4, offset)
head(data)
   y          x1          x2         x3         x4   offset
1 18 -0.56047565 -0.71040656  2.1988103 -0.7152422 1.779424
2 14 -0.23017749  0.25688371  1.3124130 -0.7526890 1.575126
3 11  1.55870831 -0.24669188 -0.2651451 -0.9385387 1.679943
4 10  0.07050839 -0.34754260  0.5431941 -1.0525133 1.786941
5 10  0.12928774 -0.95161857 -0.4143399 -0.4371595 1.897724
6  9  1.71506499 -0.04502772 -0.4762469  0.3311792 1.470050

Specify the Poisson Model with Offset

The Poisson regression model with an offset can be specified as follows:

\[ \log(\lambda) = \beta_0 + \beta_1 \cdot x1 + \beta_2 \cdot x2 + \beta_3 \cdot x3 + \beta_4 \cdot x4 + \text{offset} \]

Here’s the log-likelihood function for this model:

Code
# Log-likelihood function for Poisson regression with offset
poisson_log_likelihood_offset <- function(params) {
  beta0 <- params[1]
  beta1 <- params[2]
  beta2 <- params[3]
  beta3 <- params[4]
  beta4 <- params[5]
  
  # Compute lambda (expected counts)
  lambda <- exp(beta0 + beta1 * data$x1 + beta2 * data$x2 + beta3 * data$x3 + beta4 * data$x4 + data$offset)
  
  # Log-likelihood for Poisson with offset
  log_likelihood <- sum(dpois(data$y, lambda, log = TRUE))
  
  return(-log_likelihood)  # Negative for minimization
}

Optimize the Parameters

Now we’ll use optim() to estimate the parameters by maximizing the log-likelihood function.

Code
# Initial guesses for beta coefficients
initial_params <- rep(0, 5)

# Use optim to find MLEs for the coefficients
fit <- optim(par = initial_params, fn = poisson_log_likelihood_offset, hessian = TRUE)

# Extract parameter estimates and standard errors
coefficients <- fit$par
std_errors <- sqrt(diag(solve(fit$hessian)))

# Display estimated coefficients
cat("Estimated coefficients:\n", coefficients, "\n")
Estimated coefficients:
 0.4782902 0.3155381 -0.1964496 0.4065897 0.1329681 
Code
cat("Standard errors:\n", std_errors, "\n")
Standard errors:
 0.03480884 0.03106314 0.03297209 0.03000168 0.02880428 

Create a Summary Statistics Table

To summarize the results, let’s calculate the Z-scores and p-values for each coefficient.

Code
# Calculate Z-scores and p-values
z_scores <- coefficients / std_errors
p_values <- 2 * (1 - pnorm(abs(z_scores)))

# Create summary table
summary_table <- data.frame(
  Coefficient = coefficients,
  Std_Error = std_errors,
  Z_value = z_scores,
  P_value = p_values
)

# Assign row names for clarity
row.names(summary_table) <- c("Intercept", "x1", "x2", "x3", "x4")

# Display summary table
print(summary_table)
          Coefficient  Std_Error   Z_value      P_value
Intercept   0.4782902 0.03480884 13.740481 0.000000e+00
x1          0.3155381 0.03106314 10.157957 0.000000e+00
x2         -0.1964496 0.03297209 -5.958056 2.552555e-09
x3          0.4065897 0.03000168 13.552230 0.000000e+00
x4          0.1329681 0.02880428  4.616262 3.907138e-06

Check for Overdispersion

To check for overdispersion, calculate the dispersion statistic, which is the ratio of the residual deviance to the degrees of freedom.

Code
# Estimates
estimates <- fit$par
names(estimates) <- c("beta0", "beta1", "beta2", "beta3", "beta4")
# Calculate fitted values
lambda_hat <- exp(estimates[1] + estimates[2] * data$x1 + estimates[3] * data$x2 +
                  estimates[4] * data$x3 + estimates[5] * data$x4 + data$offset)


# Calculate Pearson residuals
pearson_residuals <- (data$y - lambda_hat) / sqrt(lambda_hat)

# Calculate dispersion statistic
dispersion_statistic <- sum(pearson_residuals^2) / (n - length(estimates))

# Display dispersion statistic
cat("Dispersion Statistic:", dispersion_statistic, "\n")
Dispersion Statistic: 0.9083263 
Code
# If the dispersion statistic is much greater than 1, overdispersion is present
if (dispersion_statistic > 1.2) {
  cat("Warning: Evidence of overdispersion\n")
} else {
  cat("No evidence of overdispersion\n")
}
No evidence of overdispersion

Poisson Model with R

In this exercise we will develop a Poisson regression model with an offset in R using the built-in glm() function, to explain the variability in the diagnosed diabetes rate per county in the USA.

Check and Install Required R packages

The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:

Code
#| warning: false
#| error: false

packages <- c('tidyverse',
              'plyr',
              'DataExplorer',
              'dlookr',
              'rstatix',
              'gtsummary',
              'performance',
              'jtools',
              'margins',
              'marginaleffects',
              'ggeffects',
              'patchwork',
              'Metrics',
              'ggpmisc',
              'epiDisplay',
              'sandwich')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R-packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:sandwich"        "package:epiDisplay"     
 [3] "package:nnet"            "package:MASS"           
 [5] "package:survival"        "package:foreign"        
 [7] "package:ggpmisc"         "package:ggpp"           
 [9] "package:Metrics"         "package:patchwork"      
[11] "package:ggeffects"       "package:marginaleffects"
[13] "package:margins"         "package:jtools"         
[15] "package:performance"     "package:gtsummary"      
[17] "package:rstatix"         "package:dlookr"         
[19] "package:DataExplorer"    "package:plyr"           
[21] "package:lubridate"       "package:forcats"        
[23] "package:stringr"         "package:dplyr"          
[25] "package:purrr"           "package:readr"          
[27] "package:tidyr"           "package:tibble"         
[29] "package:ggplot2"         "package:tidyverse"      
[31] "package:stats"           "package:graphics"       
[33] "package:grDevices"       "package:utils"          
[35] "package:datasets"        "package:methods"        
[37] "package:base"           

Data

County-level age-adjusted numbers and rates of diagnosed diabetes, together with the prevalence of obesity, physical inactivity, and the food environment index for the years 2016-2020, were obtained from the United States Diabetes Surveillance System (USDSS).

Diagnosed Diabetes

Full data set is available for download from my Dropbox or from my Github accounts.

The dataset contains five-year averages (2016-2020) of the following variables:

  1. Diabetes_count - Diabetes number per county (Diabetes Surveillance System (USDSS))

  2. Diabetes_per - Diabetes rate per county (Diabetes Surveillance System (USDSS))

  3. Urban_Rural - Urban Rural County (USDA)

  4. POP_Total - Total population per county (US Census)

  5. Obesity - % obesity per county (Behavioral Risk Factor Surveillance System)

  6. Physical_Inactivity - % physically inactive adults per county (County Health Ranking)

  7. Access_Excercise - % adults with adequate access to exercise opportunities (County Health Ranking)

  8. SVI - Level of social vulnerability in the county relative to other counties in the nation or within the state. Social vulnerability refers to the potential negative effects on communities caused by external stresses on human health. The CDC/ATSDR Social Vulnerability Index (SVI) ranks all US counties on 15 social factors, including poverty, lack of vehicle access, and crowded housing, and groups them into four related themes. (CDC/ATSDR Social Vulnerability Index (SVI))

  9. Food_Env_Index - Measure of access to healthy food. The Food Environment Index ranges from 0 (worst) to 10 (best) and equally weights two indicators: 1) limited access to healthy foods, based on the distance an individual lives from a grocery store or supermarket, the locations for healthy food purchases in most communities; and 2) food insecurity, defined as the inability to access healthy food because of cost barriers. (County Health Ranking)

We will use the read_csv() function of the {readr} package to import the data as a tidy data frame.

Code
# load data
mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/county_data_2016_2020.csv") 
Rows: 3107 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): State, County, Urban_Rural
dbl (12): FIPS, X, Y, POP_Total, Diabetes_count, Diabetes_per, Obesity, Acce...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# select variables
df<-mf |> 
  dplyr::select(Diabetes_per,
                POP_Total,
                Obesity,
                Physical_Inactivity, 
                Access_Excercise,
                Food_Env_Index,
                SVI,
                Urban_Rural 
                ) |> 
  glimpse()
Rows: 3,107
Columns: 8
$ Diabetes_per        <dbl> 9.24, 8.48, 11.72, 10.08, 10.26, 9.06, 11.80, 13.2…
$ POP_Total           <dbl> 55707.0, 218346.8, 25078.2, 22448.2, 57852.4, 1017…
$ Obesity             <dbl> 29.22, 28.94, 29.34, 29.44, 30.10, 19.86, 30.38, 3…
$ Physical_Inactivity <dbl> 26.42, 22.86, 23.72, 25.38, 24.76, 18.58, 28.66, 2…
$ Access_Excercise    <dbl> 70.8, 72.2, 49.8, 30.6, 24.6, 19.6, 48.0, 51.4, 62…
$ Food_Env_Index      <dbl> 6.9, 7.7, 5.5, 7.6, 8.1, 4.3, 6.5, 6.3, 6.4, 7.7, …
$ SVI                 <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.8310, 0.…
$ Urban_Rural         <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Rura…
Code
# data processing
df$Diabetes_per<-as.integer(df$Diabetes_per)  
df$Urban_Rural<-as.factor(df$Urban_Rural)

Data Description

The {epiDisplay} package can provide both numerical and categorical statistics simultaneously with its codebook() function. It’s a great tool for descriptive statistics.

Code
epiDisplay::codebook(df[,1:8])

 
 
Diabetes_per     :    

No. of observations = 3107

  Var. name    obs. mean   median  s.d.   min.   max.  
1 Diabetes_per 3107 8.1    8       1.6    3      16    

 ================== 
POP_Total    :    

No. of observations = 3107

  Var. name obs. mean      median  s.d.      min.   max.      
1 POP_Total 3107 104466.38 26060   334048.05 149.8  10077818.6

 ================== 
Obesity      :    

No. of observations = 3107

  Var. name obs. mean   median  s.d.   min.   max.  
1 Obesity   3107 27.59  27.86   4.8    12.06  42.12 

 ================== 
Physical_Inactivity      :    

No. of observations = 3107

  Var. name           obs. mean   median  s.d.   min.   max.  
1 Physical_Inactivity 3107 21.16  20.84   4.3    9.64   36.72 

 ================== 
Access_Excercise     :    

No. of observations = 3107

  Var. name        obs. mean   median  s.d.   min.   max.  
1 Access_Excercise 3107 61.98  64      21.78  0      100   

 ================== 
Food_Env_Index   :    

No. of observations = 3107

  Var. name      obs. mean   median  s.d.   min.   max.  
1 Food_Env_Index 3107 7.32   7.5     1.09   1.6    10    

 ================== 
SVI      :    

No. of observations = 3107

  Var. name obs. mean   median  s.d.   min.   max.  
1 SVI       3107 0.5    0.5     0.29   0      1     

 ================== 
Urban_Rural      :    

No. of observations = 3107

  Var. name   obs. mean   median  s.d.   min.   max.  
1 Urban_Rural                                         

 ================== 

Density Plot

Code
ggplot(df, aes(Diabetes_per)) +
  geom_density()+
  # x-axis title
  xlab("Diabetes rate per county") + 
  # y-axis title
  ylab("Density")+
  # plot title
  ggtitle("Kernel Density of Diabetes Rate")+
    theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5))

Descriptive Statistics

Code
# Standard error
SE <- function(x){
  sd(x)/sqrt(length(x))
}

# Get summary statistics
summarise_diabetes <- plyr::ddply(df, ~Urban_Rural, summarise, 
                Mean   = round(mean(Diabetes_per), 2),
                Median = round(median(Diabetes_per), 2),
                Min    = round(min(Diabetes_per), 2), 
                Max    = round(max(Diabetes_per), 2), 
                SD     = round(sd(Diabetes_per), 2), 
                SE     = round(SE(Diabetes_per), 3))
# Load library
library(flextable)

Attaching package: 'flextable'
The following object is masked from 'package:jtools':

    theme_apa
The following object is masked from 'package:gtsummary':

    continuous_summary
The following object is masked from 'package:purrr':

    compose
Code
# Create a table
flextable::flextable(summarise_diabetes, theme_fun = theme_booktabs)

Urban_Rural   Mean   Median   Min   Max   SD     SE
Rural         7.75   7        4     16    1.53   0.042
Urban         8.35   8        3     15    1.61   0.038

Barplot - Urban vs Rural

Code
ggplot(summarise_diabetes, aes(x=Urban_Rural, y=Mean)) + 
  geom_bar(stat="identity", position=position_dodge(),width=0.5, fill="gray") +
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=.2,
   position=position_dodge(.9))+
  # add y-axis title and x-axis title leave blank
  labs(y="% Diabetes per county", x = "")+ 
  # add plot title
  ggtitle("Mean ± SE of Diabetes Rate")+
  coord_flip()+
  # customize plot themes
  theme(
        axis.line = element_line(colour = "gray"),
        # plot title position at center
        plot.title = element_text(hjust = 0.5),
        # axis title font size
        axis.title.x = element_text(size = 14), 
        # X and  axis font size
        axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5, colour='black'),
        axis.text.x = element_text(size=12))

Correlation Plot

Code
df |> 
  # select variables
  dplyr::select (Diabetes_per,
                Obesity,
                Physical_Inactivity, 
                Access_Excercise,
                Food_Env_Index,
                SVI
                ) |>  
  dlookr::correlate() |>
  plot()

Let’s explore the correlations separately for rural and urban counties:

Code
df |> 
  group_by(Urban_Rural) |> 
  # select variables
  dplyr::select (Diabetes_per,
                Obesity,
                Physical_Inactivity, 
                Access_Excercise,
                Food_Env_Index,
                SVI
                ) |>  
  dlookr::correlate() |>
  plot()

Split Data

We will use the ddply() function of the {plyr} package to split the diabetes data into training and test sets using stratified random sampling. This method involves dividing the population into strata and taking random samples from each stratum so that each subgroup is proportionally represented in the sample. The goal is to obtain a representative sample of the population by adequately representing each stratum.

Code
seeds = 11076
tr_prop = 0.70

df$log_POP_Total<-log(df$POP_Total)
# training data (70% data)
train= ddply(df,.(Urban_Rural),
                 function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test = ddply(df, .(Urban_Rural),
            function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
Code
# Density plot all, train and test data 
ggplot()+
  geom_density(data = df, aes(Diabetes_per))+
  geom_density(data = train, aes(Diabetes_per), color = "green")+
  geom_density(data = test, aes(Diabetes_per ), color = "red") +
      xlab("% Diabetes") + 
     ylab("Density") 

Fit a Poisson Model with Offset

We will fit a Poisson regression model using the glm() function in R, specifying family = poisson(link = "log"). Here we model the diabetes rate per county, so the offset variable, the log of the total population per county (log_POP_Total), must be defined in the model. This offset adjusts for the differing population sizes across counties.

Code
fit.pois <- glm(
            Diabetes_per  ~ 
                Obesity +
                Physical_Inactivity +
                Access_Excercise +
                Food_Env_Index +
                SVI +
                Urban_Rural+
                offset(log_POP_Total),
                family = poisson(link = "log"),
                data = train)

Model Summary

The summary() function produces a summary of the results of a model-fitting function.

Code
summary(fit.pois)

Call:
glm(formula = Diabetes_per ~ Obesity + Physical_Inactivity + 
    Access_Excercise + Food_Env_Index + SVI + Urban_Rural + offset(log_POP_Total), 
    family = poisson(link = "log"), data = train)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.1137977  0.0882815 -23.944  < 2e-16 ***
Obesity             -0.0245887  0.0025489  -9.647  < 2e-16 ***
Physical_Inactivity  0.0231530  0.0028876   8.018 1.07e-15 ***
Access_Excercise    -0.0329137  0.0004251 -77.431  < 2e-16 ***
Food_Env_Index      -0.3835297  0.0093947 -40.824  < 2e-16 ***
SVI                 -1.6921128  0.0349250 -48.450  < 2e-16 ***
Urban_RuralUrban    -1.0635240  0.0199055 -53.429  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 42667  on 2172  degrees of freedom
Residual deviance: 13198  on 2166  degrees of freedom
AIC: 21756

Number of Fisher Scoring iterations: 5

The report() function of the {report} package generates a brief report of the fitted model:

Code
report::report(fit.pois)
We fitted a poisson model (estimated using ML) to predict Diabetes_per with
Obesity, Physical_Inactivity, Access_Excercise, Food_Env_Index, SVI,
Urban_Rural and log_POP_Total (formula: Diabetes_per ~ Obesity +
Physical_Inactivity + Access_Excercise + Food_Env_Index + SVI + Urban_Rural +
offset(log_POP_Total)). The model's explanatory power is substantial
(Nagelkerke's R2 = 1.00). The model's intercept, corresponding to Obesity = 0,
Physical_Inactivity = 0, Access_Excercise = 0, Food_Env_Index = 0, SVI = 0,
Urban_Rural = Rural and log_POP_Total = 0, is at -2.11 (95% CI [-2.29, -1.94],
p < .001). Within this model:

  - The effect of Obesity is statistically significant and negative (beta =
-0.02, 95% CI [-0.03, -0.02], p < .001; Std. beta = -0.08, 95% CI [-0.10,
-0.05])
  - The effect of Physical Inactivity is statistically significant and positive
(beta = 0.02, 95% CI [0.02, 0.03], p < .001; Std. beta = 0.09, 95% CI [0.06,
0.11])
  - The effect of Access Excercise is statistically significant and negative
(beta = -0.03, 95% CI [-0.03, -0.03], p < .001; Std. beta = -0.46, 95% CI
[-0.47, -0.44])
  - The effect of Food Env Index is statistically significant and negative (beta
= -0.38, 95% CI [-0.40, -0.37], p < .001; Std. beta = -0.28, 95% CI [-0.30,
-0.26])
  - The effect of SVI is statistically significant and negative (beta = -1.69,
95% CI [-1.76, -1.62], p < .001; Std. beta = -0.29, 95% CI [-0.31, -0.27])
  - The effect of Urban Rural [Urban] is statistically significant and negative
(beta = -1.06, 95% CI [-1.10, -1.02], p < .001; Std. beta = -0.72, 95% CI
[-0.75, -0.68])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald z-distribution approximation.

The {jtools} package provides a series of functions to summarize the Poisson regression model:

Code
jtools::summ(fit.pois)
Observations 2173
Dependent variable Diabetes_per
Type Generalized linear model
Family poisson
Link log
𝛘²(6) 29468.92
p 0.00
Pseudo-R² (Cragg-Uhler) 1.00
Pseudo-R² (McFadden) 0.58
AIC 21756.41
BIC 21796.20
Est. S.E. z val. p
(Intercept) -2.11 0.09 -23.94 0.00
Obesity -0.02 0.00 -9.65 0.00
Physical_Inactivity 0.02 0.00 8.02 0.00
Access_Excercise -0.03 0.00 -77.43 0.00
Food_Env_Index -0.38 0.01 -40.82 0.00
SVI -1.69 0.03 -48.45 0.00
Urban_RuralUrban -1.06 0.02 -53.43 0.00
Standard errors: MLE

We utilized the R package {sandwich} to obtain the robust standard errors and subsequently calculated the p-values. Additionally, we computed the 95% confidence interval using the parameter estimates and their robust standard errors.

Code
library(sandwich)
cov.m1 <- vcovHC(fit.pois, type="HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate= coef(fit.pois), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(fit.pois)/std.err), lower.tail=FALSE),
LL = coef(fit.pois) - 1.96 * std.err,
UL = coef(fit.pois) + 1.96 * std.err)

r.est
                       Estimate   Robust SE      Pr(>|z|)           LL
(Intercept)         -2.11379767 0.234805265  2.209498e-19 -2.574015991
Obesity             -0.02458873 0.008547439  4.018105e-03 -0.041341712
Physical_Inactivity  0.02315297 0.008852013  8.908130e-03  0.005803024
Access_Excercise    -0.03291368 0.001185037 8.845286e-170 -0.035236351
Food_Env_Index      -0.38352972 0.027997939  1.036458e-42 -0.438405677
SVI                 -1.69211275 0.120184605  5.092571e-45 -1.927674577
Urban_RuralUrban    -1.06352403 0.042747135 1.242722e-136 -1.147308415
                              UL
(Intercept)         -1.653579352
Obesity             -0.007835752
Physical_Inactivity  0.040502913
Access_Excercise    -0.030591006
Food_Env_Index      -0.328653756
SVI                 -1.456550926
Urban_RuralUrban    -0.979739645

The goodness of fit of the Poisson model can be assessed with the poisgof() function of the {epiDisplay} package:

Code
epiDisplay::poisgof(fit.pois)
$results
[1] "Goodness-of-fit test for Poisson assumption"

$chisq
[1] 13198

$df
[1] 2166

$p.value
[1] 0

Model Performance

Code
performance::performance(fit.pois)
# Indices of model performance

AIC       |      AICc |       BIC | Nagelkerke's R2 |  RMSE | Sigma | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------
21756.409 | 21756.461 | 21796.196 |           1.000 | 8.610 | 1.000 |    -5.003 |           0.015
Note

Nagelkerke’s \(R^2\), also known as the Nagelkerke pseudo-\(R^2\), measures the proportion of variance explained by a generalized linear model, such as logistic or, as here, Poisson regression. It adapts Cox and Snell’s \(R^2\) to overcome that measure’s limitation of having a maximum value less than 1, so Nagelkerke’s \(R^2\) ranges from 0 to 1 and provides a measure of the overall fit of the model.

Mathematically, Nagelkerke’s \(R^2\) is defined as:

\[ R^2_{\text{Nagelkerke}} = \frac{1 - \left(\dfrac{L_{\text{null}}}{L_{\text{model}}}\right)^{2/n}}{1 - L_{\text{null}}^{\,2/n}} \]

where:

  • \(L_{\text{model}}\) is the likelihood of the fitted model.

  • \(L_{\text{null}}\) is the likelihood of the null model (a model with only the intercept term).

  • \(n\) is the total number of observations in the dataset.

Nagelkerke’s \(R^2\) is a useful measure of goodness of fit, but it should be interpreted with caution, especially when the model has categorical predictors or interactions. Additionally, like other \(R^2\) measures, it does not indicate the quality of the predictions made by the model.
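This pseudo-\(R^2\) can be computed by hand from the log-likelihoods of a fitted glm() and its intercept-only counterpart. A minimal, self-contained sketch (toy Poisson data; names are illustrative):

```r
# Nagelkerke's pseudo-R^2 from log-likelihoods (toy Poisson example)
set.seed(1)
d   <- data.frame(x = rnorm(200))
d$y <- rpois(200, exp(0.3 + 0.6 * d$x))

fit  <- glm(y ~ x, family = poisson, data = d)
fit0 <- update(fit, . ~ 1)        # intercept-only (null) model
n    <- nobs(fit)

# Cox-Snell R^2, then rescale by its maximum attainable value
r2_coxsnell   <- 1 - exp((2 / n) * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit))))
r2_nagelkerke <- r2_coxsnell / (1 - exp((2 / n) * as.numeric(logLik(fit0))))
r2_nagelkerke                     # bounded between 0 and 1
```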

Model Diagnostics

The {performance} package provides many functions to check model assumptions, such as check_overdispersion() and check_zeroinflation().

Check for Overdispersion

Overdispersion occurs when the observed variance in the data is higher than the expected variance from the model assumption (for Poisson, variance roughly equals the mean of an outcome). check_overdispersion() checks if a count model (including mixed models) is overdispersed or not.

Code
performance::check_overdispersion(fit.pois)
# Overdispersion test

       dispersion ratio =     9.842
  Pearson's Chi-Squared = 21318.216
                p-value =   < 0.001
Overdispersion detected.

Overdispersion can be addressed either by modelling the dispersion parameter (not possible with all packages) or by choosing a different distributional family, such as quasi-Poisson or negative binomial (see Gelman and Hill 2007).
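As a hedged sketch of those two remedies (on simulated overdispersed counts, not the county data), quasi-Poisson keeps the same mean model but estimates the dispersion instead of fixing it at 1, and the negative binomial from {MASS} adds a dispersion parameter theta:

```r
library(MASS)  # for glm.nb()

# Simulate overdispersed counts (negative binomial, so variance > mean)
set.seed(7)
n    <- 400
x    <- rnorm(n)
expo <- runif(n, 1, 5)
mu   <- expo * exp(0.3 + 0.4 * x)
y    <- rnbinom(n, mu = mu, size = 2)

# Quasi-Poisson: same mean model, dispersion estimated rather than fixed at 1
fit_qp <- glm(y ~ x + offset(log(expo)), family = quasipoisson)
summary(fit_qp)$dispersion   # well above 1 for these data

# Negative binomial: variance = mu + mu^2 / theta
fit_nb <- glm.nb(y ~ x + offset(log(expo)))
fit_nb$theta                 # estimated dispersion parameter (true size = 2)
```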

Check for Zero-inflation

Zero-inflation is indicated when the number of observed zeros is larger than the number of predicted zeros, meaning the model underfits zeros. In such cases, it is recommended to use a negative binomial or zero-inflated model.

Use check_zeroinflation() to check if zero-inflation is present in the fitted model.

Code
performance::check_zeroinflation(fit.pois)
Model has no observed zeros in the response variable.
NULL

Check for Singular Model Fits

A “singular” model fit means that some dimensions of the variance-covariance matrix have been estimated as exactly zero. This often occurs for mixed models with overly complex random effects structures.

check_singularity() checks mixed models (of class lme, merMod, glmmTMB or MixMod) for singularity, and returns TRUE if the model fit is singular.

Code
check_singularity(fit.pois)
[1] FALSE

Visualization of Model Assumption

To get a comprehensive check and visualization, use check_model().

Code
performance::check_model(fit.pois)

Incidence Rate Ratio (IRR)

The Incidence Rate Ratio (IRR) is a measure commonly used in epidemiology and other fields to quantify the association between an exposure or predictor variable and an outcome, particularly when dealing with count data. It is often used in the context of Poisson regression models.

In Poisson regression, the exponentiated coefficients (i.e., exponentiated regression coefficients) are interpreted as Incidence Rate Ratios. Specifically, for a given predictor variable, the IRR represents the multiplicative change in the rate of the outcome for each unit change in the predictor variable.

Mathematically, if \(\beta\) is the coefficient estimate of a predictor variable in a Poisson regression model, then the corresponding IRR, denoted \(\text{IRR}\), is calculated as:

\[ \text{IRR} = e^{\beta} \]

where \(e\) is the base of the natural logarithm (approximately equal to 2.718).

Interpretation of the IRR:

  • If \(\text{IRR} = 1\), it implies that there is no association between the predictor variable and the outcome.

  • If \(\text{IRR} > 1\), it indicates that an increase in the predictor variable is associated with an increased incidence rate (or risk) of the outcome.

  • If \(\text{IRR} < 1\), it suggests that an increase in the predictor variable is associated with a decreased incidence rate (or risk) of the outcome.

For example, if the IRR associated with a particular exposure is 1.5, it means that the incidence rate of the outcome is 1.5 times higher in the exposed group compared to the unexposed group, all else being equal.

The IRR provides a convenient way to quantify and interpret the strength of association between predictor variables and outcomes in Poisson regression models, particularly when dealing with count data and incidence rates.
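The IRRs and their Wald confidence intervals can be computed by hand by exponentiating the coefficients and the interval endpoints. A minimal self-contained sketch on toy data (all names are illustrative):

```r
# IRR = exp(beta); Wald 95% CI exponentiated to the IRR scale (toy data)
set.seed(3)
d   <- data.frame(x = rnorm(300), e = runif(300, 1, 4))
d$y <- rpois(300, d$e * exp(0.1 + 0.4 * d$x))

fit <- glm(y ~ x + offset(log(e)), family = poisson, data = d)

irr <- exp(coef(fit))             # incidence rate ratios
ci  <- exp(confint.default(fit))  # Wald interval, exponentiated
cbind(IRR = irr, ci)
```

Since the true coefficient of x is 0.4, its IRR is near \(e^{0.4} \approx 1.49\): each one-unit increase in x multiplies the rate by about 1.5.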

Code
model_01_IRR = tidy(fit.pois, exponentiate = TRUE, 
                       conf.int = TRUE)
model_01_IRR 
# A tibble: 7 × 7
  term                estimate std.error statistic   p.value conf.low conf.high
  <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)            0.121  0.0883      -23.9  1.07e-126    0.102     0.144
2 Obesity                0.976  0.00255      -9.65 5.07e- 22    0.971     0.981
3 Physical_Inactivity    1.02   0.00289       8.02 1.07e- 15    1.02      1.03 
4 Access_Excercise       0.968  0.000425    -77.4  0            0.967     0.968
5 Food_Env_Index         0.681  0.00939     -40.8  0            0.669     0.694
6 SVI                    0.184  0.0349      -48.4  0            0.172     0.197
7 Urban_RuralUrban       0.345  0.0199      -53.4  0            0.332     0.359

The tbl_regression() function from the {gtsummary} package takes a regression model object as input and produces a formatted table with IRRs and confidence intervals.

Code
gtsummary::tbl_regression(fit.pois, exponentiate = TRUE)
Characteristic         IRR    95% CI        p-value
Obesity                0.98   0.97, 0.98    <0.001
Physical_Inactivity    1.02   1.02, 1.03    <0.001
Access_Excercise       0.97   0.97, 0.97    <0.001
Food_Env_Index         0.68   0.67, 0.69    <0.001
SVI                    0.18   0.17, 0.20    <0.001
Urban_Rural
    Rural              (reference)
    Urban              0.35   0.33, 0.36    <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Based on this table, we may interpret the results as follows:

  • Urban counties have a lower diabetes rate than rural counties, with an IRR of 0.35 (95% CI: 0.33, 0.36), while controlling for the effects of the other variables.

  • A one-unit increase in Obesity multiplies the expected diabetes rate by 0.98 (95% CI: 0.97, 0.98), i.e., roughly a 2% decrease, while controlling for the effects of the other variables.

Marginal Effects and Adjusted Predictions

{ggeffects} supports labelled data, and its plot() method automatically sets titles, axis labels, and legend labels based on the value and variable labels of the data.

Code
res <- predict_response(fit.pois, terms = c("Obesity", "Urban_Rural"), condition = "log_POP_Total")
plot(res, facets = TRUE)

The effect_plot() function from the {jtools} package plots simple effects in Poisson regression models:

Code
p1 <- jtools::effect_plot(fit.pois,
                          main.title = "Obesity",
                          pred = Obesity,
                          interval = TRUE,
                          outcome.scale = "link",
                          partial.residuals = FALSE)
p2 <- jtools::effect_plot(fit.pois,
                          main.title = "Physically inactive adults",
                          pred = Physical_Inactivity,
                          interval = TRUE,
                          outcome.scale = "link",
                          partial.residuals = FALSE)
p3 <- jtools::effect_plot(fit.pois,
                          main.title = "Access to Exercise",
                          pred = Access_Excercise,
                          interval = TRUE,
                          outcome.scale = "link",
                          partial.residuals = FALSE)
p4 <- jtools::effect_plot(fit.pois,
                          main.title = "Food Env Index",
                          pred = Food_Env_Index,
                          interval = TRUE,
                          outcome.scale = "link",
                          partial.residuals = FALSE)
library(patchwork)
(p1 + p2) / (p3 + p4)

Prediction Performance

The predict() function will be used to predict the number of diabetes patients in the test counties. This helps validate the predictive accuracy of the regression model.

Code
test$Pred.diabetes<-predict(fit.pois, test, type = "response")
Metrics::rmse(test$Diabetes_per, test$Pred.diabetes)
[1] 13.34734
Code
Metrics::mae(test$Diabetes_per, test$Pred.diabetes)
[1] 5.774231
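As a sanity check, these two metrics can also be computed by hand from their definitions; the sketch below assumes the same observed and predicted vectors created above and should reproduce the Metrics:: results:

```r
# Hand-rolled RMSE and MAE, equivalent to the Metrics:: calls above.
obs  <- test$Diabetes_per
pred <- test$Pred.diabetes
sqrt(mean((obs - pred)^2))  # RMSE: root of the mean squared error
mean(abs(obs - pred))       # MAE: mean absolute error
```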

Summary and Conclusion

The tutorial explained how to fit a Poisson regression model with offset variables in R, interpret the results, and assess model performance through residual analysis, goodness-of-fit statistics, and predictive accuracy measures like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

Poisson regression with offset variables is an essential extension when modeling rates instead of raw counts. The offset adjusts the model for different levels of exposure across observations, ensuring that the model correctly reflects the rate of events. This technique is particularly useful in applications such as epidemiology, transportation, and insurance, where exposure varies across units.

R provides a straightforward approach to fitting and assessing Poisson regression models with offsets using the glm() function. Additionally, various diagnostic tools—such as residual plots, goodness-of-fit tests, and predictive performance measures—can help evaluate how well the model fits the data. When overdispersion is detected, alternatives like Negative Binomial regression should be considered. This tutorial highlighted both the theoretical and practical aspects of applying Poisson regression with offset variables, offering a comprehensive framework for handling count data with varying exposures in R.
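If overdispersion were detected, a negative binomial model could be fitted with the same offset. The sketch below assumes a training set named `train` and a formula mirroring fit.pois; adjust both to match how the Poisson model was actually specified earlier in the tutorial:

```r
# Dispersion test on the Poisson fit ({performance} is already attached).
performance::check_overdispersion(fit.pois)

# Sketch only: refit as negative binomial with the log-population offset
# carried over. Predictor names and `train` are assumed from earlier.
fit.nb <- MASS::glm.nb(
  Diabetes_per ~ Obesity + Physical_Inactivity + Access_Excercise +
    Food_Env_Index + SVI + Urban_Rural + offset(log_POP_Total),
  data = train
)
summary(fit.nb)
```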

Session Info

Code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] flextable_0.9.7        sandwich_3.1-1         epiDisplay_3.5.0.2    
 [4] nnet_7.3-20            MASS_7.3-64            survival_3.8-3        
 [7] foreign_0.8-88         ggpmisc_0.6.1          ggpp_0.5.8-1          
[10] Metrics_0.1.4          patchwork_1.3.0        ggeffects_2.2.0       
[13] marginaleffects_0.25.0 margins_0.3.28         jtools_2.3.0          
[16] performance_0.13.0     gtsummary_2.1.0        rstatix_0.7.2         
[19] dlookr_0.6.3           DataExplorer_0.8.3     plyr_1.8.9            
[22] lubridate_1.9.4        forcats_1.0.0          stringr_1.5.1         
[25] dplyr_1.1.4            purrr_1.0.4            readr_2.1.5           
[28] tidyr_1.3.1            tibble_3.2.1           ggplot2_3.5.1         
[31] tidyverse_2.0.0       

loaded via a namespace (and not attached):
  [1] splines_4.4.2           later_1.4.1             datawizard_1.0.2       
  [4] pagedown_0.22           lifecycle_1.0.4         globals_0.16.3         
  [7] lattice_0.22-5          vroom_1.6.5             insight_1.1.0          
 [10] backports_1.5.0         magrittr_2.0.3          sass_0.4.9             
 [13] rmarkdown_2.29          yaml_2.3.10             httpuv_1.6.15          
 [16] zip_2.3.2               askpass_1.2.1           multcomp_1.4-28        
 [19] abind_1.4-8             TH.data_1.1-3           prediction_0.3.18      
 [22] gdtools_0.4.1           labelled_2.14.0         reactable_0.4.4        
 [25] ggrepel_0.9.6           listenv_0.9.1           cards_0.5.0.9000       
 [28] MatrixModels_0.5-3      parallelly_1.42.0       commonmark_1.9.2       
 [31] svglite_2.1.3           codetools_0.2-20        xml2_1.3.6             
 [34] tidyselect_1.2.1        farver_2.1.2            broom.mixed_0.2.9.6    
 [37] effectsize_1.0.0        base64enc_0.1-3         broom.helpers_1.20.0   
 [40] showtext_0.9-7          jsonlite_1.9.0          Formula_1.2-5          
 [43] emmeans_1.11.0          systemfonts_1.2.1       tools_4.4.2            
 [46] ragg_1.3.3              hrbrthemes_0.8.7        Rcpp_1.0.14            
 [49] glue_1.8.0              gridExtra_2.3           Rttf2pt1_1.3.12        
 [52] xfun_0.51               mgcv_1.9-1              withr_3.0.2            
 [55] fastmap_1.2.0           SparseM_1.84-2          openssl_2.3.2          
 [58] digest_0.6.37           timechange_0.3.0        R6_2.6.1               
 [61] mime_0.12               estimability_1.5.1      textshaping_1.0.0      
 [64] colorspace_2.1-1        networkD3_0.4           markdown_1.13          
 [67] see_0.11.0              utf8_1.2.4              generics_0.1.3         
 [70] fontLiberation_0.1.0    data.table_1.17.0       report_0.6.1           
 [73] htmlwidgets_1.6.4       parameters_0.24.2       pkgconfig_2.0.3        
 [76] gtable_0.3.6            furrr_0.3.1             htmltools_0.5.8.1      
 [79] fontBitstreamVera_0.1.1 carData_3.0-5           sysfonts_0.8.9         
 [82] scales_1.3.0            kableExtra_1.4.0        knitr_1.49             
 [85] rstudioapi_0.17.1       tzdb_0.4.0              uuid_1.2-1             
 [88] coda_0.19-4.1           nlme_3.1-166            curl_6.2.1             
 [91] showtextdb_3.0          zoo_1.8-13              sjlabelled_1.2.0       
 [94] parallel_4.4.2          extrafont_0.19          pillar_1.10.1          
 [97] grid_4.4.2              vctrs_0.6.5             promises_1.3.2         
[100] car_3.1-3               xtable_1.8-4            extrafontdb_1.0        
[103] evaluate_1.0.3          mvtnorm_1.3-3           cli_3.6.4              
[106] compiler_4.4.2          rlang_1.1.5             crayon_1.5.3           
[109] labeling_0.4.3          stringi_1.8.4           pander_0.6.5           
[112] viridisLite_0.4.2       munsell_0.5.1           bayestestR_0.15.2      
[115] quantreg_6.1            fontquiver_0.2.1        Matrix_1.7-1           
[118] hms_1.1.3               bit64_4.6.0-1           future_1.34.0          
[121] shiny_1.10.0            haven_2.5.4             gt_0.11.1              
[124] igraph_2.1.4            broom_1.0.7             bit_4.5.0.1            
[127] officer_0.6.7           polynom_1.4-1