Generalized Linear Models (GLMs) are a versatile class of models that extend linear regression to handle a variety of response variable distributions and relationships. In R, GLMs are implemented through the glm() function and related packages, providing a powerful framework for analyzing both continuous and categorical data across a wide range of contexts. This tutorial introduces several types of GLMs, as well as related models, and demonstrates how to implement each in R.
The Generalized Linear Model (GLM) is a sophisticated extension of linear regression designed to model relationships between a dependent variable and independent variables when the underlying assumptions of linear regression are unmet. The GLM was first introduced by Sir John Nelder and Robert Wedderburn, both acclaimed statisticians, in 1972.
The GLM is an essential tool in modern data analysis, as it can model a wide range of data types that do not conform to the assumptions of traditional linear regression. It accommodates non-normal response distributions and, through the link function, non-linear relationships between the predictors and the mean of the response. Model parameters are estimated by maximum likelihood estimation (MLE), which provides a single, coherent framework for fitting and inference across all of these distributions.
This flexibility makes the GLM a powerful tool that is integral to modern data analysis and a valuable asset in both business and academia.
Note
Maximum Likelihood Estimation (MLE) is a statistical technique for estimating the parameters of a model from observed data. It works by finding the parameter values that maximize the likelihood function, which measures how well the model explains the observed data: the higher the likelihood, the better the model accounts for what was observed. MLE is widely used in fields such as finance, economics, and engineering to build models that predict future outcomes from available data.
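To make this concrete, here is a minimal sketch of MLE in R, estimating the rate parameter of a Poisson model by directly maximizing the log-likelihood with optimize(). The data are simulated and the object names (such as y_counts) are purely illustrative; glm() carries out an analogous optimization internally.
# Minimal MLE sketch: estimate a Poisson rate from simulated counts
# (y_counts and neg_loglik are illustrative names, not from a real dataset)
set.seed(123)
y_counts <- rpois(200, lambda = 3.5)                      # simulated count data
neg_loglik <- function(lambda) -sum(dpois(y_counts, lambda, log = TRUE))
fit <- optimize(neg_loglik, interval = c(0.01, 20))       # minimize the negative log-likelihood
fit$minimum                                               # MLE of lambda
mean(y_counts)                                            # equals the MLE for the Poisson model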
Key features of Generalized Linear Models
Link Function: GLMs are characterized by a link function that connects the linear predictor, a combination of independent variables, to the mean of the dependent variable. This connection enables the estimation of the relationship between independent and dependent variables in a non-linear fashion.
The selection of a link function in GLMs is contingent upon the nature of the data and the distribution of the response variable. The identity link function is utilized when the continuous response variable follows a normal distribution. The logit link function is employed when the response variable is binary, meaning it can only take on two values and follows a binomial distribution. The log link function is utilized when the response variable is count data and follows a Poisson distribution.
Choosing an appropriate link function is a crucial aspect of modeling, as it impacts the interpretation of the estimated coefficients for independent variables. Therefore, a thorough understanding of the nature of the data and the response variable’s distribution is necessary when selecting a link function.
Distribution Family: Unlike linear regression, which assumes a normal distribution for the residuals, GLMs allow for a variety of probability distributions for the response variable. The choice of distribution is based on the characteristics of the data. Commonly used distributions include:
Normal distribution (Gaussian): For continuous data.
Binomial distribution: For binary or dichotomous data.
Poisson distribution: For count data.
Gamma distribution: For continuous, positive, skewed data.
Variance Function: GLMs accommodate heteroscedasticity (unequal variances across levels of the independent variables) by allowing the variance of the response variable to be a function of the mean.
Deviance: Instead of using the sum of squared residuals as in linear regression, GLMs use deviance to measure lack of fit. Deviance compares the fit of the model to a saturated model (a model that perfectly fits the data); the short code sketch below shows where the null and residual deviance appear in R output.
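The following sketch ties these features together, specifying a family and link in glm() and extracting the deviance. The data are simulated and the variable names are assumptions made purely for illustration:
# Specify the distribution family and link, then inspect the deviance
# (simulated data; x and y are illustrative names)
set.seed(42)
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(-0.5 + 1.2 * x))   # binary response
fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)                 # reports null deviance and residual deviance
deviance(fit)                # residual deviance: lack of fit vs. the saturated model
fit$null.deviance            # deviance of the intercept-only model
anova(fit, test = "Chisq")   # deviance-based (likelihood ratio) test for adding x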
The mathematical expression of a Generalized Linear Model (GLM) involves the linear predictor, the link function, and the probability distribution of the response variable.
Linear Predictor (η):
\[ \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
where
\(\beta_0, \beta_1, \ldots, \beta_k\) are the coefficients,
\(x_1, x_2, \ldots, x_k\) are the independent variables.
Link Function (g):
\[ g(\mu) = \eta \]
The link function connects the linear predictor to the mean of the response variable. It transforms the mean (μ) to the linear predictor (η). Common link functions include:
Identity link (for normal distribution):
\[ g(\mu) = \mu \]
Logit link (for binary data in logistic regression):
\[ g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) \]
Log link (for Poisson regression):
\[ g(\mu) = \log(\mu) \]
Probability Distribution: The response variable follows a probability distribution from the exponential family. The distribution is chosen based on the nature of the data. Common choices include:
Normal distribution (Gaussian) for continuous data.
Binomial distribution for binary or dichotomous data.
Poisson distribution for count data.
Gamma distribution for continuous, positive, skewed data.
Putting it all together, the probability mass function (PMF) or probability density function (PDF) for the response variable (Y) comes from the exponential family and can be written in the general form:
\[ f(y;\, \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right) \]
where \(\theta\) is the canonical (natural) parameter, \(\phi\) is the dispersion parameter, and \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) are known functions that determine the specific distribution (normal, binomial, Poisson, and so on).
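In R, the link function and its inverse are stored in the family object itself, which makes it easy to check these formulas numerically. A small sketch using only base R (the value 0.3 is an arbitrary example mean):
# The link function g and its inverse are stored in the family object
fam <- binomial(link = "logit")
mu  <- 0.3
fam$linkfun(mu)                    # g(mu) = log(mu / (1 - mu))
log(mu / (1 - mu))                 # same value, computed directly
fam$linkinv(fam$linkfun(mu))       # the inverse link recovers mu
poisson(link = "log")$linkfun(5)   # the Poisson family uses the log link: log(5)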
The primary difference between linear models (LM) and generalized linear models (GLM) is in their flexibility to handle different types of response variables and error distributions. Here’s a breakdown of the key distinctions:
1. Type of Response Variable
LM (Linear Model): Assumes that the response variable is continuous and normally distributed. For example, predicting a continuous variable like height or weight.
GLM (Generalized Linear Model): Extends linear models to accommodate response variables that are not normally distributed, such as binary outcomes (0 or 1), counts, or proportions. GLMs can handle a variety of distributions (e.g., binomial, Poisson).
2. Link Function
LM: The relationship between the predictor variables and the response is assumed to be linear, with an identity link function (i.e., \(Y = X \beta + \epsilon\), where \(\epsilon\) is normally distributed).
GLM: Uses a link function to transform the linear predictor to accommodate different types of response variables. Common link functions include:
Logit link for binary data (logistic regression)
Log link for count data (Poisson regression)
Identity link for normal data (same as in LM)
3. Error Distribution
LM: Assumes errors are normally distributed with constant variance (homoscedasticity).
GLM: Allows for different error distributions (e.g., binomial, Poisson, gamma) to better suit the data.
4. Use Cases
LM: Used when the response variable is continuous, normally distributed, and has a linear relationship with predictors.
GLM: Used when the response variable does not fit these assumptions, such as binary outcomes (yes/no), counts, or proportions.
5. Examples
LM: Simple linear regression, multiple linear regression
GLM: Logistic regression, Poisson regression, negative binomial regression, etc.
In summary, GLMs generalize LMs by allowing for non-normal distributions and providing flexibility with link functions, making them more suitable for a wider range of data types and applications.
Taken together, the GLM combines the linear predictor, link function, and probability distribution to model the relationship between the mean of the response variable and the predictors, allowing for flexibility in handling various data types. The specific form of the GLM depends on the chosen link function and distribution.
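One concrete way to see the connection in R is that a Gaussian GLM with the identity link reproduces an ordinary lm() fit exactly. The sketch below uses simulated data with illustrative variable names:
# A Gaussian GLM with the identity link is equivalent to ordinary linear regression
# (simulated data; x and y are illustrative names)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
coef(fit_lm)    # estimates from the linear model
coef(fit_glm)   # identical estimates from the equivalent GLM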
GLMs in R
Before starting, ensure you have R and the necessary packages installed. Key packages include stats (for basic GLMs) and mgcv (for Generalized Additive Models). You may also need packages such as MASS for ordinal models and betareg for Beta regression.
# Install necessary packages if you haven't already
install.packages(c("MASS", "mgcv", "betareg"))
Family objects are a convenient way to specify the models used by functions like glm(). See help(family) for other allowable link functions for each family.
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
There are several GLM model families depending on the make-up of the response variable.
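For example, a binary (0/1) response is usually modeled with the binomial family and its default logit link, i.e., logistic regression. Below is a minimal sketch on simulated data; the data-generating step and variable names (x1, x2, y) are assumptions made only for illustration:
# Logistic regression: binomial family with the logit link
# (simulated data; x1, x2, and y are illustrative names)
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x1 - 0.5 * x2))
sim_data <- data.frame(y, x1, x2)
model_logit <- glm(y ~ x1 + x2, data = sim_data, family = binomial(link = "logit"))
summary(model_logit)     # coefficients are on the log-odds scale
exp(coef(model_logit))   # exponentiate to obtain odds ratios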
Probit regression is similar to logistic regression but uses the probit link function. It’s useful for modeling binary outcomes when the probit function is a better fit than the logit.
# Example of probit regression
model_probit <- glm(y ~ x1 + x2, data = data, family = binomial(link = "probit"))
summary(model_probit)
For nominal categorical responses with more than two levels, multinomial logistic regression can be used. The nnet package provides multinom for fitting such models.
# Example of multinomial logistic regression
library(nnet)
model_multinom <- multinom(y ~ x1 + x2, data = data)
summary(model_multinom)
GAMs allow for flexible relationships between predictors and the response by using smooth functions. The mgcv package’s gam function is used for GAMs.
# Example of GAM
library(mgcv)
model_gam <- gam(y ~ s(x1) + s(x2), data = data, family = gaussian)
summary(model_gam)
Required R Packages
The following R packages are required for running the code examples in this tutorial:
They are grouped below by primary use, with duplicates removed and a brief description and reference for each package:
Data Wrangling & Visualization
tidyverse (Wickham): Meta-package for data science workflows (includes dplyr, ggplot2, tidyr). Streamlines data manipulation and visualization. tidyverse.org | CRAN
plyr (Wickham): Split-apply-combine workflows (predecessor to dplyr). CRAN
patchwork (Pedersen): Combine multiple ggplot2 plots into unified layouts. CRAN
RColorBrewer (Neuwirth): Color palettes for thematic maps and statistical graphics. CRAN
GGally (Schloerke): Extend ggplot2 with correlation matrices and multivariate plots. CRAN
Exploratory Data Analysis (EDA)
dlookr (Ryu): Automate data quality checks, outlier detection, and EDA reports. CRAN
DataExplorer (Cui): Quickly profile datasets with automatic visualizations. CRAN
Statistical Tests & Diagnostics
rstatix (Kassambara): Tidy-friendly interface for t-tests, ANOVA, and non-parametric tests. CRAN
lmtest (Zeileis & Hothorn): Diagnostic tests for linear models (e.g., Breusch-Pagan). CRAN
generalhoslem (Jay): Goodness-of-fit tests for logistic regression models. CRAN
moments (Komsta & Novomestky): Calculate skewness, kurtosis, and distribution moments. CRAN
Modeling & Regression
MASS (Venables & Ripley): Robust regression, LDA, and negative binomial GLMs. CRAN
This tutorial covered various GLMs, each suited to different types of response data. Use summary() to inspect model results and diagnostics for each type of GLM. These models allow for a range of data structures and distributions, making them a versatile toolset in R for real-world applications.