7. Gamma Regression

Gamma regression is a powerful statistical method for modeling positive, continuous data with a skewed distribution, particularly when variability increases with the mean. It is commonly used for insurance claims, healthcare costs, and rainfall data, where values are non-negative and right-skewed. This tutorial explores Gamma regression in R through two main approaches. First, we will manually build a Gamma regression model to understand its underlying mechanics, including the log-likelihood function and parameter optimization. Then, we will utilize R’s built-in glm() function to simplify the modeling process and demonstrate practical result interpretation. By the end of this tutorial, you will understand how to apply Gamma regression to real-world data, both by coding from scratch and by using R’s built-in functions.

Overview

Gamma regression is used for modeling positive, continuous data where the variance increases with the mean. It assumes that the response variable \(y\) follows a Gamma distribution, often appropriate for modeling non-negative skewed data, such as response times, rainfall, or insurance claims. Here’s a step-by-step explanation:

Understanding the Gamma Distribution

The Gamma distribution is defined by two parameters:

  • Shape parameter (\(k\)): Controls the shape or skewness.

  • Scale parameter (\(\theta\)): Controls the spread.

The probability density function (PDF) of the Gamma distribution for a positive variable \(y\) is:

\[ f(y; k, \theta) = \frac{y^{k - 1} e^{-\frac{y}{\theta}}}{\theta^k \Gamma(k)} \]

where \(\Gamma(k)\) is the Gamma function, and both \(k\) and \(\theta\) are positive.

In Gamma regression, we typically work with the mean \(\mu = k\theta\), rather than the shape and scale parameters directly. So, for a response variable \(y\), we model \(y \sim \text{Gamma}(\mu, \theta)\).
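
Before moving on, a quick simulation check (an aside, with variable names of our own) confirms this parameterization:

Code
# Sanity check: with shape k and scale theta, simulated draws have mean ~ k * theta
set.seed(1)
shape_demo <- 2
scale_demo <- 3
draws <- rgamma(1e5, shape = shape_demo, scale = scale_demo)
mean(draws)  # close to 6 = shape_demo * scale_demo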

Defining the Mean-Variance Relationship

A key property of the Gamma distribution is that:

  • The mean of \(y\) is \(\mu = k\theta\).
  • The variance of \(y\) is \(\text{Var}(y) = k\theta^2 = \mu^2 / k\).

This means the variance grows with the square of the mean, which is helpful for modeling heteroscedastic data where larger values tend to have greater variance.
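
A small simulation (illustrative only; the variable names are ours) makes this concrete: holding the shape fixed, doubling the mean roughly quadruples the variance.

Code
# Illustration: for fixed shape, Var(y) = mu^2 / shape grows with the square of the mean
shape_demo <- 2
mu_demo <- c(5, 10, 20)
empirical_var <- sapply(mu_demo,
                        function(m) var(rgamma(1e5, shape = shape_demo, rate = shape_demo / m)))
round(empirical_var)     # close to the theoretical values below
mu_demo^2 / shape_demo   # theoretical variances: 12.5, 50, 200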

Specifying the Link Function

In Gamma regression, we assume that the mean \(\mu\) of the response variable depends on the predictor variables \(X\) through a link function \(g(\cdot)\):

\[ g(\mu) = X \beta \]

Common choices for \(g(\cdot)\) are:

  • Log link: \(g(\mu) = \log(\mu)\)

  • Inverse link: \(g(\mu) = \frac{1}{\mu}\)

The log link is the most common, ensuring that the predicted mean \(\mu\) is always positive. With the log link, we get:

\[ \log(\mu) = X \beta \quad \Rightarrow \quad \mu = e^{X \beta} \]

Setting Up the Likelihood Function

Given \(y_i\) as an observation from a Gamma distribution with mean \(\mu_i\) (and hence scale \(\theta_i = \mu_i / k\)), we define the likelihood function based on the Gamma PDF. For \(n\) observations, the joint likelihood \(L(\beta)\) is:

\[ L(\beta) = \prod_{i=1}^{n} \frac{y_i^{k - 1} e^{-\frac{y_i}{\theta_i}}}{\theta_i^k \Gamma(k)} \]

In practice, we usually work with the log-likelihood. Substituting \(\theta_i = \mu_i / k\), the log-likelihood for a Gamma-distributed \(y_i\) with mean \(\mu_i\) can be expressed as:

\[ \log L(\beta) = \sum_{i=1}^n \left( (k - 1) \log(y_i) - \frac{k\, y_i}{\mu_i} - k \log\!\left(\frac{\mu_i}{k}\right) - \log \Gamma(k) \right) \]

Estimating the Parameters

To estimate \(\beta\), we maximize the log-likelihood function with respect to \(\beta\) using an iterative optimization algorithm, such as iteratively reweighted least squares (IRLS) in a generalized linear model (GLM) framework.

For example, with the log link function \(\mu_i = e^{X_i \beta}\), we solve for \(\beta\) by maximizing:

\[ \sum_{i=1}^n \left( (k - 1) \log(y_i) - k\, y_i e^{-X_i \beta} - k X_i \beta + k \log(k) - \log \Gamma(k) \right) \]

Using statistical software, you can estimate \(\beta\) without calculating derivatives manually, as most packages handle this for you.
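
For intuition, here is a minimal IRLS sketch for the log link (an illustration, not the tutorial’s pipeline). For the Gamma family with a log link, the IRLS weights are constant, so each iteration reduces to an ordinary least-squares fit of a working response:

Code
# Minimal IRLS sketch for Gamma regression with a log link (illustrative).
# X is a design matrix whose first column is the intercept; y is the response.
irls_gamma_log <- function(X, y, tol = 1e-8, max_iter = 50) {
  beta <- c(log(mean(y)), rep(0, ncol(X) - 1))  # start from the intercept-only fit
  for (i in 1:max_iter) {
    eta <- drop(X %*% beta)       # linear predictor
    mu  <- exp(eta)               # inverse of the log link
    z   <- eta + (y - mu) / mu    # working response (weights are constant here)
    beta_new <- drop(solve(crossprod(X), crossprod(X, z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

Applied to the synthetic data generated below, irls_gamma_log(cbind(1, x), y) should return estimates very close to those we will obtain with optim(), since the shape parameter cancels out of the score equations for \(\beta\).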

Making Predictions

Once the parameters \(\beta\) are estimated, predictions can be made using the fitted model:

\[ \hat{\mu} = e^{X \hat{\beta}} \]

When to Use Gamma Regression

Gamma regression is typically suitable when:

  • The response variable is positive and continuous.

  • The data is right-skewed (meaning that there are more lower values, with a few higher outliers).

  • The variance increases with the mean, which is common in real-world data, especially in economics and the biological sciences.

When these conditions are met, a Gamma regression model can accurately capture the data’s behavior, offering a flexible approach for prediction and inference.

Here are some common types of data and use cases where Gamma regression is a good choice:

  1. Monetary Data (Costs, Claims, Expenses)
  • Insurance claims: The size of insurance claims often follows a Gamma-like distribution since claims are always positive and skewed, with few very high claims.
  • Healthcare costs: Medical expenses for treatments or hospital stays often have a Gamma distribution, with many low-cost cases and fewer high-cost ones.
  • Sales or revenue data: Especially for per-customer sales or revenue data, where amounts tend to vary widely and have a positive skew.
  2. Duration or Survival Data
  • Time until an event: For example, time until an employee leaves a job (employee turnover), or time until a product breaks or needs repair.
  • Waiting times: For instance, waiting times in queues, customer service, or transportation are often positive, with a few very long waits.
  3. Biological and Environmental Measurements
  • Reaction times: In psychology and neuroscience, reaction times are often Gamma-distributed, as they are positive, continuous, and can vary widely.
  • Rainfall amounts: Daily rainfall amounts are non-negative and positively skewed, with many days of low rainfall and a few days with very high amounts.
  • Chemical concentration levels: The concentration of a particular chemical in samples, which might have a skewed, positive distribution.
  4. Reliability and Engineering Data
  • Failure times: In engineering and reliability studies, the time until a component or system fails is often modeled with a Gamma distribution.
  • Load data: The amount of load or stress a system can handle before failure, which can vary and has a skewed distribution.

Gamma Regression Model from Scratch

  1. Generate synthetic data that follows a Gamma distribution.
  2. Set up a simple Gamma regression model using a log link function.
  3. Implement the model’s parameter estimation using maximum likelihood.

Here’s the code, along with an explanation of each step.

Generate Synthetic Gamma Data

We’ll generate synthetic data where the response variable \(y\) is Gamma-distributed, with the mean modeled as a function of a predictor variable \(x\) and some known coefficients.

Code
# Set a seed for reproducibility
set.seed(42)

# Generate predictor variable `x`
n <- 100
x <- runif(n, 0, 5)  # x ranges from 0 to 5

# Define true coefficients for the linear model (log-link)
beta0 <- 1.5  # Intercept
beta1 <- 0.5  # Slope

# Calculate mean response based on x and the coefficients, with log link
mu <- exp(beta0 + beta1 * x)

# Define shape parameter for the Gamma distribution
k <- 2  # shape parameter for the Gamma distribution

# Generate response variable `y` with Gamma distributed noise
y <- rgamma(n, shape = k, rate = k / mu)  # rate = shape / mean

Define the Log-Likelihood Function

To estimate the coefficients \(\beta\) of the Gamma regression model, we need to define the log-likelihood function, where \(\mu_i = e^{X_i \beta}\). Rather than coding the Gamma log-density by hand, we can evaluate it directly with dgamma(..., log = TRUE) and sum over the observations:

Code
# Define the log-likelihood function for Gamma regression
log_likelihood <- function(beta) {
  # Calculate the predicted mean (log-link function)
  mu_pred <- exp(beta[1] + beta[2] * x)
  
  # Calculate the log-likelihood for each observation and sum
  ll <- sum(dgamma(y, shape = k, rate = k / mu_pred, log = TRUE))
  return(-ll)  # Return negative log-likelihood for minimization
}

Estimate Parameters by Maximizing the Log-Likelihood

Now we’ll use optim() to minimize the negative log-likelihood and estimate the coefficients \(\beta_0\) and \(\beta_1\).

Code
# Initial guesses for beta coefficients
initial_guess <- c(0, 0)

# Use optim to find the values of beta that minimize the negative log-likelihood
result <- optim(par = initial_guess, fn = log_likelihood)

# Extract estimated coefficients
beta_est <- result$par
cat("Estimated coefficients:\n")
Estimated coefficients:
Code
cat("beta0 (Intercept):", beta_est[1], "\n")
beta0 (Intercept): 1.554173 
Code
cat("beta1 (Slope):", beta_est[2], "\n")
beta1 (Slope): 0.4932095 
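
As a side note, optim() can also return the Hessian of the negative log-likelihood; inverting it gives approximate standard errors for the estimates. A short sketch, reusing the log_likelihood function defined above:

Code
# Approximate standard errors from the observed information matrix
result_h <- optim(par = initial_guess, fn = log_likelihood, hessian = TRUE)
se_est <- sqrt(diag(solve(result_h$hessian)))
cat("Approximate standard errors:", se_est, "\n")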

Make Predictions

With the estimated coefficients, we can predict the mean response \(\hat{\mu}\) for new values of \(x\).

Code
# Predicted mean for the original data
mu_pred <- exp(beta_est[1] + beta_est[2] * x)

# Compare the predicted vs actual values
data.frame(x = x, y = y, predicted_mu = mu_pred)
              x          y predicted_mu
1   4.574030217  42.363605    45.156217
2   4.685377066  16.178429    47.705433
3   1.430697674  20.510047     9.581235
4   4.152238130  42.158336    36.675028
5   3.208727594  39.882720    23.028849
6   2.595479746  13.224789    17.018280
7   3.682941573  45.572615    29.096985
8   0.673332986   3.878546     6.594705
9   3.284961452  26.603192    23.911201
10  3.525323920  48.374117    26.920715
11  2.288708881   5.218006    14.628706
12  3.595561258  24.868288    27.869636
13  4.673361236  91.941308    47.423551
14  1.277144122   7.455469     8.882405
15  2.311464113   2.747101    14.793810
16  4.700072614  11.618745    48.052457
17  4.891132142  66.994345    52.800763
18  0.587436808   8.455298     6.321157
19  2.374985408  30.057138    15.264627
20  2.801663731   4.514289    18.839957
21  4.520156936  55.587090    43.972179
22  0.693550839   6.725307     6.660794
23  4.944458645  16.230864    54.207907
24  4.733341163  50.845136    48.847423
25  0.412187790   4.569657     5.797737
26  2.571058922   5.167799    16.814531
27  1.951017336   8.057439    12.384344
28  4.528690655  13.073700    44.157644
29  2.234848141   8.534290    14.245216
30  4.180021300   8.294648    37.181042
31  3.687978089  37.898052    29.169354
32  4.055275706  62.801401    34.962404
33  1.940541414   5.756098    12.320521
34  3.425848647  30.156167    25.631803
35  0.019741694   6.038733     4.777463
36  4.164580401  16.046543    36.898962
37  0.036670734   5.329703     4.817519
38  1.038294864   6.211904     7.895316
39  4.533007039  31.760641    44.251751
40  3.058893217  45.992335    21.388384
41  1.897796203  16.401822    12.063495
42  2.178857925   4.025186    13.857218
43  0.187155164   2.628685     5.188681
44  4.867699569 116.437934    52.194048
45  2.158756244  19.284603    13.720511
46  4.787882983  98.651656    50.179279
47  4.438774528  12.638543    42.242151
48  3.199893847  56.248937    22.928733
49  4.854833052  48.076299    51.863879
50  3.094191037   4.385258    21.764000
51  1.667136056   7.736406    10.766292
52  1.733741241  17.011555    11.125841
53  1.992427057  22.948527    12.639878
54  3.923463878   7.286450    32.761776
55  0.194682456   1.125100     5.207980
56  3.743976931  14.274837    29.986215
57  3.386384151  13.586702    25.137723
58  0.856321652   2.096797     7.217573
59  1.305439819  18.392507     9.007235
60  2.572064673   8.241148    16.822874
61  3.378036373  22.283880    25.034439
62  4.914085990 106.488509    53.401920
63  3.797721338   7.101178    30.791696
64  2.832442120  36.293413    19.128133
65  4.248448593  55.858641    38.457280
66  0.947369677   8.792877     7.549071
67  1.356433074  17.407641     9.236643
68  4.140792426  28.527225    36.468576
69  3.466024102  35.464669    26.144761
70  1.202723698   5.036316     8.562289
71  0.214943980   2.725892     5.260285
72  0.702395471  13.634997     6.689913
73  1.081927075  10.569742     8.067063
74  2.396992821   9.813454    15.431215
75  0.987051711   1.993461     7.698273
76  3.596779188  29.101330    27.886382
77  0.039423694   3.410414     4.824065
78  1.877449823  11.565367    11.943042
79  2.572038542  41.071944    16.822658
80  0.007852771   3.973394     4.749531
81  2.908020013  36.284602    19.854603
82  0.789526041   2.969797     6.983670
83  1.795141529  13.641077    11.467921
84  3.228159392  25.389071    23.250617
85  3.879116813  10.140482    32.052977
86  2.818234208  26.542136    18.994561
87  1.168516993   5.665246     8.419046
88  0.449902582   5.122125     5.906592
89  0.428060325   4.054134     5.843303
90  1.526091847  12.416021    10.042798
91  3.337132573   2.381031    24.534451
92  0.001194483  11.282857     4.733959
93  1.042849785   6.170992     7.913073
94  4.665170637  32.298556    47.232361
95  4.628223743  34.994928    46.379461
96  3.670471505  23.076055    28.918578
97  1.665359917  14.000533    10.756865
98  2.575316649   3.874249    16.849878
99  3.719873232   2.985145    29.631845
100 3.095796200  23.157457    21.781237

Prediction Performance

Code
# Performance metrics
MAE <- mean(abs(y - mu_pred))                # Mean Absolute Error
MSE <- mean((y - mu_pred)^2)                 # Mean Squared Error
SS_res <- sum((y - mu_pred)^2)               # Residual sum of squares
SS_tot <- sum((y - mean(y))^2)               # Total sum of squares
R_squared <- 1 - SS_res / SS_tot             # R-squared


# Summary statistics table
summary_table <- data.frame(
  Metric = c("Intercept (beta0)", "Slope (beta1)", "MAE", "MSE", "R-squared"),
  Value = c(beta_est[1], beta_est[2], MAE, MSE, R_squared)
)

cat("\nSummary Statistics Table:\n")

Summary Statistics Table:
Code
print(summary_table)
             Metric       Value
1 Intercept (beta0)   1.5541728
2     Slope (beta1)   0.4932095
3               MAE  11.9322726
4               MSE 299.7897444
5         R-squared   0.4362400

Cross-validation

Let’s implement k-fold cross-validation to evaluate the performance of our Gamma regression model. Cross-validation splits the dataset into \(k\) subsets (or “folds”) and uses each fold in turn for validation while training the model on the remaining \(k-1\) folds. Here’s how we can set it up in R:

Code
# Set seed for reproducibility
set.seed(42)

# Define the number of folds
k_folds <- 5
n <- length(y)
folds <- sample(rep(1:k_folds, length.out = n))

# Define a function to calculate Mean Squared Error (MSE)
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Initialize vector to store MSE for each fold
mse_values <- numeric(k_folds)

for (fold in 1:k_folds) {
  # Split data into training and validation sets
  train_index <- which(folds != fold)
  test_index <- which(folds == fold)
  
  x_train <- x[train_index]
  y_train <- y[train_index]
  
  x_test <- x[test_index]
  y_test <- y[test_index]
  
  # Redefine the log-likelihood function for the training set
  log_likelihood_train <- function(beta) {
    mu_pred <- exp(beta[1] + beta[2] * x_train)
    ll <- sum(dgamma(y_train, shape = k, rate = k / mu_pred, log = TRUE))
    return(-ll)
  }
  
  # Estimate parameters using training data
  result <- optim(par = c(0, 0), fn = log_likelihood_train)
  beta_est <- result$par
  
  # Make predictions on the test set
  mu_pred_test <- exp(beta_est[1] + beta_est[2] * x_test)
  
  # Calculate MSE for this fold and store it
  mse_values[fold] <- mse(y_test, mu_pred_test)
}

# Calculate average MSE across all folds
average_mse <- mean(mse_values)

# Display results
cat("MSE for each fold:\n")
MSE for each fold:
Code
print(mse_values)
[1] 503.5098 232.0111 218.7052 120.2255 500.7965
Code
cat("Average MSE across folds:", average_mse, "\n")
Average MSE across folds: 315.0496 

Gamma Regression in R

Gamma regression is implemented in R using the glm() function, which fits generalized linear models (GLMs) with various distributions and link functions. To perform Gamma regression in R, you need to specify the family = Gamma(link = "log") argument in the glm() function. Here’s a step-by-step guide to fitting a Gamma regression model in R:

Install Required R Packages

The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:

Code
#| warning: false
#| error: false

packages <- c('tidyverse', 'plyr', 'DataExplorer', 'dlookr', 'rstatix',
              'flextable', 'gtsummary', 'performance', 'report', 'sjPlot',
              'margins', 'marginaleffects', 'ggeffects', 'patchwork', 'metrica')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R-packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:metrica"         "package:patchwork"      
 [3] "package:ggeffects"       "package:marginaleffects"
 [5] "package:margins"         "package:sjPlot"         
 [7] "package:report"          "package:performance"    
 [9] "package:gtsummary"       "package:flextable"      
[11] "package:rstatix"         "package:dlookr"         
[13] "package:DataExplorer"    "package:plyr"           
[15] "package:lubridate"       "package:forcats"        
[17] "package:stringr"         "package:dplyr"          
[19] "package:purrr"           "package:readr"          
[21] "package:tidyr"           "package:tibble"         
[23] "package:ggplot2"         "package:tidyverse"      
[25] "package:stats"           "package:graphics"       
[27] "package:grDevices"       "package:utils"          
[29] "package:datasets"        "package:methods"        
[31] "package:base"           

Data

In this tutorial, we will be using the Insurance Premium data set. This data set consists of the following fields:

Age: The age column ranges from 18 to 64 years, which is the typical working age range for health insurance beneficiaries.

Sex: The sex column contains two unique values, likely male and female.

BMI (Body Mass Index): The BMI column values range from 15.96 to 53.13, which covers a wide range of body mass indices.

Children: The children column indicates that beneficiaries have between 0 and 5 children/dependents covered by their insurance.

Smoker: The smoker column contains two unique values, likely indicating whether the beneficiary is a smoker or not.

Region: The region column has four unique values, corresponding to the four regions in the US where beneficiaries are located.

Charges: The charges column (stored as expenses in the file) is a continuous variable ranging from 1121.87 to 63770.43, indicating a wide range of individual medical costs billed by health insurance.

We will use the read_csv() function of the {readr} package to import the data as a tidy tibble.

Code
# Load data
mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/insurance_premium.csv") |> 
  glimpse()
Rows: 1,338
Columns: 7
$ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex      <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi      <dbl> 27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7, 29.8, 25.8, 2…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ expenses <dbl> 16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 8240.…

This Insurance Premium Dataset is indeed a good candidate for Gamma regression, and here’s why:

  1. Continuous and Positive Target Variable
  • The charges or premium variable, which represents the medical costs or insurance premiums, is a continuous and positive variable. Gamma regression is specifically designed to model such outcomes, as it requires the target variable to be strictly positive.
  2. Skewed Distribution of Charges
  • Insurance costs or premiums often have a right-skewed distribution. A small number of high-cost cases (e.g., due to severe health conditions) can drive up the mean. Gamma regression is well-suited for such skewed distributions because the Gamma distribution can handle this asymmetry.
  3. Variance Increases with Mean
  • In insurance datasets, higher expected charges (mean) often come with higher variability. For example, older individuals or those with riskier health factors (e.g., smoking, high BMI) are expected to have both higher average costs and greater variability in costs. Gamma regression handles this by allowing the variance to increase with the mean, which matches the characteristics of the dataset.
  4. Predictors with Log-Linear Relationships
  • Gamma regression typically uses a log link function, meaning that the relationship between the predictors (e.g., age, BMI, smoker status) and the expected charges is multiplicative. This allows Gamma regression to capture nonlinear relationships effectively and can reflect that costs increase exponentially with risk factors rather than linearly, as the short illustration after this list shows.
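
As a toy numeric illustration of this multiplicative interpretation (the coefficient value below is assumed for illustration, not estimated from the data):

Code
# Hypothetical illustration: under a log link, a coefficient acts multiplicatively
beta_smoker <- 1.5   # assumed log-link coefficient for smoker = yes
exp(beta_smoker)     # expected charges multiply by about 4.48 for smokers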

Data Processing

The dataset contains BMI values for the studied subjects, calculated as a person’s weight (in kilograms) divided by the square of their height (in meters). We will categorize these values into four classes according to the CDC recommendation:

  • Underweight: Less than 18.5

  • Healthy Weight: 18.5 to less than 25

  • Overweight: 25 to less than 30

  • Obesity: 30 or greater

Code
mf$bmi_categories<- cut(mf$bmi, 
                   breaks=c(-Inf, 18.5, 25, 29.9,  Inf), 
                   labels=c("underweight",
                            "normal", 
                            "overweight",
                            "obesity"))

Convert to Factor

Code
mf$sex<- as.factor(mf$sex)
mf$smoker <- as.factor(mf$smoker)
mf$region <- as.factor(mf$region)

Data Description

First, we use the diagnose() function of the {dlookr} package to run a general diagnosis of all variables.

Code
#library (flextable)
dlookr::diagnose(mf) |>  
  flextable() 

variables        types    missing_count  missing_percent  unique_count  unique_rate
age              numeric  0              0                47            0.035127055
sex              factor   0              0                2             0.001494768
bmi              numeric  0              0                275           0.205530643
children         numeric  0              0                6             0.004484305
smoker           factor   0              0                2             0.001494768
region           factor   0              0                4             0.002989537
expenses         numeric  0              0                1,337         0.999252616
bmi_categories   factor   0              0                4             0.002989537

Then we will use the plot_histogram() function of the {DataExplorer} package to see the distribution of the expenses data.

Code
mf  |>  
  dplyr::select(expenses)  |> 
  plot_histogram()

Code
moments::skewness(mf$expenses, na.rm = TRUE)
[1] 1.51418

The high positive value indicates that the distribution is highly skewed to the right-hand side, meaning that some individuals have high expenses (>20,000) in health insurance.

The plot_bar() function from the {DataExplorer} package creates frequency distributions of bmi_categories, sex, smoker, and region:

Code
mf  |> 
  dplyr::select(bmi_categories, sex, smoker, region)  |> 
  plot_bar()

Split Data

We will split the data into training and testing sets, stratified by the bmi_categories and sex variables. The training set will contain 70% of the data, and the testing set will contain the remaining 30%.

Code
seeds = 11076
tr_prop = 0.70

# Training data (70% of data), stratified by bmi_categories and sex
train <- ddply(mf, .(bmi_categories, sex),
               function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] },
               seed = 101)

# Test data (remaining 30%)
test <- ddply(mf, .(bmi_categories, sex),
              function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] },
              seed = 101)
Code
# Density plot all, train and test data 
ggplot()+
  geom_density(data = mf, aes(expenses))+
  geom_density(data = train, aes(expenses), color = "green")+
  geom_density(data = test, aes(expenses), color = "red") +
      xlab("Expenses") + 
     ylab("Density")

Fit a Gamma Model in R

Only Intercept model

First, we will fit a Gamma model with an intercept only. This means modeling the data with no predictors. For Gamma models, this amounts to estimating the shape and scale parameters of the data. Notice we set link = "log". This is in response to errors and warnings that we received when using link = "identity" and link = "inverse" to fit more complicated models.

Code
# Fit the intercept-only Gamma model
inter.gamma <- glm(expenses ~ 1, data = train,
                   family = Gamma(link = "log"))
summary(inter.gamma)

Call:
glm(formula = expenses ~ 1, family = Gamma(link = "log"), data = train)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.46533    0.02974   318.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.8244218)

    Null deviance: 728.98  on 931  degrees of freedom
Residual deviance: 728.98  on 931  degrees of freedom
AIC: 19454

Number of Fisher Scoring iterations: 5

From above output, we notice the null and residual deviance are identical. The null deviance always reports deviance for the intercept-only model.

If we exponentiate the intercept we get the overall mean of expenses.

Code
exp(coef(inter.gamma))
(Intercept) 
   12904.54 
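
As a quick check, this back-transformed intercept matches the sample mean of the training expenses exactly: for an intercept-only Gamma GLM with a log link, the maximum-likelihood score equation is solved by the sample mean.

Code
# The intercept-only fit recovers the sample mean of the response
mean(train$expenses)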

In the Gamma family, the dispersion parameter can be estimated from the deviance and the degrees of freedom in the summary output.

Code
# Extract deviance and residual degrees of freedom from the model summary
deviance_val <- inter.gamma$deviance
residual_df <- inter.gamma$df.residual
# Calculate the dispersion parameter
dispersion_parameter <- deviance_val / residual_df
dispersion_parameter
[1] 0.7830084
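
Note that this deviance-based estimate (0.783) differs slightly from the dispersion reported by summary() (0.8244): summary.glm() uses the Pearson-based estimator, which we can reproduce by hand:

Code
# Pearson-based dispersion estimate, as used by summary.glm()
mu_hat <- fitted(inter.gamma)
sum((train$expenses - mu_hat)^2 / mu_hat^2) / inter.gamma$df.residual
# equivalently: sum(residuals(inter.gamma, type = "pearson")^2) / inter.gamma$df.residual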

We can also extract the estimated shape and scale parameters as follows:

Code
# shape: 1 divided by dispersion parameter
inter.gamma_shape <- 1/dispersion_parameter

# scale: mean/shape
inter.gamma_scale <- as.numeric(exp(coef(inter.gamma)))/inter.gamma_shape

Now let’s use the estimated shape and scale parameters to draw the estimated gamma distribution on top of a histogram of the observed data.

Code
hist(train$expenses, breaks = 40, freq = FALSE, 
     xlab = "Expenses",
     main= "Null Model")
curve(dgamma(x, shape = inter.gamma_shape,
             scale = inter.gamma_scale),
      from = 0, to = 70000, col = "red", add = TRUE)

Full Model

Now, let’s fit the full Gamma regression model. We’ll predict expenses using bmi_categories, sex, smoker, region, children, and age. We’ll use a log link function to ensure the predicted values are positive.

Code
# Fit the gamma model
fit.gamma<-glm(expenses~bmi_categories + sex +  smoker + region + children + age, 
               data= train,
              family = Gamma(link = "log"))

Model Summary

Code
summary(fit.gamma)

Call:
glm(formula = expenses ~ bmi_categories + sex + smoker + region + 
    children + age, family = Gamma(link = "log"), data = train)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               7.506578   0.196173  38.265  < 2e-16 ***
bmi_categoriesnormal      0.273121   0.192829   1.416  0.15700    
bmi_categoriesoverweight  0.295163   0.189897   1.554  0.12045    
bmi_categoriesobesity     0.428294   0.188581   2.271  0.02337 *  
sexmale                  -0.052456   0.045269  -1.159  0.24686    
smokeryes                 1.502742   0.057614  26.083  < 2e-16 ***
regionnorthwest          -0.084428   0.065172  -1.295  0.19549    
regionsoutheast          -0.192420   0.064621  -2.978  0.00298 ** 
regionsouthwest          -0.164759   0.065054  -2.533  0.01149 *  
children                  0.090770   0.019304   4.702 2.97e-06 ***
age                       0.027960   0.001627  17.182  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.4746951)

    Null deviance: 728.98  on 931  degrees of freedom
Residual deviance: 242.17  on 921  degrees of freedom
AIC: 18368

Number of Fisher Scoring iterations: 7

Check the Overall Model Fit

Code
anova(inter.gamma,fit.gamma)
Analysis of Deviance Table

Model 1: expenses ~ 1
Model 2: expenses ~ bmi_categories + sex + smoker + region + children + 
    age
  Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
1       931     728.98                                 
2       921     242.17 10   486.81 102.55 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above output, you see that adding the predictors reduces the residual deviance from 728.98 to 242.17 (a deviance reduction of 486.81, F = 102.55, p < 0.0001). This means that you can reject the null hypothesis that the model without predictors is as good as the model with the predictors.

Report Model Summary

Code
report::report(fit.gamma)
We fitted a general linear model (Gamma family with a log link) (estimated
using ML) to predict expenses with bmi_categories, sex, smoker, region,
children and age (formula: expenses ~ bmi_categories + sex + smoker + region +
children + age). The model's explanatory power is substantial (Nagelkerke's R2
= 0.75). The model's intercept, corresponding to bmi_categories = underweight,
sex = female, smoker = no, region = northeast, children = 0 and age = 0, is at
7.51 (95% CI [7.14, 7.91], t(921) = 38.27, p < .001). Within this model:

  - The effect of bmi categories [normal] is statistically non-significant and
positive (beta = 0.27, 95% CI [-0.13, 0.63], t(921) = 1.42, p = 0.157; Std.
beta = 0.27, 95% CI [-0.13, 0.63])
  - The effect of bmi categories [overweight] is statistically non-significant
and positive (beta = 0.30, 95% CI [-0.10, 0.65], t(921) = 1.55, p = 0.120; Std.
beta = 0.30, 95% CI [-0.10, 0.65])
  - The effect of bmi categories [obesity] is statistically significant and
positive (beta = 0.43, 95% CI [0.04, 0.78], t(921) = 2.27, p = 0.023; Std. beta
= 0.43, 95% CI [0.04, 0.78])
  - The effect of sex [male] is statistically non-significant and negative (beta
= -0.05, 95% CI [-0.14, 0.04], t(921) = -1.16, p = 0.247; Std. beta = -0.05,
95% CI [-0.14, 0.04])
  - The effect of smoker [yes] is statistically significant and positive (beta =
1.50, 95% CI [1.39, 1.62], t(921) = 26.08, p < .001; Std. beta = 1.50, 95% CI
[1.39, 1.62])
  - The effect of region [northwest] is statistically non-significant and
negative (beta = -0.08, 95% CI [-0.21, 0.04], t(921) = -1.30, p = 0.195; Std.
beta = -0.08, 95% CI [-0.21, 0.04])
  - The effect of region [southeast] is statistically significant and negative
(beta = -0.19, 95% CI [-0.32, -0.07], t(921) = -2.98, p = 0.003; Std. beta =
-0.19, 95% CI [-0.32, -0.07])
  - The effect of region [southwest] is statistically significant and negative
(beta = -0.16, 95% CI [-0.29, -0.04], t(921) = -2.53, p = 0.011; Std. beta =
-0.16, 95% CI [-0.29, -0.04])
  - The effect of children is statistically significant and positive (beta =
0.09, 95% CI [0.05, 0.13], t(921) = 4.70, p < .001; Std. beta = 0.11, 95% CI
[0.06, 0.15])
  - The effect of age is statistically significant and positive (beta = 0.03, 95%
CI [0.02, 0.03], t(921) = 17.18, p < .001; Std. beta = 0.39, 95% CI [0.35,
0.44])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

The tbl_regression() function from the gtsummary package takes a regression model object as input and produces a formatted table. With exp = TRUE, it reports exponentiated coefficients (multiplicative effects on expected expenses, given the log link) with confidence intervals.

Code
tbl_regression(fit.gamma, exp =TRUE)
Characteristic     exp(Beta)  95% CI      p-value
bmi_categories
    underweight
    normal         1.31       0.88, 1.88  0.2
    overweight     1.34       0.91, 1.91  0.12
    obesity        1.53       1.04, 2.18  0.023
sex
    female
    male           0.95       0.87, 1.04  0.2
smoker
    no
    yes            4.49       4.01, 5.05  <0.001
region
    northeast
    northwest      0.92       0.81, 1.04  0.2
    southeast      0.82       0.73, 0.94  0.003
    southwest      0.85       0.75, 0.96  0.011
children           1.10       1.05, 1.14  <0.001
age                1.03       1.03, 1.03  <0.001
Abbreviation: CI = Confidence Interval
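
The same exponentiated estimates can be reproduced directly from the model object:

Code
# Cross-check: exponentiated coefficients as multiplicative effects on expenses
round(exp(coef(fit.gamma)), 2)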

The tab_model() function of {sjPlot} package also creates HTML tables from regression models:

Code
tab_model(fit.gamma)
                             expenses
Predictors                   Estimates  CI                 p
(Intercept)                  1819.98    1258.72 – 2734.20  <0.001
bmi categories [normal]      1.31       0.88 – 1.88        0.157
bmi categories [overweight]  1.34       0.91 – 1.91        0.120
bmi categories [obesity]     1.53       1.04 – 2.18        0.023
sex [male]                   0.95       0.87 – 1.04        0.247
smoker [yes]                 4.49       4.01 – 5.05        <0.001
region [northwest]           0.92       0.81 – 1.04        0.195
region [southeast]           0.82       0.73 – 0.94        0.003
region [southwest]           0.85       0.75 – 0.96        0.011
children                     1.10       1.05 – 1.14        <0.001
age                          1.03       1.03 – 1.03        <0.001
Observations                 932
R2 Nagelkerke                0.750

The plot_model() function of the {sjPlot} package plots the estimates from the fitted Gamma model:

Code
plot_model(fit.gamma, vline.color = "red")

Shape and Scale of the Fitted Model

Code
# Calculate mean and variance of predicted values
predicted_mu <- fitted(fit.gamma)
mean_mu <- mean(predicted_mu)
var_mu <- var(predicted_mu)

# Estimate shape (k) and scale (theta)
shape_estimate <- mean_mu^2 / var_mu
scale_estimate <- var_mu / mean_mu

cat("Estimated shape (k):", shape_estimate, "\n")
Estimated shape (k): 1.064251 
Code
cat("Estimated scale (theta):", scale_estimate, "\n")
Estimated scale (theta): 12695.67 
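
For comparison, a more conventional shape estimate uses the model’s dispersion parameter, as we did for the intercept-only model; this will generally differ from the moment-based value above.

Code
# Alternative shape estimate from the dispersion parameter of the full model
disp_full <- summary(fit.gamma)$dispersion
1 / disp_full  # shape as the reciprocal of the dispersion (about 2.1)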

Now let’s use the estimated shape and scale parameters to draw the estimated gamma distribution on top of a histogram of the observed data.

Code
hist(train$expenses, breaks = 40, freq = FALSE, 
     xlab = "Expenses",
     main= "Full Model")
curve(dgamma(x, shape = shape_estimate,
             scale = scale_estimate),
      from = 0, to = 70000, col = "red", add = TRUE)

Model Performance

Code
performance::performance(fit.gamma)
# Indices of model performance

AIC       |      AICc |       BIC | Nagelkerke's R2 |     RMSE | Sigma
----------------------------------------------------------------------
18367.545 | 18367.884 | 18425.593 |           0.750 | 7746.727 | 0.689

Marginal Effect

To calculate marginal effects and adjusted predictions, the predict_response() function of {ggeffects} package is used. This function can return three types of predictions, namely, conditional effects, marginal effects or marginal means, and average marginal effects or counterfactual predictions. You can set the type of prediction you want by using the margin argument.

Code
effect.gamma<-ggeffects::predict_response(fit.gamma, "age", margin = "empirical")
plot(effect.gamma)

Code
age.bmi <- predict_response(fit.gamma, terms = c("age", "bmi_categories"))
plot(age.bmi, facets = TRUE)

Cross-validation

We’ll use 5-fold cross-validation to evaluate the model’s predictive performance using the Mean Squared Error (MSE) as the metric.

Code
# Define MSE function
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Set up cross-validation
set.seed(42)
k_folds <- 5
folds <- sample(rep(1:k_folds, length.out = nrow(mf)))
mse_values <- numeric(k_folds)

for (i in 1:k_folds) {
  # Split data into training and validation sets
  train_index <- which(folds != i)
  test_index <- which(folds == i)
  
  train_data <- mf[train_index, ]
  test_data <- mf[test_index, ]
  
  # Fit the Gamma model on training data
  cv_model <- glm(expenses ~ age + sex + bmi_categories + children + smoker + region, 
                  data = train_data, 
                  family = Gamma(link = "log"))
  
  # Predict on the validation set
  predictions <- predict(cv_model, newdata = test_data, type = "response")
  
  # Calculate MSE for the fold
  mse_values[i] <- mse(test_data$expenses, predictions)
}

# Average MSE across all folds
average_mse_cv <- mean(mse_values)
cat("Average MSE from cross-validation:", average_mse_cv, "\n")
Average MSE from cross-validation: 63856032 

Prediction on Test Data

Code
# Predict on the test set
test$Pred.gamma <- predict(fit.gamma, newdata = test, type = "response")

Prediction Performance

Code
my.metrics <- c("R2","RMSE", "MAE", "MBE")

metrics_summary(data = test,   
                obs = expenses,    
                pred = Pred.gamma,
                type = "regression",
                metrics_list = my.metrics
                 )    
  Metric        Score
1     R2    0.6959403
2    MAE 4564.6215381
3    MBE -684.8645575
4   RMSE 7959.5498409

1:1 Plot

Code
test.metrica.plot <- scatter_plot(data = test,
                obs = expenses,    
                pred = Pred.gamma,
                print_metrics = TRUE, 
                metrics_list = my.metrics,
                 position_metrics = c(x = 10, y = 75000)) +
                 ggtitle("Gamma Model") +
                theme_light()
test.metrica.plot  

Log Linear model

It’s important to note that a traditional linear model with a log-transformed response can be just as effective as Gamma regression in some cases. To illustrate this, we’ll fit a model with log(expenses) as the response using glm() with the default Gaussian family (equivalent to lm()).

Code
fit.lm<-glm(log(expenses)~bmi_categories + sex +  smoker + region + children + age, 
               data= train)
summary(fit.lm)

Call:
glm(formula = log(expenses) ~ bmi_categories + sex + smoker + 
    region + children + age, data = train)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               7.228502   0.128637  56.193  < 2e-16 ***
bmi_categoriesnormal      0.114645   0.126444   0.907 0.364809    
bmi_categoriesoverweight  0.175631   0.124521   1.410 0.158743    
bmi_categoriesobesity     0.300275   0.123658   2.428 0.015361 *  
sexmale                  -0.073391   0.029684  -2.472 0.013602 *  
smokeryes                 1.555938   0.037779  41.185  < 2e-16 ***
regionnorthwest          -0.092564   0.042735  -2.166 0.030569 *  
regionsoutheast          -0.185469   0.042374  -4.377 1.34e-05 ***
regionsouthwest          -0.144643   0.042658  -3.391 0.000727 ***
children                  0.109304   0.012658   8.635  < 2e-16 ***
age                       0.034409   0.001067  32.247  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.2041099)

    Null deviance: 780.74  on 931  degrees of freedom
Residual deviance: 187.99  on 921  degrees of freedom
AIC: 1176.8

Number of Fisher Scoring iterations: 2

The coefficients of the log-linear model are slightly different from those of the Gamma model, but the overall interpretation is almost identical:

Code
effect.lm<-ggeffects::predict_response(fit.lm, "age", margin = "empirical")
plot(effect.lm)
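
One caveat with the log-linear model: exponentiating its predictions targets the median of expenses rather than the mean. Under approximately normal errors on the log scale, a common adjustment adds half the residual variance before exponentiating. A sketch, assuming the test split created earlier is still in memory:

Code
# Back-transform log-scale predictions to the expense scale with a
# lognormal mean adjustment (exp(fit) alone would underestimate the mean)
sigma2 <- summary(fit.lm)$dispersion   # residual variance on the log scale
pred_log <- predict(fit.lm, newdata = test)
test$Pred.lm <- exp(pred_log + sigma2 / 2)
head(test$Pred.lm)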

Summary and Conclusion

In this tutorial, we covered the essentials of Gamma regression in R, focusing on a manual approach and the implementation using R’s glm() function. We discussed when to use Gamma regression, particularly for non-negative, skewed data with increasing variance, which is common in finance, healthcare, environmental science, and engineering. Implementing Gamma regression from scratch helped us understand the model’s mechanics, including the log-likelihood function and coefficient estimation. We then utilized R’s glm() function to fit a Gamma regression model more efficiently, emphasizing its simplicity in estimating parameters and interpreting output metrics like coefficients, standard errors, and goodness-of-fit measures. Gamma regression is vital for analyzing non-negative continuous data with a right-skewed distribution. With the knowledge gained in this tutorial, you can effectively apply Gamma regression to various real-world datasets, providing more accurate insights and predictions in fields such as claims costs, waiting times, and biological measurements.
