2. Generalized Linear Models for Panel Data

This tutorial demonstrates how to fit Generalized Linear Models (GLMs) for panel data in R using the pglm package.

Overview

Generalized Linear Models (GLMs) for Panel Data extend traditional GLMs to handle correlated observations in longitudinal or panel datasets. These models account for within-subject or within-group correlations arising from repeated measurements over time, while accommodating non-normal response variables (e.g., binary, count, or ordinal outcomes). Below is a structured explanation:

GLMs generalize linear regression in several ways (a minimal non-panel example follows this list):

  • Non-normal distributions (e.g., binomial, Poisson).
  • A link function (e.g., logit, log) connecting the linear predictor to the mean of the response.
  • Model form: \(g(E[Y|X]) = \mathbf{X}\beta\), where \(g\) is the link function.
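
To ground the notation, here is a minimal non-panel GLM as a baseline: a Poisson regression with a log link on the built-in InsectSprays data (base R only; a sketch, not part of the panel workflow).

Code
# Minimal non-panel GLM: Poisson regression with a log link, so that
# g(E[Y|X]) = log(E[Y|X]) = X * beta, i.e. E[Y|X] = exp(X * beta).
fit <- glm(count ~ spray, data = InsectSprays,
           family = poisson(link = "log"))
summary(fit)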

Panel data pose two challenges that standard GLMs do not address:

  • Within-Subject Correlation: Repeated measurements violate the independence assumption.
  • Unobserved Heterogeneity: Time-invariant individual/group characteristics.

Approaches to panel-data GLMs fall into three categories (a brief simulated sketch follows this list):

  • Random Effects (Mixed-Effects) Models: Model individual-specific variability with random effects.
  • Generalized Estimating Equations (GEE): Account for correlation using a working correlation matrix.
  • Fixed Effects Models: Control for time-invariant confounders via subject-specific intercepts.
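
As a brief simulated sketch of the random-effects approach (contrasted with a naive pooled fit), the code below generates a binary panel outcome with a subject-level intercept and fits both models; it assumes the lme4 package (installed in the next section), and all variable names are illustrative.

Code
# Simulated panel: 50 subjects, 5 periods each, with a subject-level
# random intercept that induces within-subject correlation.
set.seed(1)
n_id <- 50; n_t <- 5
d <- data.frame(id = rep(1:n_id, each = n_t),
                x  = rnorm(n_id * n_t))
u <- rnorm(n_id)                                   # subject-level effects
d$y <- rbinom(nrow(d), 1, plogis(0.5 * d$x + u[d$id]))

pooled <- glm(y ~ x, data = d, family = binomial)  # ignores the clustering
mixed  <- lme4::glmer(y ~ x + (1 | id), data = d,
                      family = binomial)           # random intercept per id
summary(mixed)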

Install Required R Packages

The following R packages are required to run this notebook. If any of these packages are not installed, you can install them using the code below:

Code
#| warning: false
#| error: false

packages <- c('tidyverse',
              'DataExplorer',
              'dlookr',
              'rstatix',
              'corrplot',
              'Hmisc',
              'tseries',
              'stargazer',
              'plm',
              'panelr',
              'pglm',
              'AER',
              'MASS',
              'lmtest',
              'ggstatsplot',
              'gtsummary',
              'ggExtra',
              'gridExtra',
              'kableExtra',
              'GGally',
              'gplots',
              'flextable'
              )

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)


# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load Packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:flextable"    "package:gplots"       "package:GGally"      
 [4] "package:kableExtra"   "package:gridExtra"    "package:ggExtra"     
 [7] "package:gtsummary"    "package:ggstatsplot"  "package:MASS"        
[10] "package:AER"          "package:survival"     "package:sandwich"    
[13] "package:lmtest"       "package:zoo"          "package:car"         
[16] "package:carData"      "package:pglm"         "package:maxLik"      
[19] "package:miscTools"    "package:panelr"       "package:lme4"        
[22] "package:Matrix"       "package:plm"          "package:stargazer"   
[25] "package:tseries"      "package:Hmisc"        "package:corrplot"    
[28] "package:rstatix"      "package:dlookr"       "package:DataExplorer"
[31] "package:lubridate"    "package:forcats"      "package:stringr"     
[34] "package:dplyr"        "package:purrr"        "package:readr"       
[37] "package:tidyr"        "package:tibble"       "package:ggplot2"     
[40] "package:tidyverse"    "package:stats"        "package:graphics"    
[43] "package:grDevices"    "package:utils"        "package:datasets"    
[46] "package:methods"      "package:base"        

Fit Panel Generalized Linear Models (GLMs) in R

Below is a step-by-step guide to fitting panel Generalized Linear Models (GLMs) with the {pglm} package in R for several distributions. The pglm() function estimates, by maximum likelihood, GLMs (binomial and Poisson) and ‘glm-like’ models (negative binomial and ordered) on longitudinal data. It is analogous to plm() for linear models but extends estimation to non-Gaussian families.

Gaussian Model

The Gaussian model is a linear panel regression model that assumes a Gaussian distribution for the response variable. It is suitable for continuous outcomes. In this example, we will use the Hedonic dataset from the plm package to predict median house values (mv) using structural variables.

Data

Code
# Load the Hedonic dataset
data("Hedonic", package = "plm")
head(Hedonic)
        mv    crim zn    indus chas     nox      rm      age     dis     rad
1 10.08580 0.00632 18 2.309999   no 28.9444 43.2306 65.19995 1.40854 0.00000
2  9.98045 0.02731  0 7.070000   no 21.9961 41.2292 78.89996 1.60283 0.69315
3 10.45450 0.02730  0 7.070000   no 21.9961 51.6242 61.09998 1.60283 0.69315
4 10.41630 0.03237  0 2.179998   no 20.9764 48.9720 45.79999 1.80207 1.09861
5 10.49680 0.06905  0 2.179998   no 20.9764 51.0796 54.19998 1.80207 1.09861
6 10.26470 0.02985  0 2.179998   no 20.9764 41.3449 58.69998 1.80207 1.09861
  tax  ptratio  blacks    lstat townid
1 296 15.29999 0.39690 -3.00074      1
2 242 17.79999 0.39690 -2.39251      2
3 242 17.79999 0.39283 -3.21165      2
4 222 18.70000 0.39464 -3.52744      3
5 222 18.70000 0.39690 -2.93163      3
6 222 18.70000 0.39412 -2.95555      3

Fit Gaussian Model

The code below fits a Gaussian regression model for panel data, where the dependent variable is the median value of owner-occupied homes (mv). The model includes various socio-economic and environmental factors as independent variables. The data come from the Hedonic data frame, and the model uses random effects to account for town-specific effects. The estimation prints a moderate amount of progress output (print.level = 2) and uses the Newton-Raphson method for optimization (method = "nr"). The Newton-Raphson method is an iterative numerical technique for finding the roots of a real-valued function; in maximum likelihood estimation (MLE), it is applied to find the parameter values that maximize the likelihood function. The family = gaussian argument specifies the Gaussian distribution for the dependent variable.
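
To make the optimizer concrete, here is a self-contained sketch (not part of pglm) of Newton-Raphson applied to a one-parameter maximum likelihood problem, the Poisson rate:

Code
# Newton-Raphson for the Poisson-rate MLE: iterate
# lambda <- lambda - score/hessian until the step is negligible.
y <- c(2, 4, 3, 5, 1)
lambda <- 1                                # starting value
for (i in 1:25) {
  score <- sum(y) / lambda - length(y)     # d logLik / d lambda
  hess  <- -sum(y) / lambda^2              # d^2 logLik / d lambda^2
  step  <- score / hess
  lambda <- lambda - step                  # Newton-Raphson update
  if (abs(step) < 1e-10) break
}
c(newton = lambda, analytic = mean(y))     # both equal 3, the exact MLE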

Code
# Fit random effects model
gaussian_model <- pglm(mv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax,
                       data = Hedonic,
                       model = "random",  # Random effects
                       print.level = 2, 
                       method = "nr",
                       family = gaussian,
                       index = "townid")  # Adjust index if needed 
----- Initial parameters: -----
fcn value: 101.4937 
                parameter initial gradient free
(Intercept) 10.1431982409         48.71098    1
crim        -0.0152229003       5341.74746    1
zn           0.0016597986        212.90370    1
indus       -0.0038685279        649.10677    1
chasyes      0.1627672799       -158.82467    1
nox         -0.0062097842       -703.80929    1
rm           0.0197078131        335.94104    1
age         -0.0033657047       2222.21444    1
dis         -0.2887445566         93.45003    1
rad          0.0788325926        145.00932    1
tax         -0.0007227763      31581.07251    1
sd.id        0.1653761061        -85.03668    1
sd.idios     0.1588553097        518.07729    1
Condition number of the (active) hessian: 6477161 
-----Iteration 1 -----
-----Iteration 2 -----
-----Iteration 3 -----
-----Iteration 4 -----
-----Iteration 5 -----
--------------
successive function values within relative tolerance limit (reltol) 
5  iterations
estimate: 10.0683 -0.007645667 0.0007862368 -0.001630563 -0.01850326 -0.01001396 0.01854551 -0.003185649 -0.1931323 0.09509491 -0.0005449124 0.173103 0.1546246 
Function value: 144.058 
Code
summary(gaussian_model)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 5 iterations
Return code 8: successive function values within relative tolerance limit (reltol)
Log-Likelihood: 144.058 
13  free parameters
Estimates:
              Estimate Std. error t value  Pr(> t)    
(Intercept) 10.0682995  0.1436021  70.112  < 2e-16 ***
crim        -0.0076457  0.0012114  -6.311 2.77e-10 ***
zn           0.0007862  0.0008344   0.942 0.346031    
indus       -0.0016306  0.0054942  -0.297 0.766634    
chasyes     -0.0185033  0.0344964  -0.536 0.591695    
nox         -0.0100140  0.0014160  -7.072 1.53e-12 ***
rm           0.0185455  0.0010675  17.373  < 2e-16 ***
age         -0.0031856  0.0004959  -6.423 1.33e-10 ***
dis         -0.1931323  0.0572119  -3.376 0.000736 ***
rad          0.0950949  0.0356706   2.666 0.007678 ** 
tax         -0.0005449  0.0002374  -2.296 0.021688 *  
sd.id        0.1731030  0.0161897  10.692  < 2e-16 ***
sd.idios     0.1546246  0.0053279  29.022  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------

From the summary output, we can interpret the coefficients of the structural variables (e.g., rm for rooms) and the estimated standard deviations of the random components (sd.id for the town-level effect, sd.idios for the idiosyncratic error).
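
For example, the two components imply an intraclass correlation, the share of total variance due to the town effect; a quick sketch, assuming coef() exposes the named estimates shown in the summary:

Code
# Intraclass correlation: sd.id^2 / (sd.id^2 + sd.idios^2)
s <- coef(gaussian_model)[c("sd.id", "sd.idios")]
unname(s["sd.id"]^2 / sum(s^2))   # ~ 0.173^2 / (0.173^2 + 0.155^2), about 0.56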

Ordinal Probit Model

The ordinal probit model is used to predict ordinal outcomes with an underlying continuous latent variable. In this example, we will use the Fairness dataset from the pglm package to model the ordinal fairness judgment (answer) as a function of education and the allocation rule.

Data

Code
data("Fairness", package = "pglm")        # Ordinal probit
head(Fairness)
  id answer good         rule driving education recurring
1  1      0  tgv        admin     yes        no        no
2  1      0  tgv      lottery     yes        no        no
3  1      3  tgv      queuing     yes        no        no
4  1      1  tgv compensation     yes        no        no
5  1      2  tgv        moral     yes        no        no
6  1   <NA>  tgv    addsupply     yes        no        no

Fit Ordinal Probit Model

The code below fits an ordinal regression model with a probit link function for panel data, where the dependent variable is the numeric value of an ordinal response (answer). The model includes education level and the allocation rule as independent variables. The data are a subset of the Fairness data frame: the first 105 rows of the observations where good is 'parking'. The model uses an ordinal family with a probit link, the BFGS optimization method, and random effects. BFGS stands for the Broyden–Fletcher–Goldfarb–Shanno algorithm, an iterative method for solving unconstrained nonlinear optimization problems. The estimation prints a moderate amount of progress output, and R = 5 sets the number of evaluation points used to numerically integrate out the random effects.

Code
Parking <- subset(Fairness, good == 'parking')
ordinal_model<- pglm(as.numeric(answer) ~ education + rule,
           Parking[1:105, ],
           family = ordinal('probit'), 
           R = 5, 
           print.level = 2,
           method = 'bfgs', 
           index = 'id', 
           model = "random")
initial  value 92.307736 
iter   2 value 91.435879
iter   3 value 91.231157
iter   4 value 91.092405
iter   5 value 90.945023
iter   6 value 90.935860
iter   7 value 90.796381
iter   8 value 90.719380
iter   9 value 90.669391
iter  10 value 90.648223
iter  11 value 90.646011
iter  12 value 90.645785
iter  13 value 90.643443
iter  14 value 90.641762
iter  15 value 90.641645
iter  16 value 90.641377
iter  17 value 90.641365
iter  17 value 90.641365
iter  17 value 90.641365
final  value 90.641365 
converged
Code
summary(ordinal_model)
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 42 iterations
Return code 0: successful convergence 
Log-Likelihood: -90.64137 
11  free parameters
Estimates:
                 Estimate Std. error t value  Pr(> t)    
(Intercept)       -0.6829     0.4450  -1.535 0.124873    
educationno       -0.1064     0.4560  -0.233 0.815453    
ruleadmin          0.3018     0.4859   0.621 0.534543    
rulelottery       -0.1740     0.5070  -0.343 0.731444    
ruleaddsupply      1.1885     0.4776   2.488 0.012830 *  
rulequeuing        3.0946     0.5486   5.641 1.69e-08 ***
rulemoral          4.0504     0.6138   6.599 4.13e-11 ***
rulecompensation   3.1033     0.5531   5.611 2.01e-08 ***
mu_1               1.4203     0.2809   5.057 4.26e-07 ***
mu_2               2.9207     0.3824   7.638 2.20e-14 ***
sigma              0.8779     0.2450   3.583 0.000339 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
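
As a cross-check, the same specification can be fit without the panel random effect using MASS::polr (MASS is already loaded); this non-panel ordered probit is only a sketch for comparison:

Code
# Non-panel ordered probit on the same subset, ignoring the id structure.
polr_fit <- polr(factor(answer) ~ education + rule,
                 data = Parking[1:105, ], method = "probit")
summary(polr_fit)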

Binomial Probit

The binomial probit model is used to predict binary outcomes with a probit link function. In this example, we will use the UnionWage dataset from the pglm package to predict union membership (union) using wages and demographic variables.

Data

Code
data("UnionWage", package = "pglm")       # Binomial probit
head(UnionWage)
  id year exper health hours married rural school union     wage
1 13 1980     1     no  2672       0    no     14    no 1.197540
2 13 1981     2     no  2320       0    no     14   yes 1.853060
3 13 1982     3     no  2940       0    no     14    no 1.344462
4 13 1983     4     no  2960       0    no     14    no 1.433213
5 13 1984     5     no  3071       0    no     14    no 1.568125
6 13 1985     6     no  2864       0    no     14    no 1.699891
           sector       occ   com    region
1  businessrepair   service white NorthEast
2 personalservice   service white NorthEast
3  businessrepair   service white NorthEast
4  businessrepair   service white NorthEast
5 personalservice  craftfor white NorthEast
6  businessrepair manoffpro white NorthEast

Fit Binomial Probit Model

The code fits a binomial regression model with a probit link function for panel data, where the dependent variable is union membership. The model includes wage, experience, and rural status as independent variables. The data are pooled across individuals and years, using a binomial distribution with a probit link function and the BFGS optimization method. The estimation prints detailed progress output, and R = 5 sets the number of points used for numerical integration when random effects are present. The panel structure is specified by the id identifier for individuals and the year variable. The argument method = "bfgs" specifies the estimation method: the Broyden–Fletcher–Goldfarb–Shanno algorithm, an iterative method for solving unconstrained nonlinear optimization problems. The argument model = "pooling" specifies a pooled model, meaning all observations are treated together rather than allowing for individual-specific effects (i.e., no fixed or random effects).

Code
binomial_model <- pglm(union ~ wage + exper + rural, 
                       data= UnionWage, 
                       family = binomial('probit'),
                       model = "pooling", 
                       method = "bfgs", 
                       print.level = 3, R = 5,
                       index = c("id", "year"))
Initial function value: -2422.802 
Initial gradient value:
  (Intercept)          wage         exper      ruralyes 
 5.684342e-14  2.451840e+02  1.026742e+02 -6.714318e+00 
initial  value 2422.801633 
iter   2 value 2422.139675
iter   3 value 2403.731724
iter   4 value 2379.712980
iter   5 value 2378.086129
iter   6 value 2373.508560
iter   7 value 2373.493851
iter   7 value 2373.493840
iter   7 value 2373.493840
final  value 2373.493840 
converged
Code
summary(binomial_model)
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 37 iterations
Return code 0: successful convergence 
Log-Likelihood: -2373.494 
4  free parameters
Estimates:
             Estimate Std. error t value Pr(> t)    
(Intercept) -1.348722   0.084777 -15.909  <2e-16 ***
wage         0.416122   0.043021   9.673  <2e-16 ***
exper       -0.008492   0.007697  -1.103   0.270    
ruralyes     0.038684   0.052845   0.732   0.464    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
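
To make the probit coefficients interpretable, the sketch below computes the implied probability of union membership for a worker at the mean wage and experience in a non-rural area (it assumes coef() returns the four estimates in the order shown above):

Code
# Predicted probability at mean wage and experience, rural = "no":
# Phi(b0 + b1*mean(wage) + b2*mean(exper) + b3*0)
b <- coef(binomial_model)
x <- c(1, mean(UnionWage$wage), mean(UnionWage$exper), 0)
pnorm(sum(b * x))   # implied membership probability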

Poisson Model

The Poisson model is used to predict count outcomes with a log link function. In this example, we will use the PatentsRDUS dataset from the pglm package to predict patent counts (patents) using R&D spending (rd) and year.

Data

Code
data("PatentsRDUS", package = "pglm")     # Poisson/Negative binomial
head(PatentsRDUS)
  cusip year ardssic scisect  capital72 sumpat         rd patents
1   800 1970      15      no 438.605335    354  2.7097056      30
2  1030 1970      14     yes   7.206043     13  0.6324526       3
3  2824 1970       4     yes 284.004476    493 29.6819762      48
4  4644 1970      13      no   1.981333      2  1.7218512       1
5  7842 1970      16      no   7.880017     12  1.4845475       2
6 10202 1970       3      no 396.533291    393  7.8366387      32

Fit Poisson Model

The code below fits a Poisson regression model for panel data, where the dependent variable is the number of patents. The model incorporates lagged values of the logarithm of R&D expenditure, a scientific-sector indicator, the logarithm of 1972 capital, and year-specific effects. The term lag(log(rd), 0:5) includes the current value and up to five lags of the logarithm of rd (research and development expenditure). The data are pooled across firms and years, using a Poisson distribution and the Newton-Raphson method for estimation. The panel structure is specified by the cusip identifier for firms and the year variable. The argument model = "pooling" specifies a pooled model, meaning all observations are treated together rather than allowing for individual-specific effects (i.e., no fixed or random effects).

Code
poisson_model <- pglm(patents ~ lag(log(rd), 0:5) + scisect + log(capital72) + factor(year),
                      data = PatentsRDUS,
                      family = poisson, 
                      model = "pooling", 
                      index = c("cusip", "year"),
                      print.level = 0, 
                      method="nr")

summary(poisson_model)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 6 iterations
Return code 8: successive function values within relative tolerance limit (reltol)
Log-Likelihood: -17834.14 
13  free parameters
Estimates:
                    Estimate Std. error t value  Pr(> t)    
(Intercept)         0.809910   0.021189  38.224  < 2e-16 ***
lag(log(rd), 0:5)0  0.134525   0.030718   4.379 1.19e-05 ***
lag(log(rd), 0:5)1 -0.052944   0.042797  -1.237 0.216050    
lag(log(rd), 0:5)2  0.008229   0.039784   0.207 0.836126    
lag(log(rd), 0:5)3  0.066097   0.036969   1.788 0.073789 .  
lag(log(rd), 0:5)4  0.090181   0.033336   2.705 0.006826 ** 
lag(log(rd), 0:5)5  0.239538   0.022460  10.665  < 2e-16 ***
scisectyes          0.454310   0.009242  49.155  < 2e-16 ***
log(capital72)      0.252863   0.004414  57.283  < 2e-16 ***
factor(year)1976   -0.043515   0.013121  -3.316 0.000912 ***
factor(year)1977   -0.052441   0.013288  -3.947 7.93e-05 ***
factor(year)1978   -0.170242   0.013530 -12.583  < 2e-16 ***
factor(year)1979   -0.201879   0.013534 -14.917  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
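
Because the model uses a log link, exponentiated coefficients are incidence-rate ratios, i.e., multiplicative effects on the expected patent count; a one-line sketch:

Code
# Exponentiated coefficients = incidence-rate ratios (an IRR above 1 means
# the predictor raises the expected patent count).
round(exp(coef(poisson_model)), 3)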

Negative Binomial

The negative binomial model is used to predict count outcomes with overdispersion. In this example, we will use the PatentsRDUS dataset from the pglm package to predict patent counts (patents) using R&D spending (rd) and year.

Data

We reuse the PatentsRDUS dataset that was already loaded and previewed in the Poisson section above.

Fit Negative Binomial Model

The code below fits a Negative Binomial regression model for panel data, where the dependent variable is the number of patents. The model incorporates lagged values of the logarithm of R&D expenditure, a scientific-sector indicator, the logarithm of 1972 capital, and year-specific effects. The data come from the PatentsRDUS data frame, and the model (model = "within") uses fixed effects to control for individual-specific effects. The estimation prints a low level of output and uses the Newton-Raphson method (method = "nr") for optimization. The Negative Binomial family (family = negbin) is appropriate for modeling overdispersed count data. The panel structure is specified by the cusip identifier for firms and the year variable.

Code
negbin_model <- pglm(patents ~ lag(log(rd), 0:5) + scisect + log(capital72) + factor(year),
                     data = PatentsRDUS, 
                     family = negbin, 
                     model = "within", 
                     print.level = 1,
                     method = "nr",
                    index = c('cusip', 'year'))
--------------
successive function values within relative tolerance limit (reltol) 
4  iterations
estimate: 1.661392 0.272679 -0.09788661 0.03207622 -0.02039233 0.01622142 -0.009727993 0.01763978 0.2071488 -0.03839267 -0.03994033 -0.1443278 -0.1957518 
Function value: -3203.064 
Code
summary(negbin_model)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 4 iterations
Return code 8: successive function values within relative tolerance limit (reltol)
Log-Likelihood: -3203.064 
13  free parameters
Estimates:
                    Estimate Std. error t value  Pr(> t)    
(Intercept)         1.661392   0.343552   4.836 1.33e-06 ***
lag(log(rd), 0:5)0  0.272679   0.070784   3.852 0.000117 ***
lag(log(rd), 0:5)1 -0.097887   0.076792  -1.275 0.202416    
lag(log(rd), 0:5)2  0.032076   0.070864   0.453 0.650804    
lag(log(rd), 0:5)3 -0.020392   0.065771  -0.310 0.756522    
lag(log(rd), 0:5)4  0.016221   0.062885   0.258 0.796443    
lag(log(rd), 0:5)5 -0.009728   0.053286  -0.183 0.855143    
scisectyes          0.017640   0.198074   0.089 0.929037    
log(capital72)      0.207149   0.077968   2.657 0.007888 ** 
factor(year)1976   -0.038393   0.024472  -1.569 0.116688    
factor(year)1977   -0.039940   0.025227  -1.583 0.113374    
factor(year)1978   -0.144328   0.026460  -5.455 4.91e-08 ***
factor(year)1979   -0.195752   0.027164  -7.206 5.75e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
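
A quick way to justify the negative binomial over the Poisson is an overdispersion test. AER::dispersiontest() expects an ordinary glm object rather than a pglm fit, so the sketch below applies it to a pooled, non-panel Poisson GLM with a simplified right-hand side (an illustrative check, not the panel model itself):

Code
# Pooled Poisson GLM fitted only for the overdispersion test.
pois_glm <- glm(patents ~ log(capital72) + scisect + factor(year),
                data = PatentsRDUS, family = poisson)
AER::dispersiontest(pois_glm)  # small p-value => overdispersion => negbin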

Tobit Model

The Tobit model, developed by James Tobin in 1958, is a type of regression model designed to estimate linear relationships between variables when there is either left- or right-censoring in the dependent variable. Censoring occurs when the value of the dependent variable is only observed within certain bounds. The Tobit model is particularly useful when the dependent variable has a significant number of observations at a limiting value (e.g., zero). The classic example is modeling household expenditure on durable goods where many households report zero expenditure.

When applying the Tobit model to panel data, we need to account for the repeated observations over time for the same individuals or entities. This involves adding individual-specific effects (fixed or random) to the model.
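
In latent-variable form, the random-effects panel Tobit is \(y_{it}^{*} = \mathbf{x}_{it}\beta + \alpha_i + \varepsilon_{it}\), with the observed outcome \(y_{it} = \max(0, y_{it}^{*})\), where \(\alpha_i\) is the individual-specific (random) effect and \(\varepsilon_{it}\) is the idiosyncratic normal error.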

The Tobit model is used here to predict a censored outcome: medical expenditure (med), which is left-censored at 0. In this example, we will use the HealthIns dataset from the pglm package, with utilization and demographic variables as predictors.

Data

Code
data("HealthIns", package = "pglm")       # Tobit
head(HealthIns)
      id year       med mdu coins  disease  sex      age size child
1 125024    1  8.451119   0     0 13.73189 male 42.87748    4    no
2 125024    2 62.075470   2     0 13.73189 male 43.87748    4    no
3 125024    3  0.000000   0     0 13.73189 male 44.87748    4    no
4 125024    4  0.000000   0     0 13.73189 male 45.87748    4    no
5 125024    5  0.000000   0     0 13.73189 male 46.87748    4    no
6 125025    1  0.000000   0     0 13.73189 male 16.59138    4   yes

First we rescale expenditure by creating med2 = med / 1000. Then we remove the 2209th row from the HealthIns dataset and assign the result to HealthIns2.

Code
HealthIns$med2 <- HealthIns$med / 1000
HealthIns2 <- HealthIns[-2209, ]

# Check for NA values in the predictors and response
sum(is.na(HealthIns2))
[1] 0
Code
# Remove rows with NA values
HealthIns2 <- na.omit(HealthIns2)

Fit Tobit Model

The code below fits a Tobit regression model with random effects to the HealthIns2 dataset. The model uses the med variable as the dependent variable and includes mdu, disease, and age as independent variables. The Tobit model is appropriate for censored dependent variables, and the estimation is performed using the Newton-Raphson method, with R = 5 evaluation points for integrating out the random effects. The dataset was first prepared by rescaling the expenditure variable and removing one problematic row.

Code
# Fit the Tobit model with random effects
tobit_model <- pglm(med ~ mdu + disease + age, 
           data = HealthIns2,
           model = 'random', 
           family = 'tobit', 
           print.level = 0,
           method = 'nr', R = 5)

# Summary of the model
summary(tobit_model)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 5 iterations
Return code 1: gradient close to zero (gradtol)
Log-Likelihood: -127807.9 
6  free parameters
Estimates:
             Estimate Std. error t value  Pr(> t)    
(Intercept) -333.0697    12.6778  -26.27  < 2e-16 ***
mdu           58.6491     1.1701   50.12  < 2e-16 ***
disease        5.5655     0.8382    6.64 3.14e-11 ***
age            5.2772     0.3317   15.91  < 2e-16 ***
sd.idios     675.2935     4.0108  168.37  < 2e-16 ***
sd.id       -161.9779     8.5769  -18.89  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------

Summary of Results

  • Gaussian: Structural variables (e.g., rm for rooms) significantly affect median house values.
  • Ordinal Probit: The allocation rule strongly shapes perceived fairness; the queuing, moral, and compensation rules carry large positive coefficients.
  • Binomial Probit: Higher wages are associated with a higher probability of union membership (positive, significant wage coefficient).
  • Poisson: R&D spending (rd) increases patent counts (exponentiated coefficients, i.e., incidence-rate ratios, above 1).
  • Negative Binomial: Accounts for overdispersion; the contemporaneous R&D effect remains significant.
  • Tobit: Utilization (mdu), disease burden, and age are positively associated with latent medical spending (censored at 0).

Key Considerations

  1. Random vs. Fixed Effects: Use the Hausman test to choose between models (see the sketch after this list).
  2. Overdispersion: Prefer negative binomial over Poisson if significant overdispersion exists.
  3. Censoring: Tobit models handle censored outcomes but require normality assumptions.
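
As an illustration of point 1, here is a sketch of the Hausman test using plm::phtest on linear panel fits of the Hedonic data (the regressor choice is arbitrary and only demonstrates the mechanics):

Code
# Fixed- vs random-effects comparison via the Hausman test.
fe <- plm(mv ~ crim + rm + age, data = Hedonic,
          index = "townid", model = "within")
re <- plm(mv ~ crim + rm + age, data = Hedonic,
          index = "townid", model = "random")
phtest(fe, re)   # small p-value favors the fixed-effects model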

By leveraging pglm, you can estimate panel GLMs for diverse data types while accounting for unobserved heterogeneity and correlation structures.

Summary and Conclusions

This tutorial demonstrated how to fit Generalized Linear Models (GLMs) for Panel Data in R using the pglm package. GLMs extend traditional linear models to handle correlated observations in longitudinal datasets with non-Gaussian outcomes. We covered various distributions, including Gaussian, ordinal probit, binomial probit, Poisson, negative binomial, and Tobit models, with examples from different datasets. The results provided insights into the relationships between predictors and outcomes, accounting for panel data structures and individual-specific effects. By applying these models, researchers can analyze longitudinal data with diverse response types and address challenges such as within-subject correlation and unobserved heterogeneity. The flexibility of GLMs for panel data allows for robust statistical inference and model selection in longitudinal studies.
