This tutorial demonstrates how to fit Generalized Linear Models (GLMs) for panel data in R using the pglm package.
Overview
Generalized Linear Models (GLMs) for Panel Data extend traditional GLMs to handle correlated observations in longitudinal or panel datasets. These models account for within-subject or within-group correlations arising from repeated measurements over time, while accommodating non-normal response variables (e.g., binary, count, or ordinal outcomes). Below is a structured explanation:
Below is a step-by-step guide to fitting panel estimators for Generalized Linear Models (GLMs) using the {pglm} package in R for different distributions. The pglm() function estimates GLMs (binomial and Poisson) and 'glm-like' models (negative binomial and ordered) on longitudinal data by maximum likelihood. The function is similar to plm() for linear models but extends to non-Gaussian families.
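Before running the examples, the required packages can be installed and loaded. This setup sketch assumes a standard CRAN installation:

```r
# One-time installation from CRAN (pglm builds on plm for panel data handling)
install.packages(c("pglm", "plm"))

# Load the packages used throughout this tutorial
library(pglm)
library(plm)
```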
Gaussian Model
The Gaussian model is a linear panel regression model that assumes a Gaussian distribution for the response variable. It is suitable for continuous outcomes. In this example, we will use the Hedonic dataset from the plm package to predict median house values (mv) using structural variables.
Data
Code
# Load the Hedonic dataset
data("Hedonic", package = "plm")
head(Hedonic)
The code below fits a Gaussian regression model for panel data, where the dependent variable is the median value of owner-occupied homes (mv). The model includes various socio-economic and environmental factors as independent variables. The data is from the Hedonic data frame, and the model uses random effects to account for town-specific effects. The estimation process includes a moderate level of printed output and uses the Newton-Raphson method for optimization (method = "nr"). The Newton-Raphson method is an iterative numerical technique for finding the roots of a real-valued function; in the context of maximum likelihood estimation (MLE), it is used to find the parameter values that maximize the likelihood function. Setting family = gaussian specifies the Gaussian distribution for the dependent variable.
Code
# Fit a random-effects Gaussian model
gaussian_model <- pglm(mv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax,
                       data = Hedonic,
                       model = "random",   # random effects
                       print.level = 2,
                       method = "nr",
                       family = gaussian,
                       index = "townid")   # adjust index if needed
summary(gaussian_model)
From the summary output, we can interpret the coefficients of the structural variables (e.g., rm for rooms) and the random effects variance (sigma_u for individual, sigma_e for error).
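As an aside, the Newton-Raphson updates that method = "nr" performs internally can be illustrated with a toy example that is not part of the tutorial's data: maximizing the Poisson log-likelihood for a single rate parameter, whose closed-form MLE is the sample mean.

```r
# Toy illustration of Newton-Raphson maximum likelihood estimation:
# maximize the Poisson log-likelihood in a single rate parameter lambda.
y <- c(2, 3, 1, 4, 2)                                       # toy count data
loglik_grad <- function(lambda) sum(y) / lambda - length(y) # score (first derivative)
loglik_hess <- function(lambda) -sum(y) / lambda^2          # Hessian (second derivative)

lambda <- 1                                                 # starting value
for (i in 1:20) {
  step <- loglik_grad(lambda) / loglik_hess(lambda)
  lambda <- lambda - step                                   # Newton-Raphson update
  if (abs(step) < 1e-10) break
}
lambda   # converges to the analytical MLE, mean(y) = 2.4
```

pglm applies the same idea in many dimensions, iterating on the full parameter vector until the gradient of the panel log-likelihood is (numerically) zero.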
Ordinal Probit Model
The ordinal probit model is used to predict ordinal outcomes with an underlying continuous latent variable. In this example, we will use the Fairness dataset from the pglm package to predict ordinal fairness judgments (answer) using education and the allocation rule (rule).
id answer good rule driving education recurring
1 1 0 tgv admin yes no no
2 1 0 tgv lottery yes no no
3 1 3 tgv queuing yes no no
4 1 1 tgv compensation yes no no
5 1 2 tgv moral yes no no
6 1 <NA> tgv addsupply yes no no
Fit Ordinal Probit Model
The code below fits an ordinal regression model with a probit link function for panel data, where the dependent variable is the numeric value of an ordinal response (answer). The model includes education level and rule-related factors as independent variables. The data is a subset of the Fairness data frame, specifically the first 105 rows where good is 'parking'. The model uses an ordinal distribution with a probit link function, the bfgs optimization method, and includes random effects. BFGS stands for the Broyden–Fletcher–Goldfarb–Shanno algorithm, an iterative method for solving unconstrained nonlinear optimization problems. The estimation process includes a moderate level of printed output and uses 5 random draws for simulation-based estimation.
Code
data("Fairness", package = "pglm")
Parking <- subset(Fairness, good == 'parking')
ordinal_model <- pglm(as.numeric(answer) ~ education + rule,
                      data = Parking[1:105, ],
                      family = ordinal('probit'),
                      R = 5,
                      print.level = 2,
                      method = 'bfgs',
                      index = 'id',
                      model = "random")
initial value 92.307736
iter 2 value 91.435879
iter 3 value 91.231157
iter 4 value 91.092405
iter 5 value 90.945023
iter 6 value 90.935860
iter 7 value 90.796381
iter 8 value 90.719380
iter 9 value 90.669391
iter 10 value 90.648223
iter 11 value 90.646011
iter 12 value 90.645785
iter 13 value 90.643443
iter 14 value 90.641762
iter 15 value 90.641645
iter 16 value 90.641377
iter 17 value 90.641365
iter 17 value 90.641365
iter 17 value 90.641365
final value 90.641365
converged
Binomial Probit Model
The binomial probit model is used to predict binary outcomes with a probit link function. In this example, we will use the UnionWage dataset from the pglm package to predict union membership (union) using wage, experience, and rural status.
id year exper health hours married rural school union wage
1 13 1980 1 no 2672 0 no 14 no 1.197540
2 13 1981 2 no 2320 0 no 14 yes 1.853060
3 13 1982 3 no 2940 0 no 14 no 1.344462
4 13 1983 4 no 2960 0 no 14 no 1.433213
5 13 1984 5 no 3071 0 no 14 no 1.568125
6 13 1985 6 no 2864 0 no 14 no 1.699891
sector occ com region
1 businessrepair service white NorthEast
2 personalservice service white NorthEast
3 businessrepair service white NorthEast
4 businessrepair service white NorthEast
5 personalservice craftfor white NorthEast
6 businessrepair manoffpro white NorthEast
Fit Binomial Probit Model
The code fits a binomial regression model with a probit link function for panel data, where the dependent variable is union membership. The model includes wage, experience, and rural status as independent variables. The data is pooled across individuals and years, using a binomial distribution with a probit link function and the BFGS optimization method. The estimation process includes detailed printed output and uses 5 random draws for simulation-based estimation. The panel structure is specified by the id identifier for individuals and the year variable. The argument method = "bfgs" specifies the estimation method; BFGS stands for the Broyden–Fletcher–Goldfarb–Shanno algorithm, an iterative method for solving unconstrained nonlinear optimization problems. The argument model = "pooling" specifies a pooled model, meaning all observations are treated together rather than allowing for individual-specific effects (i.e., no fixed or random effects).
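The tutorial does not reproduce the call itself. Based on the description above and the gradient printout below (intercept, wage, exper, ruralyes), it plausibly looks like the following sketch; the exact argument values are an assumption, not taken verbatim from the source:

```r
# Assumed call reconstructed from the surrounding description;
# argument values are illustrative, not quoted from the original.
data("UnionWage", package = "pglm")
union_model <- pglm(union ~ wage + exper + rural,
                    data = UnionWage,
                    family = binomial('probit'),
                    model = "pooling",
                    method = "bfgs",
                    print.level = 3,
                    R = 5,
                    index = c("id", "year"))
summary(union_model)
```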
Initial function value: -2422.802
Initial gradient value:
(Intercept) wage exper ruralyes
5.684342e-14 2.451840e+02 1.026742e+02 -6.714318e+00
initial value 2422.801633
iter 2 value 2422.139675
iter 3 value 2403.731724
iter 4 value 2379.712980
iter 5 value 2378.086129
iter 6 value 2373.508560
iter 7 value 2373.493851
iter 7 value 2373.493840
iter 7 value 2373.493840
final value 2373.493840
converged
Poisson Model
The Poisson model is used to predict count outcomes with a log link function. In this example, we will use the PatentsRDUS dataset from the pglm package to predict patent counts (patents) using R&D spending (rd) and year.
The code below fits a Poisson regression model for panel data, where the dependent variable is the number of patents. The model incorporates lagged values of the logarithm of R&D expenditure, a scientific sector variable, the logarithm of capital in 1972, and year-specific effects. The term lag(log(rd), 0:5) includes the current value and up to 5 lags of the logarithm of rd (research and development expenditure). The data is pooled across firms and years, using a Poisson distribution and the Newton-Raphson method for estimation. The panel structure is specified by the cusip identifier for firms and the year variable. model = "pooling" specifies a pooled model, meaning all observations are treated together rather than allowing for individual-specific effects (i.e., no fixed or random effects).
Code
data("PatentsRDUS", package = "pglm")
poisson_model <- pglm(patents ~ lag(log(rd), 0:5) + scisect + log(capital72) + factor(year),
                      data = PatentsRDUS,
                      family = poisson,
                      model = "pooling",
                      index = c("cusip", "year"),
                      print.level = 0,
                      method = "nr")
summary(poisson_model)
Negative Binomial Model
The negative binomial model is used to predict count outcomes with overdispersion. In this example, we will use the PatentsRDUS dataset from the pglm package to predict patent counts (patents) using R&D spending (rd) and year.
The code below fits a negative binomial regression model for panel data, where the dependent variable is the number of patents. The model incorporates lagged values of the logarithm of R&D expenditure, a scientific sector variable, the logarithm of capital in 1972, and year-specific effects. The data is from the PatentsRDUS data frame, and the model (model = "within") uses fixed effects to control for individual-specific effects. The estimation process includes a low level of printed output and uses the Newton-Raphson method (method = "nr") for optimization. The negative binomial family (family = negbin) is appropriate for modeling overdispersed count data. The panel structure is specified by the cusip identifier for firms and the year variable.
Code
negbin_model <- pglm(patents ~ lag(log(rd), 0:5) + scisect + log(capital72) + factor(year),
                     data = PatentsRDUS,
                     family = negbin,
                     model = "within",
                     print.level = 1,
                     method = "nr",
                     index = c('cusip', 'year'))
--------------
successive function values within relative tolerance limit (reltol)
4 iterations
estimate: 1.661392 0.272679 -0.09788661 0.03207622 -0.02039233 0.01622142 -0.009727993 0.01763978 0.2071488 -0.03839267 -0.03994033 -0.1443278 -0.1957518
Function value: -3203.064
Tobit Model
The Tobit model, developed by James Tobin in 1958, is a type of regression model designed to estimate linear relationships between variables when there is either left- or right-censoring in the dependent variable. Censoring occurs when the value of the dependent variable is only observed within certain bounds. The Tobit model is particularly useful when the dependent variable has a significant number of observations at a limiting value (e.g., zero). The classic example is modeling household expenditure on durable goods, where many households report zero expenditure.
When applying the Tobit model to panel data, we need to account for the repeated observations over time for the same individuals or entities. This involves adding individual-specific effects (fixed or random) to the model.
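In standard notation, a sketch of the usual random-effects formulation (not quoted from the tutorial) posits a latent outcome that is only observed above the censoring point:

```latex
y^{*}_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it},
\qquad
y_{it} = \max\left(0,\; y^{*}_{it}\right),
\qquad
\alpha_i \sim N(0, \sigma^2_{\alpha}), \quad
\varepsilon_{it} \sim N(0, \sigma^2_{\varepsilon}),
```

where \(\alpha_i\) is the individual-specific random effect and \(\varepsilon_{it}\) the idiosyncratic error. The observed \(y_{it}\) is zero whenever the latent \(y^{*}_{it}\) falls at or below the censoring point.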
The Tobit model is used to predict censored outcomes, where the dependent variable is left-censored at 0. In this example, we will use the HealthIns dataset from the pglm package to predict medical expenditures (med) using utilization and demographic variables.
id year med mdu coins disease sex age size child
1 125024 1 8.451119 0 0 13.73189 male 42.87748 4 no
2 125024 2 62.075470 2 0 13.73189 male 43.87748 4 no
3 125024 3 0.000000 0 0 13.73189 male 44.87748 4 no
4 125024 4 0.000000 0 0 13.73189 male 45.87748 4 no
5 125024 5 0.000000 0 0 13.73189 male 46.87748 4 no
6 125025 1 0.000000 0 0 13.73189 male 16.59138 4 yes
First we rescale the med variable, then remove the 2209th row from the HealthIns dataset and assign the result to HealthIns2.
Code
data("HealthIns", package = "pglm")
HealthIns$med2 <- HealthIns$med / 1000
HealthIns2 <- HealthIns[-2209, ]
# Check for NA values in the predictors and response
sum(is.na(HealthIns2))
[1] 0
Code
# Remove rows with NA values
HealthIns2 <- na.omit(HealthIns2)
Fit Tobit Model
The code below fits a Tobit regression model with random effects for the HealthIns2 dataset. The model uses the med variable as the dependent variable and includes mdu, disease, and age as independent variables. The Tobit model is appropriate for censored dependent variables, and the estimation is performed using the Newton-Raphson method with 5 random draws.
Code
# Fit the Tobit model with random effects
tobit_model <- pglm(med ~ mdu + disease + age,
                    data = HealthIns2,
                    model = 'random',
                    family = 'tobit',
                    print.level = 0,
                    method = 'nr',
                    R = 5)
# Summary of the model
summary(tobit_model)
Gaussian
Structural variables (e.g., rm for rooms) significantly affect house values.
Ordinal Probit
Education and the allocation rule shape the perceived fairness rating.
Binomial Probit
Higher wage reduces the likelihood of union membership.
Poisson
R&D spending (rd) increases patent counts (incidence rate ratio > 1).
Negative Binomial
Accounts for overdispersion; the R&D effect remains significant.
Tobit
Doctor visits (mdu), disease, and age relate to latent medical spending (censored at 0).
Key Considerations
Random vs. Fixed Effects: Use the Hausman test to choose between models.
Overdispersion: Prefer negative binomial over Poisson if significant overdispersion exists.
Censoring: Tobit models handle censored outcomes but require normality assumptions.
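As an illustration of the Hausman test mentioned above, the plm package provides phtest() for linear panel models. The sketch below compares fixed- and random-effects fits on the Hedonic data; the particular regressors chosen are illustrative, not prescribed by the tutorial:

```r
library(plm)
data("Hedonic", package = "plm")

# Fit the same specification under fixed ("within") and random effects
fe <- plm(mv ~ crim + rm + age, data = Hedonic, index = "townid", model = "within")
re <- plm(mv ~ crim + rm + age, data = Hedonic, index = "townid", model = "random")

# Hausman test: a small p-value suggests the random-effects assumptions
# are violated, favoring the fixed-effects model
phtest(fe, re)
```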
By leveraging pglm, you can estimate panel GLMs for diverse data types while accounting for unobserved heterogeneity and correlation structures.
Summary and Conclusions
This tutorial demonstrated how to fit Generalized Linear Models (GLMs) for Panel Data in R using the pglm package. GLMs extend traditional linear models to handle correlated observations in longitudinal datasets with non-Gaussian outcomes. We covered various distributions, including Gaussian, ordinal probit, binomial probit, Poisson, negative binomial, and Tobit models, with examples from different datasets. The results provided insights into the relationships between predictors and outcomes, accounting for panel data structures and individual-specific effects. By applying these models, researchers can analyze longitudinal data with diverse response types and address challenges such as within-subject correlation and unobserved heterogeneity. The flexibility of GLMs for panel data allows for robust statistical inference and model selection in longitudinal studies.