4. Ordinal or Ordered Logistic Regression (OLR)

Ordered logistic regression is used when the dependent variable is an ordinal outcome—a categorical variable with a precise rank order but not necessarily equidistant. This method estimates each response level’s probability while considering the outcome’s ordered nature. Typical applications include customer satisfaction assessment, academic grading, and income classification. In this tutorial, we will discuss ordered logistic regression, manually fit a model to understand its mechanics, and use the polr() function from the {MASS} package in R to implement the model. We will also examine coefficients, odds ratios, and category thresholds. This tutorial will provide a practical approach to mastering ordinal logistic regression through concepts and hands-on coding.

Overview

Ordinal logistic regression (OLR) is a statistical method used to analyze the relationship between one or more independent variables (predictors) and an ordinal dependent variable. An ordinal variable is a type of categorical variable where the categories have a specific order, but the intervals between the categories are not uniformly measured. This technique is commonly used in social sciences, psychology, and other fields where data is measured on Likert scales (e.g., strongly disagree, disagree, neutral, agree, strongly agree) or in scenarios where outcomes are ordered such as grades (e.g., A, B, C, D, F). OLR allows researchers to understand the impact of independent variables on the likelihood of an outcome falling into a particular category or higher.

In ordinal logistic regression (OLR), the dependent variable is assumed to have three or more ordered categories. The objective is to predict the probability of an observation falling into one of the categories based on the independent variables’ values. OLR is an extension of logistic regression, which is used for binary outcomes, and it is designed to handle multiple ordered outcome categories.

The key difference between ordinal variables and nominal variables (which have no inherent order, like colors or types of fruit) is that ordinal variables convey a rank or order.

Proportional Odds Assumption: The primary assumption in ordinal logistic regression is that the relationship between each pair of outcome groups is the same across all levels of the independent variables. In other words, the odds ratios comparing any two outcome categories are constant across all levels of the predictor variables.

Model Specification:
- Let $Y$ be an ordinal variable with $K$ categories. Define $P(Y \leq k)$ as the probability of being in category $k$ or lower.
- The model assumes that each category has a threshold $\theta_k$ such that:
  
  \[ \text{logit}(P(Y \leq k)) = \theta_k - X\beta \]
  
  for $k = 1, \ldots, K-1$
Link Function:
- The cumulative logit link function is typically used, so:
  
  \[ P(Y \leq k) = \frac{1}{1 + e^{-(\theta_k - X\beta)}} \]
Parameter Estimation:
- Maximum likelihood estimation is used to estimate both $\beta$ and $_k$0.
- The log-likelihood is based on the cumulative probabilities:
  
  \[ \mathcal{L}(\beta, \theta) = \sum_{i=1}^n \sum_{k=1}^{K-1} I(Y_i \leq k) \ln(P(Y_i \leq k)) + I(Y_i > k) \ln(1 - P(Y_i \leq k)) \]
Prediction:
- The model predicts the probability of $Y$ falling within each category by computing $P(Y \leq k)$ for each $k$.
Interpretation:
- The coefficients $\beta$ represent the change in the log-odds of being in a higher vs. lower category for a one-unit increase in $X$.

Ordinal Model from Scratch

Fitting an ordinal logistic regression model from scratch in R without using any external packages involves several steps, including creating synthetic data, calculating the cumulative probabilities, estimating the model coefficients using Maximum Likelihood Estimation (MLE), and deriving the summary statistics and odds ratios with confidence intervals. Below, I will outline how to accomplish this step-by-step.

Generate Synthetic Data

We’ll create a synthetic dataset with 6 continuous predictors, one categorical predictor, and an ordinal response variable with three levels.

Code

# Load required package for numerical derivatives
if (!require(numDeriv)) install.packages("numDeriv")

Loading required package: numDeriv

Code

# Step 1: Generate Synthetic Data
set.seed(123)  # For reproducibility
n <- 200  # Number of observations

# Continuous predictors
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
X4 <- rnorm(n)


# Categorical predictor (two levels)
land_type <- sample(c("high land", "medium high land"), n, replace = TRUE)

# Convert land_type to a numeric binary variable
land_type_numeric <- ifelse(land_type == "high land", 1, 0)

# Create a linear predictor and generate response variable
linear_predictor <- 1 + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + 
                    0.7 * land_type_numeric + 
                    rnorm(n)

# Define thresholds for response categories
thresholds <- c(-Inf, 0, 1, Inf)  # Create three categories (Low, Moderate, High)
response <- cut(linear_predictor, breaks = thresholds, labels = c("Low", "Moderate", "High"))

# Combine data into a data frame
data <- data.frame(X1, X2, X3, X4,  land_type_numeric, response)
head(data)

           X1         X2          X3          X4 land_type_numeric response
1 -0.56047565  2.1988103 -0.07355602  1.07401226                 0      Low
2 -0.23017749  1.3124130 -1.16865142 -0.02734697                 1 Moderate
3  1.55870831 -0.2651451 -0.63474826 -0.03333034                 0     High
4  0.07050839  0.5431941 -0.02884155 -1.51606762                 0     High
5  0.12928774 -0.4143399  0.67069597  0.79038534                 1     High
6  1.71506499 -0.4762469 -1.65054654 -0.21073418                 0     High

Define Model Functions

Code

# Step 2: Define Model Functions

# Function to compute cumulative probabilities
cumulative_probs <- function(X, beta, alpha) {
  eta <- cbind(1, X) %*% beta  # Linear predictor with intercept
  prob_low <- 1 / (1 + exp(-(eta - alpha[1])))
  prob_moderate <- prob_low / (1 + exp(-(eta - alpha[2])))
  prob_high <- 1 - prob_moderate
  return(cbind(prob_low, prob_moderate, prob_high))
}

# Function to compute the log-likelihood
log_likelihood <- function(params, data) {
  beta <- params[1:6]  # 8 coefficients for predictors + intercept
  alpha <- params[7:8]  # Two thresholds

  # Define predictor matrix X (5 predictors plus intercept)
  X <- as.matrix(data[, c("X1", "X2", "X3", "X4", "land_type_numeric")])

  response <- factor(data$response, levels = c("Low", "Moderate", "High"))
  
  # Calculate cumulative probabilities
  probs <- cumulative_probs(X, beta, alpha)
  
  # Compute log likelihood
  ll <- sum(log(ifelse(response == "Low", probs[, 1],
                ifelse(response == "Moderate", probs[, 2],
                       probs[, 3]))))
  return(-ll)  # Return negative log-likelihood for minimization
}

Fit Model

Code

# Step 3: Fit the Model

# Function to fit the model using optim
fit_model <- function(data) {
  init_params <- rep(0, 8)  # 8 for beta (including intercept) and 2 thresholds
  result <- optim(init_params, log_likelihood, data = data, method = "BFGS", hessian = TRUE)
  return(result)
}

# Fit the model
model_fit <- fit_model(data)
model_fit$par  # Model coefficients (beta) and thresholds (alpha)

[1]  4.314075703 -0.982913650  0.524043957  0.042348956 -0.009382339
[6] -1.523786205 -8.913837538  4.599729898

Extract Coefficients and Compute Standard Errors

Code

# Step 4: Extract Coefficients and Compute Standard Errors

# Compute the Hessian matrix and standard errors
hessian_matrix <- model_fit$hessian
standard_errors <- sqrt(diag(solve(hessian_matrix)))
standard_errors

[1] 1.677793e+04 2.382154e-01 2.000676e-01 2.103156e-01 1.922579e-01
[6] 4.043400e-01 1.677721e+04 1.677793e+04

Create a Summary Table

Code

# Step 5: Create a Summary Table

# Coefficients and standard errors
coefficients <- model_fit$par
t_values <- coefficients / standard_errors
p_values <- 2 * (1 - pnorm(abs(t_values)))  # Two-tailed p-values

# Summary table
summary_table <- data.frame(
  Value = coefficients,
  Std_Error = standard_errors,
  t_value = t_values,
  p_value = p_values
)

rownames(summary_table) <- c(paste("X", 1:4, sep = ""), "land_high", "Intercept", "alpha_1", "alpha_2")
summary_table

                 Value    Std_Error       t_value      p_value
X1         4.314075703 1.677793e+04  0.0002571280 9.997948e-01
X2        -0.982913650 2.382154e-01 -4.1261545224 3.688795e-05
X3         0.524043957 2.000676e-01  2.6193349716 8.810139e-03
X4         0.042348956 2.103156e-01  0.2013590786 8.404178e-01
land_high -0.009382339 1.922579e-01 -0.0488008122 9.610780e-01
Intercept -1.523786205 4.043400e-01 -3.7685764816 1.641812e-04
alpha_1   -8.913837538 1.677721e+04 -0.0005313062 9.995761e-01
alpha_2    4.599729898 1.677793e+04  0.0002741536 9.997813e-01

Calculate Odds Ratio and Confidence Intervals

Code

# Step 6: Calculate Odds Ratios and Confidence Intervals

# Calculate odds ratios and CI
odds_ratios <- exp(coefficients)
ci_lower <- exp(coefficients - 1.96 * standard_errors)
ci_upper <- exp(coefficients + 1.96 * standard_errors)

# Create a table for odds ratios and CI
odds_ratios_table <- data.frame(
  Odds_Ratio = odds_ratios,
  CI_Lower = ci_lower,
  CI_Upper = ci_upper
)

rownames(odds_ratios_table) <- rownames(summary_table)
odds_ratios_table

            Odds_Ratio  CI_Lower  CI_Upper
X1        7.474451e+01 0.0000000       Inf
X2        3.742192e-01 0.2346135 0.5968966
X3        1.688843e+00 1.1410074 2.4997141
X4        1.043258e+00 0.6908244 1.5754919
land_high 9.906615e-01 0.6796293 1.4440376
Intercept 2.178854e-01 0.0986386 0.4812926
alpha_1   1.345146e-04 0.0000000       Inf
alpha_2   9.945745e+01 0.0000000       Inf

Cross-validation

To evaluate the model’s performance using cross-validation, we’ll use k-fold cross-validation, where we split the data into k folds, train on k-1 folds, and test on the remaining fold. We repeat this process k times, each time using a different fold as the test set, and then compute the average accuracy across all folds.

Code

# Load required package for sampling
if (!require(numDeriv)) install.packages("numDeriv")

# Function to calculate accuracy
calculate_accuracy <- function(true_labels, predicted_labels) {
  mean(true_labels == predicted_labels)
}

# Define the model fitting function
fit_model <- function(data) {
  init_params <- rep(0, 8)  # 8 for beta (including intercept) and 2 thresholds
  result <- optim(init_params, log_likelihood, data = data, method = "BFGS", hessian = TRUE)
  return(result)
}

# Define the prediction function
predict_ordinal <- function(test_data, params) {
  beta <- params[1:6]  # Coefficients for predictors + intercept
  alpha <- params[7:8]  # Thresholds

  # Define predictor matrix X for test data
  X <- as.matrix(test_data[, c("X1", "X2", "X3", "X4",  "land_type_numeric")])
  
  # Compute cumulative probabilities
  eta <- cbind(1, X) %*% beta  # Linear predictor with intercept
  prob_low <- 1 / (1 + exp(-(eta - alpha[1])))
  prob_moderate <- prob_low / (1 + exp(-(eta - alpha[2])))
  prob_high <- 1 - prob_moderate
  
  # Combine probabilities into a matrix
  probs <- cbind(prob_low, prob_moderate, prob_high)
  
  # Assign predicted category based on max probability
  predictions <- apply(probs, 1, function(row) {
    if (row[1] >= row[2] && row[1] >= row[3]) {
      "Low"
    } else if (row[2] >= row[1] && row[2] >= row[3]) {
      "Moderate"
    } else {
      "High"
    }
  })
  return(predictions)
}

# Function for k-fold cross-validation
k_fold_cv <- function(data, k = 5) {
  n <- nrow(data)
  fold_size <- floor(n / k)
  indices <- sample(1:n)
  
  accuracies <- c()  # Store accuracy for each fold
  
  for (i in 1:k) {
    # Create training and validation sets
    val_indices <- indices[((i - 1) * fold_size + 1):(i * fold_size)]
    train_indices <- setdiff(indices, val_indices)
    
    train_data <- data[train_indices, ]
    val_data <- data[val_indices, ]
    
    # Fit the model on the training set
    fold_fit <- fit_model(train_data)
    
    # Predict on validation set
    val_predictions <- predict_ordinal(val_data, fold_fit$par)
    
    # Calculate accuracy for the current fold
    fold_accuracy <- calculate_accuracy(val_data$response, val_predictions)
    accuracies <- c(accuracies, fold_accuracy)
  }
  
  # Return the average cross-validation accuracy
  cv_accuracy <- mean(accuracies)
  return(cv_accuracy)
}

# Perform 5-fold cross-validation on the entire dataset
set.seed(123)
cv_accuracy <- k_fold_cv(data, k = 5)
print(paste("5-Fold Cross-Validation Accuracy:", round(cv_accuracy * 100, 2), "%"))

[1] "5-Fold Cross-Validation Accuracy: 11.5 %"

Fit Ordinal Logistic Regression in R

Fit an ordinal logistic regression model in R using the polr() function from the {MASS} package. We will use the same synthetic dataset created earlier to demonstrate the process. The polr() function fits a proportional odds model to an ordinal response variable using a logistic link function. The model assumes that the coefficients are the same across all levels of the response variable.

Install Required R Packages

Following are the required R packages for this analysis:

Code

packages <- c(
  "tidyverse", 
  "plyr", 
  "gt", 
  "rstatix", 
  "gtsummary",
  "report",
  "performance",
  "jtools", "sjPlot", 
  "margins",
  "marginaleffects",
  "ggeffects", 
  "patchwork", 
  "Metrics", 
  "ggpmisc",
  "metrica",
  "RColorBrewer",
  "MASS", 
  "generalhoslem",
  "kableExtra",
  "numDeriv",
  "ggstatsplot"
)

#| warning: false
#| error: false

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load R-packages

Code

# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Code

# Check loaded packages
cat("Successfully loaded packages:\n")

Successfully loaded packages:

Code

print(search()[grepl("package:", search())])

 [1] "package:ggstatsplot"     "package:kableExtra"     
 [3] "package:generalhoslem"   "package:reshape"        
 [5] "package:MASS"            "package:RColorBrewer"   
 [7] "package:metrica"         "package:ggpmisc"        
 [9] "package:ggpp"            "package:Metrics"        
[11] "package:patchwork"       "package:ggeffects"      
[13] "package:marginaleffects" "package:margins"        
[15] "package:sjPlot"          "package:jtools"         
[17] "package:performance"     "package:report"         
[19] "package:gtsummary"       "package:rstatix"        
[21] "package:gt"              "package:plyr"           
[23] "package:lubridate"       "package:forcats"        
[25] "package:stringr"         "package:dplyr"          
[27] "package:purrr"           "package:readr"          
[29] "package:tidyr"           "package:tibble"         
[31] "package:ggplot2"         "package:tidyverse"      
[33] "package:numDeriv"        "package:stats"          
[35] "package:graphics"        "package:grDevices"      
[37] "package:utils"           "package:datasets"       
[39] "package:methods"         "package:base"

Data

Our goal is to develop a Ordinal regression model to predict ordinal class of paddy soil arsenic (non-contaminated, moderately-contaminated and highly-contaminated ) using selected irrigation water and soil properties. We have available data of 263 paired groundwater and paddy soil samples from arsenic contaminated areas in Tala Upazilla, Satkhira district, Bangladesh. This data was utilized in a publication titled “Factors Affecting Paddy Soil Arsenic Concentration in Bangladesh: Prediction and Uncertainty of Geostatistical Risk Mapping” which can be accessed via the this URL

Full data set is available for download can download from my Dropbox or from my Github accounts.

We will use read_csv() function of readr package to import data as a tidy data.

Code

# Load data
mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/bd_soil_arsenic.csv")

Convert Continuous Variables into Ordinal Variables

An ordinal variable is a type of categorical variable in which the categories have a natural order or hierarchy. Unlike interval or ratio variables, the intervals between the categories are not necessarily equal or measurable. This means that while the categories have a meaningful sequence or ranking, the differences between the categories may not be consistent or quantifiable.

We will convert Soil As (SAs) into three classes:

A. Non-contaminated, SAs < 14.8 mg/kg

B. Moderately-contaminated, SAs 14.8 - 20 mg/kg

C. Highly-contaminated: SAs > 20 mg/kg

Note

14.8 mg/kg is the upper baseline soil arsenic concentration for Bangladesh (Ahmed et al, 2011)
20 mg/kg is the permissible limits of arsenic in agricultural soil [(A Heikens, 2006)](https://agris.fao.org/search/en/providers/122621/records/6472474853aa8c8963049da

Code

mf$Class_As<- cut(mf$SAs, 
                   breaks=c(-Inf, 14.8, 20,  Inf), 
                   labels=c("Non-contaminated",
                            "Moderately-contaminated", 
                            "Highly-contaminated"))

We can create a nice looking plots with results of ANOVA and post-hoc tests on the same plot (directly on the boxplots). We will use gbetweenstats() function of ggstatsplot package:

Code

p1<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Class_As,
  y = WAs,
  ylab = "Water As (ppm)",
  xlab = "As-contamination",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Water As") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=8))

p2<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Class_As,
  y = WFe,
  ylab = "Water Fe (%)",
  xlab = "As-contamination",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("   Water Fe") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=8))

p3<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Class_As,
  y = SAoFe,
  ylab = "Soil Oxlate-Fe",
  xlab = "As-contamination",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Soil Oxalate Extractable Fe") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=8))

p4<-ggstatsplot::ggbetweenstats(
  data = mf,
  x = Class_As,
  y = SOC,
  ylab = "Soil OC (%)",
  xlab = "As-contamination",
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)+
# add plot title
ggtitle("Soil Organic Carbon") +
   theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14), 
    # X and  axis font size
    axis.text.y=element_text(size=12,vjust = 0.5, hjust=0.5),
    axis.text.x = element_text(size=8))

Code

(p1|p2)/(p3|p4)

Data Processing

Code

df <- mf |> 
  # select variables
  dplyr::select (WAs,  WFe,  
                SOC, SAoFe, 
                Year_Irrigation, Distance_STW,
                Land_type, Class_As) |> 
   # convert to factor
   dplyr::mutate_at(vars(Land_type), funs(factor))  |> 
   dplyr::mutate_at(vars(Class_As), funs(factor))  |> 
   # normalize the all numerical features 
   dplyr::mutate_at(1:6,  funs((.-min(.))/max(.-min(.)))) |> 
   glimpse()

Rows: 263
Columns: 8
$ WAs             <dbl> 0.10738255, 0.10738255, 0.15212528, 0.24832215, 0.1364…
$ WFe             <dbl> 0.1804348, 0.3423913, 0.8608696, 0.7391304, 0.3260870,…
$ SOC             <dbl> 0.5333333, 0.3428571, 0.3904762, 0.5095238, 0.3428571,…
$ SAoFe           <dbl> 0.3807107, 0.4238579, 0.2944162, 0.3807107, 0.2690355,…
$ Year_Irrigation <dbl> 0.68421053, 1.00000000, 0.47368421, 0.36842105, 0.4736…
$ Distance_STW    <dbl> 0.04761905, 0.07142857, 0.04761905, 0.11904762, 0.0476…
$ Land_type       <fct> MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL, MHL,…
$ Class_As        <fct> Highly-contaminated, Highly-contaminated, Highly-conta…

Split Data

Code

seeds = 11076
tr_prop = 0.70
# training data (70% data)
train= ddply(df,.(Land_type, Class_As),
                 function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test = ddply(df, .(Land_type, Class_As),
            function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
print(prop.table(table(train$Class_As)))


       Non-contaminated Moderately-contaminated     Highly-contaminated 
              0.3571429               0.3186813               0.3241758

Code

print(prop.table(table(test$Class_As)))


       Non-contaminated Moderately-contaminated     Highly-contaminated 
              0.3456790               0.3333333               0.3209877

Code

levels(train$Class_As)

[1] "Non-contaminated"        "Moderately-contaminated"
[3] "Highly-contaminated"

Fit an Ordinal Regression Model

We’ll use the polr() function from the {MASS} package to fit the model. Include the Hess=TRUE option to speed up subsequent calls to summary().

Only Intercept model

Code

# Fit the ordinal logistic regression model
inter.ordinal<-MASS::polr(Class_As~1, data= train,
             Hess = TRUE)
summary(inter.ordinal)

Call:
MASS::polr(formula = Class_As ~ 1, data = train, Hess = TRUE)

No coefficients

Intercepts:
                                            Value   Std. Error t value
Non-contaminated|Moderately-contaminated    -0.5878  0.1547    -3.7996
Moderately-contaminated|Highly-contaminated  0.7346  0.1584     4.6390

Residual Deviance: 399.4273 
AIC: 403.4273

Full Model

Code

# Fit the ordinal logistic regression model
fit.ordinal<-MASS::polr(Class_As~., data= train,
             Hess = TRUE)
summary(fit.ordinal)

Call:
MASS::polr(formula = Class_As ~ ., data = train, Hess = TRUE)

Coefficients:
                  Value Std. Error t value
WAs              3.6698     1.1037  3.3249
WFe              2.2060     0.7827  2.8185
SOC              1.6400     1.3379  1.2258
SAoFe           -0.6631     1.0238 -0.6477
Year_Irrigation  4.7694     0.8589  5.5532
Distance_STW    -3.5255     1.2632 -2.7909
Land_typeMHL     1.3473     0.3650  3.6909

Intercepts:
                                            Value   Std. Error t value
Non-contaminated|Moderately-contaminated     2.8381  0.7917     3.5847
Moderately-contaminated|Highly-contaminated  5.0791  0.8642     5.8771

Residual Deviance: 284.2497 
AIC: 302.2497

Check the Overall Model Fit

Code

anova(inter.ordinal,fit.ordinal)

Likelihood ratio tests of ordinal regression models

Response: Class_As
                                                                 Model
1                                                                    1
2 WAs + WFe + SOC + SAoFe + Year_Irrigation + Distance_STW + Land_type
  Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
1       180   399.4273                              
2       173   284.2497 1 vs 2     7 115.1776       0

From the above output, you see that the chi-square is 284.2497 and p = <0.0001. This means that you can reject the null hypothesis that the model without predictors is as good as the model with the predictors

Goodness of Fit Tests

The Lipsitz goodness-of-fit test is a statistical tool used to determine how well a logistic regression model predicts binary outcomes. It checks if the observed frequencies of the outcome variable in the sample align with the expected frequencies predicted by the logistic regression model.

Code

generalhoslem::lipsitz.test(fit.ordinal)


    Lipsitz goodness of fit test for ordinal response models

data:  formula:  Class_As ~ WAs + WFe + SOC + SAoFe + Year_Irrigation + Distance_STW + formula:      Land_type
LR statistic = 17.772, df = 9, p-value = 0.03792

Calculate p-values

In a summary output or an ordinal model, p-values are not provided. One way to calculate a p-values is by comparing the t-value against the standard normal distribution, similar to a z test. However, this is only accurate with infinite degrees of freedom, but it can be reasonably approximated by large samples. The accuracy decreases as the sample size decreases.

Code

ctable<-coef(summary(fit.ordinal))
ctable

                                                 Value Std. Error    t value
WAs                                          3.6697587  1.1037204  3.3248987
WFe                                          2.2059640  0.7826794  2.8184772
SOC                                          1.6399837  1.3378986  1.2257908
SAoFe                                       -0.6631162  1.0237922 -0.6477059
Year_Irrigation                              4.7694354  0.8588574  5.5532329
Distance_STW                                -3.5254761  1.2632236 -2.7908568
Land_typeMHL                                 1.3472678  0.3650260  3.6908817
Non-contaminated|Moderately-contaminated     2.8380559  0.7917057  3.5847361
Moderately-contaminated|Highly-contaminated  5.0790576  0.8642141  5.8770827

Code

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable

                                                 Value Std. Error    t value
WAs                                          3.6697587  1.1037204  3.3248987
WFe                                          2.2059640  0.7826794  2.8184772
SOC                                          1.6399837  1.3378986  1.2257908
SAoFe                                       -0.6631162  1.0237922 -0.6477059
Year_Irrigation                              4.7694354  0.8588574  5.5532329
Distance_STW                                -3.5254761  1.2632236 -2.7908568
Land_typeMHL                                 1.3472678  0.3650260  3.6908817
Non-contaminated|Moderately-contaminated     2.8380559  0.7917057  3.5847361
Moderately-contaminated|Highly-contaminated  5.0790576  0.8642141  5.8770827
                                                 p value
WAs                                         8.845065e-04
WFe                                         4.825204e-03
SOC                                         2.202774e-01
SAoFe                                       5.171752e-01
Year_Irrigation                             2.804340e-08
Distance_STW                                5.256873e-03
Land_typeMHL                                2.234780e-04
Non-contaminated|Moderately-contaminated    3.374192e-04
Moderately-contaminated|Highly-contaminated 4.175597e-09

Confidence Intervals

We can calculate confidence intervals for the parameter estimates using two methods: profiling the likelihood function or using standard errors and assuming a normal distribution. It’s important to note that profiled confidence intervals are not always symmetric, although they are typically close to being symmetric. If the 95% confidence interval does not include 0, the parameter estimate is considered statistically significant.

Code

ci <- confint.default(fit.ordinal)
ci

                     2.5 %    97.5 %
WAs              1.5065064  5.833011
WFe              0.6719406  3.739987
SOC             -0.9822493  4.262217
SAoFe           -2.6697121  1.343480
Year_Irrigation  3.0861058  6.452765
Distance_STW    -6.0013488 -1.049603
Land_typeMHL     0.6318300  2.062706

Odds Ratio

The coefficients in the model can be hard to understand because they are scaled in logs. Another way to interpret logistic regression models is to change the coefficients into odds ratios. To find the odds ratio and confidence intervals, we just need to raise the estimates and confidence intervals to the power of the mathematical constant e.

Code

## OR and CI
exp(cbind(OR = coef(fit.ordinal), ci))

                          OR        2.5 %      97.5 %
WAs              39.24243587  4.510943773 341.3850517
WFe               9.07899994  1.958033425  42.0974632
SOC               5.15508558  0.374467868  70.9671232
SAoFe             0.51524322  0.069272167   3.8323555
Year_Irrigation 117.85268444 21.891660531 634.4541663
Distance_STW      0.02943779  0.002475411   0.3500766
Land_typeMHL      3.84690063  1.881049711   7.8672267

The tbl_regression() function from the gtsummary package takes a regression model object as input and produces a formatted table with Odd-ratio and confidence interval.

Code

tbl_regression(fit.ordinal,  exp = TRUE)

Characteristic	OR	95% CI
WAs	39.2	4.79, 368
WFe	9.08	2.00, 43.5
SOC	5.16	0.38, 74.9
SAoFe	0.52	0.07, 3.84
Year_Irrigation	118	23.1, 677
Distance_STW	0.03	0.00, 0.32
Land_type
HL	—	—
MHL	3.85	1.89, 7.95
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

The tab_model() function of {sjPlot} package also creates HTML tables from regression models:

Code

tab_model(fit.ordinal)

	Class_As
Predictors	Odds Ratios	CI	p
Non-contaminated\|Moderately-contaminated	17.08	3.58 – 81.51	<0.001
Moderately-contaminated\|Highly-contaminated	160.62	29.17 – 884.32	<0.001
WAs	39.24	4.79 – 367.93	0.001
WFe	9.08	2.00 – 43.48	0.005
SOC	5.16	0.38 – 74.85	0.222
SAoFe	0.52	0.07 – 3.84	0.518
Year Irrigation	117.85	23.12 – 676.76	<0.001
Distance STW	0.03	0.00 – 0.32	0.006
Land type [MHL]	3.85	1.89 – 7.95	<0.001
Observations	182
R² Nagelkerke	0.528

plot_model() function of {sjPlot} package creates plots the estimates from logistic model:

Code

plot_model(fit.ordinal, vline.color = "red")

In the summary output, you’ll see coefficients for each predictor variable, along with their standard errors and z-values. These coefficients represent the log odds ratios.

Interpret the Odds Ratios

Exponentiate the coefficient: This converts the log odds ratio to an odds ratio. For example, if the coefficient for WAs is 8.195, the odds ratio associated with a one-unit increase in WAs is $e^{8.195}$
Interpret the odds ratio: When the odds ratio is greater than 1, it means that for each one-unit increase in the predictor variable, the odds of being in a higher category of the outcome increase by the value of the odds ratio. On the other hand, if the odds ratio is less than 1, it indicates that for each one-unit increase in the predictor variable, the odds of being in a higher category of the outcome decrease by the reciprocal of the odds ratio.
Check the significance: When interpreting the results, it’s important to pay attention to the p-values associated with each coefficient. A p-value that is less than 0.05 suggests that the odds ratio is significantly different from 1, indicating a strong association between the predictor variable and the outcome category. It’s crucial to remember that interpretation should always be done in the context of the specific research question and the nature of the data being analyzed.

Model Performance

Code

performance::performance(fit.ordinal)

Can't calculate log-loss.

# Indices of model performance

AIC     |    AICc |     BIC | Nagelkerke's R2 |  RMSE | Sigma
-------------------------------------------------------------
302.250 | 303.296 | 331.086 |           0.528 | 1.848 | 1.274

Fitted Values

Code

fitted.values<-fit.ordinal$fitted.values  
head(fitted.values, 20)

   Non-contaminated Moderately-contaminated Highly-contaminated
1         0.8631743              0.12024687         0.016578838
2         0.6867268              0.26700216         0.046271076
3         0.6603992              0.28774676         0.051854054
4         0.4405735              0.44045087         0.118975592
5         0.6659947              0.28336912         0.050636133
6         0.9401567              0.05311928         0.006724048
7         0.8908020              0.09632876         0.012869261
8         0.2634417              0.50736032         0.229198020
9         0.8173248              0.15945703         0.023218159
10        0.9608072              0.03487328         0.004319519
11        0.1295409              0.45367217         0.416786952
12        0.6671589              0.28245621         0.050384933
13        0.9135523              0.07648411         0.009963608
14        0.8958100              0.09197154         0.012218457
15        0.9531342              0.04166365         0.005202141
16        0.7937714              0.17934039         0.026888184
17        0.7784661              0.19215760         0.029376270
18        0.9104555              0.07919293         0.010351577
19        0.8695884              0.11471249         0.015699138
20        0.6293644              0.31169596         0.058939689

Test the Goodness of Fit

Code

# Test the goodness of fit
chisq.test(train$Class_As,predict(fit.ordinal))


    Pearson's Chi-squared test

data:  train$Class_As and predict(fit.ordinal)
X-squared = 96.166, df = 4, p-value < 2.2e-16

Confusion Matrix

Code

predict.train <-  predict(fit.ordinal,train)
c.train <- table(train$Class_As,predict.train)
c.train

                         predict.train
                          Non-contaminated Moderately-contaminated
  Non-contaminated                      49                      14
  Moderately-contaminated               15                      26
  Highly-contaminated                    2                      15
                         predict.train
                          Highly-contaminated
  Non-contaminated                          2
  Moderately-contaminated                  17
  Highly-contaminated                      42

Misclassification Error

Code

mean(as.character(train$Class_As) != as.character(predict.train))

[1] 0.3571429

Marginal Effects and Adjusted Predictions

Code

# Make predictions on new data
margins::margins(fit.ordinal, variables = "WAs")

Average marginal effects

MASS::polr(formula = Class_As ~ ., data = train, Hess = TRUE)

     WAs
 -0.4868

ggeffects supports labelled data and the plot()- method automatically sets titles, axis - and legend-labels depending on the value and variable labels of the data.

Code

predict_response(fit.ordinal, terms = c("WAs", "WFe")) |> 
plot(facets = TRUE)

Cross-validation

Code

# Set the number of folds
K <- 5

# Create a vector to store misclassification rates for each fold
misclassification_rates <- numeric(K)

# Set seed for reproducibility
set.seed(123)

# Generate fold indices
folds <- sample(rep(1:K, length.out = nrow(df)))

# Loop through each fold
for (k in 1:K) {
  # Split the data into training and test sets
  train_data <- df[folds != k, ]
  test_data <- df[folds == k, ]
  
  # Fit the model on the training data
  model<-MASS::polr(Class_As~., data= df,
             Hess = TRUE)
  
  # Predict the class labels on the test set
    predicted_classes <- predict(model, newdata = test_data,  type = "class")
  
  # Calculate the misclassification rate for the current fold
  misclassified <- mean(predicted_classes == test_data$Class_As)
  fold_misclassification_rate <- misclassified / nrow(test_data)
  
  # Store the misclassification rate for this fold
  misclassification_rates[k] <- fold_misclassification_rate
}

# Calculate the average misclassification rate across all folds
avg_misclassification_rate <- mean(misclassification_rates)

# Display results
cat("Cross-Validated Misclassification Rates for Each Fold:", misclassification_rates, "\n")

Cross-Validated Misclassification Rates for Each Fold: 0.01281595 0.01174795 0.01317195 0.01368343 0.008505917

Code

cat("Average Cross-Validated Misclassification Rate:", avg_misclassification_rate, "\n")

Average Cross-Validated Misclassification Rate: 0.01198504

Prediction at Data

Predicted Classes

Code

##PREDICT 
predictedClass <- predict(fit.ordinal, test)  
head(predictedClass)

[1] Moderately-contaminated Non-contaminated        Non-contaminated       
[4] Non-contaminated        Non-contaminated        Non-contaminated       
Levels: Non-contaminated Moderately-contaminated Highly-contaminated

Predicted Probabilities

Code

predictedScores <- predict(fit.ordinal, test, type="p")  
head(predictedScores)

  Non-contaminated Moderately-contaminated Highly-contaminated
1        0.3471505               0.4861795          0.16667009
2        0.8299982               0.1486830          0.02131881
3        0.8603705               0.1226625          0.01696701
4        0.8300269               0.1486586          0.02131457
5        0.8088798               0.1666076          0.02451262
6        0.8108486               0.1649426          0.02420873

Confusion Matrix

Code

table(test$Class_As, predictedClass)  # confusion matrix

                         predictedClass
                          Non-contaminated Moderately-contaminated
  Non-contaminated                      21                       5
  Moderately-contaminated                8                      10
  Highly-contaminated                    2                       4
                         predictedClass
                          Highly-contaminated
  Non-contaminated                          2
  Moderately-contaminated                   9
  Highly-contaminated                      20

Mis-classifiaction Error and Accuracy

Code

mean(as.character(test$Class_As) != as.character(predictedClass))

[1] 0.3703704

Summary and Conclusion

Understanding ordinal regression and how to implement it in R can provide valuable insights into the relationships between predictor variables and ordinal response variables. This can lead to a deeper understanding of the factors influencing the outcome of interest. It’s important to approach ordinal regression carefully, taking into consideration the specific data and research question.

References

Session Info

Code

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggstatsplot_0.13.0     kableExtra_1.4.0       generalhoslem_1.3.4   
 [4] reshape_0.8.9          MASS_7.3-64            RColorBrewer_1.1-3    
 [7] metrica_2.1.0          ggpmisc_0.6.1          ggpp_0.5.8-1          
[10] Metrics_0.1.4          patchwork_1.3.0        ggeffects_2.2.0       
[13] marginaleffects_0.25.0 margins_0.3.28         sjPlot_2.8.17         
[16] jtools_2.3.0           performance_0.13.0     report_0.6.1          
[19] gtsummary_2.1.0        rstatix_0.7.2          gt_0.11.1             
[22] plyr_1.8.9             lubridate_1.9.4        forcats_1.0.0         
[25] stringr_1.5.1          dplyr_1.1.4            purrr_1.0.4           
[28] readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[31] ggplot2_3.5.1          tidyverse_2.0.0        numDeriv_2016.8-1.1   

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1      jsonlite_1.9.0         datawizard_1.0.2      
  [4] correlation_0.8.7      magrittr_2.0.3         TH.data_1.1-3         
  [7] estimability_1.5.1     farver_2.1.2           rmarkdown_2.29        
 [10] minerva_1.5.10         vctrs_0.6.5            memoise_2.0.1         
 [13] paletteer_1.6.0        base64enc_0.1-3        effectsize_1.0.0      
 [16] htmltools_0.5.8.1      energy_1.7-12          polynom_1.4-1         
 [19] haven_2.5.4            curl_6.2.1             broom_1.0.7           
 [22] Formula_1.2-5          sjmisc_2.8.10          sass_0.4.9            
 [25] parallelly_1.42.0      htmlwidgets_1.6.4      sandwich_3.1-1        
 [28] emmeans_1.11.0         zoo_1.8-13             cachem_1.1.0          
 [31] commonmark_1.9.2       lifecycle_1.0.4        pkgconfig_2.0.3       
 [34] sjlabelled_1.2.0       Matrix_1.7-1           R6_2.6.1              
 [37] fastmap_1.2.0          future_1.34.0          digest_0.6.37         
 [40] colorspace_2.1-1       rematch2_2.1.2         furrr_0.3.1           
 [43] prismatic_1.1.2        RSQLite_2.3.9          labeling_0.4.3        
 [46] timechange_0.3.0       abind_1.4-8            compiler_4.4.2        
 [49] bit64_4.6.0-1          withr_3.0.2            pander_0.6.5          
 [52] gsl_2.1-8              backports_1.5.0        carData_3.0-5         
 [55] DBI_1.2.3              prediction_0.3.18      ggsignif_0.6.4        
 [58] broom.mixed_0.2.9.6    quantreg_6.1           sjstats_0.19.0        
 [61] tools_4.4.2            statsExpressions_1.6.2 glue_1.8.0            
 [64] nlme_3.1-166           grid_4.4.2             generics_0.1.3        
 [67] gtable_0.3.6           labelled_2.14.0        tzdb_0.4.0            
 [70] data.table_1.17.0      hms_1.1.3              xml2_1.3.6            
 [73] car_3.1-3              markdown_1.13          pillar_1.10.1         
 [76] vroom_1.6.5            splines_4.4.2          lattice_0.22-5        
 [79] survival_3.8-3         bit_4.5.0.1            SparseM_1.84-2        
 [82] tidyselect_1.2.1       knitr_1.49             svglite_2.1.3         
 [85] xfun_0.51              stringi_1.8.4          yaml_2.3.10           
 [88] boot_1.3-31            evaluate_1.0.3         codetools_0.2-20      
 [91] cli_3.6.4              RcppParallel_5.1.10    xtable_1.8-4          
 [94] parameters_0.24.2      systemfonts_1.2.1      munsell_0.5.1         
 [97] Rcpp_1.0.14            globals_0.16.3         zeallot_0.1.0         
[100] coda_0.19-4.1          parallel_4.4.2         rstantools_2.4.0      
[103] MatrixModels_0.5-3     blob_1.2.4             bayestestR_0.15.2     
[106] listenv_0.9.1          broom.helpers_1.20.0   viridisLite_0.4.2     
[109] mvtnorm_1.3-3          scales_1.3.0           crayon_1.5.3          
[112] insight_1.1.0          rlang_1.1.5            multcomp_1.4-28       
[115] cards_0.5.0.9000