2. Factor Analysis

Factor Analysis is used to identify a smaller number of unobserved variables, called factors, from a larger set of observed variables. Unlike PCA, Factor Analysis explicitly models the possibility that the observed variables are correlated with each other because they share underlying factors. It assumes that the observed variables are linearly related to the unobserved factors and seeks to identify the factors that account for the observed correlations among the variables.

Overview

Factor Analysis (FA) is a statistical method used to identify the underlying relationships (factors) between observed variables. It assumes that multiple observed variables can be explained by a smaller number of unobserved latent variables (factors) and that the variability in the observed data is due to these latent variables plus some unique random error.

The mathematical framework of Factor Analysis is as follows:

  1. Model Representation:

The factor analysis model can be expressed as:

\[ \mathbf{X} = \mathbf{\Lambda} \mathbf{F} + \mathbf{\epsilon} \] where:

  • \(\mathbf{X} \in \mathbb{R}^p\): A vector of \(p\) observed variables.
  • \(\mathbf{\Lambda} \in \mathbb{R}^{p \times k}\): The loading matrix, where each entry \(\lambda_{ij}\) represents the loading (influence) of factor \(j\) on variable \(i\).
  • \(\mathbf{F} \in \mathbb{R}^k\): A vector of \(k\) unobserved latent factors (\(k < p\)).
  • \(\mathbf{\epsilon} \in \mathbb{R}^p\): A vector of unique errors, one for each observed variable.
  2. Assumptions:
  • The factors \(\mathbf{F}\) are uncorrelated: \(\text{Cov}(\mathbf{F}) = \mathbf{I}_k\) (identity matrix).
  • The errors \(\mathbf{\epsilon}\) are uncorrelated with each other and with the factors: \(\text{Cov}(\mathbf{\epsilon}) = \mathbf{\Psi}\), where \(\mathbf{\Psi}\) is a diagonal matrix.
  • Observed variables are related to the factors through linear combinations.
  3. Covariance Structure:

From the model, the covariance matrix of the observed variables \(\mathbf{X}\) can be expressed as (a small numerical illustration appears at the end of this framework):

\[ \text{Cov}(\mathbf{X}) = \mathbf{\Sigma} = \mathbf{\Lambda} \mathbf{\Lambda}^T + \mathbf{\Psi} \] where:

  • \(\mathbf{\Lambda} \mathbf{\Lambda}^T\): The shared variance explained by the common factors.
  • \(\mathbf{\Psi}\): The unique variances (specific to each variable).
  4. Variance Decomposition:

For each variable \(i\):

\[ \text{Var}(X_i) = \sum_{j=1}^k \lambda_{ij}^2 + \psi_i \]

where:

  • \(\sum_{j=1}^k \lambda_{ij}^2\): The communality, or the variance explained by the factors.
  • \(\psi_i\): The uniqueness, or the variance unique to the variable.
  5. Objective of Factor Analysis:
  • To estimate the loading matrix \(\mathbf{\Lambda}\) and the unique variances \(\mathbf{\Psi}\).
  • To reduce dimensionality and interpret the latent structure in the data.
  6. Factor Extraction:

There are several methods for extracting factors, including:

  • Principal Factor Method: Uses the eigenvalues and eigenvectors of the reduced covariance matrix.
  • Maximum Likelihood Estimation (MLE): Estimates parameters by maximizing the likelihood of the observed data under the FA model.

  7. Factor Rotation:

After extraction, the factors can be rotated to achieve a simpler and more interpretable structure:

  • Orthogonal Rotation (e.g., Varimax): Keeps the factors uncorrelated.
  • Oblique Rotation (e.g., Promax): Allows factors to be correlated.

  8. Goodness of Fit:

To assess the fit of the factor model, the model-implied covariance matrix \(\mathbf{\Sigma}\) is compared with the observed covariance matrix (for example via a chi-square test or the RMSEA). Before fitting, two statistics are commonly used to judge whether the data are suitable for factor analysis at all:

  • Kaiser-Meyer-Olkin (KMO) statistic: Measures sampling adequacy for factor analysis.
  • Bartlett’s Test of Sphericity: Tests the null hypothesis that the variables are uncorrelated.
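
To make the covariance structure (item 3) and the variance decomposition (item 4) concrete, here is a small numerical sketch; the loading values are made up purely for illustration:

Code
# Hypothetical loading matrix for 3 variables and 2 factors (illustrative values only)
Lambda <- matrix(c(0.8, 0.7, 0.1,
                   0.1, 0.2, 0.9), nrow = 3)
# Communality of each variable: sum of its squared loadings across factors
communality <- rowSums(Lambda^2)
# Uniquenesses chosen so each variable has variance 1 (as for standardized variables)
Psi <- diag(1 - communality)

# Model-implied covariance matrix: Lambda Lambda^T + Psi
Sigma <- Lambda %*% t(Lambda) + Psi
communality   # e.g. 0.8^2 + 0.1^2 = 0.65 for the first variable
diag(Sigma)   # communality + uniqueness = 1 for every variable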

Factor analysis is widely used in psychology, social sciences, finance, and other fields to uncover latent structures and simplify complex datasets.

Factor Analysis from Scratch

We will implement Factor Analysis from scratch in R to understand the underlying concepts and algorithms. We will generate synthetic data, compute the correlation matrix, and estimate the factor loadings and unique variances using Maximum Likelihood and Minimum Residual methods.

Synthetic Data Generation

Let’s generate some synthetic data to demonstrate Factor Analysis. We will create a dataset with 6 observed variables influenced by 2 latent factors. The observed variables will be generated as linear combinations of the latent factors with some added noise.

Code
# Generate synthetic data
set.seed(123) # For reproducibility
n <- 200  # Number of observations
p <- 6    # Number of variables

# Generate latent factors
factor1 <- rnorm(n)
factor2 <- rnorm(n)

# Create observed variables influenced by the factors
X <- data.frame(
  V1 = 0.8 * factor1 + 0.2 * rnorm(n),
  V2 = 0.7 * factor1 + 0.3 * rnorm(n),
  V3 = 0.6 * factor1 + 0.4 * rnorm(n),
  V4 = 0.7 * factor2 + 0.3 * rnorm(n),
  V5 = 0.8 * factor2 + 0.2 * rnorm(n),
  V6 = 0.6 * factor2 + 0.4 * rnorm(n)
)
head(X)
          V1          V2         V3          V4          V5         V6
1 -0.4630917 -0.07012927 -0.1937720  1.24042763  1.88301829  0.8848389
2 -0.4178723 -0.16932833 -0.4013106  0.60670257  0.89842835  0.5213267
3  1.1200170  1.08109672  1.2773059 -0.19099561 -0.04181111  0.1268523
4  0.0506384 -0.40546441  0.5034795  0.34058330  0.28496925  0.1532520
5  0.2375694  0.32761702  0.1880825 -1.05484080 -0.20542399 -0.1575580
6  1.0419427  1.13732524  1.0866809 -0.02120079 -0.16166519  0.2322302

Center the Data

Code
# Center the data (scale = FALSE leaves the variances unchanged)
X_scale <- scale(X, center = TRUE, scale = FALSE)

Compute Correlation Matrix

The next step is to compute the correlation matrix of the centered data.

Code
# Compute the correlation matrix
R <- cor(X_scale)
R
            V1           V2         V3          V4           V5          V6
V1  1.00000000  0.877874232 0.76677643 -0.01294144 -0.052094216 -0.01321996
V2  0.87787423  1.000000000 0.70296402  0.03014416 -0.003693208  0.01919607
V3  0.76677643  0.702964019 1.00000000  0.03225977  0.044670131  0.04355680
V4 -0.01294144  0.030144156 0.03225977  1.00000000  0.902855278  0.79794504
V5 -0.05209422 -0.003693208 0.04467013  0.90285528  1.000000000  0.81138965
V6 -0.01321996  0.019196075 0.04355680  0.79794504  0.811389651  1.00000000

Kaiser-Meyer-Olkin (KMO) Measure

The Kaiser-Meyer-Olkin (KMO) measure is a statistic that indicates the degree to which the variables in the dataset are suitable for factor analysis. It is a measure of sampling adequacy, with values between 0 and 1. Higher values indicate that the variables are more suitable for factor analysis.

Code
# Kaiser-Meyer-Olkin (KMO) Measure
kmo <- function(R) {
  R_inv <- solve(R)
  partial_corr <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  diag(partial_corr) <- 0

  numerator <- sum(R^2) - sum(diag(R)^2)
  denominator <- numerator + sum(partial_corr^2)

  KMO <- numerator / denominator

  individual_kmo <- colSums(R^2) - diag(R)^2
  individual_kmo <- individual_kmo / (individual_kmo + colSums(partial_corr^2))

  return(list(KMO = KMO, individual = individual_kmo))
}

kmo_result <- kmo(R)
cat("Overall KMO Measure:", kmo_result$KMO, "\n")
Overall KMO Measure: 0.7198511 
Code
cat("Individual KMO Measures:\n")
Individual KMO Measures:
Code
print(kmo_result$individual)
       V1        V2        V3        V4        V5        V6 
0.6370567 0.6953375 0.8098881 0.7014303 0.6788390 0.8735113 

Maximum Likelihood Factor Analysis

We now estimate the factor loadings and unique variances from scratch with an iterative algorithm: at each step the current communality estimates are substituted on the diagonal of the correlation matrix, loadings are extracted from an eigendecomposition of this reduced matrix, and the communalities are updated until they converge.

Code
# Maximum Likelihood Factor Analysis
ml_factor_analysis <- function(R, n_factors, max_iter = 100, tol = 1e-6) {
  p <- ncol(R)
  diag_R <- diag(R)
  communality <- diag_R

  for (i in 1:max_iter) {
    # Spectral decomposition
    eig <- eigen(R - diag(diag_R - communality))
    loadings <- eig$vectors[, 1:n_factors] %*% diag(sqrt(eig$values[1:n_factors]))

    # Update communalities
    new_communality <- rowSums(loadings^2)

    # Check convergence
    if (max(abs(new_communality - communality)) < tol) {
      break
    }

    communality <- new_communality
  }

  uniqueness <- diag_R - communality
  return(list(loadings = loadings, uniqueness = uniqueness, communality = communality))
}

ml_result <- ml_factor_analysis(R, n_factors = 2)
cat("\nMaximum Likelihood Factor Loadings:\n")

Maximum Likelihood Factor Loadings:
Code
print(ml_result$loadings)
           [,1]       [,2]
[1,] -0.1018410 -0.9755628
[2,] -0.1355767 -0.8860696
[3,] -0.1476192 -0.7704355
[4,] -0.9339530  0.1168696
[5,] -0.9488113  0.1440653
[6,] -0.8402762  0.1039804
Code
cat("\nUnique Variances:\n")

Unique Variances:
Code
print(ml_result$uniqueness)
        V1         V2         V3         V4         V5         V6 
0.03790653 0.19649917 0.38463754 0.11407325 0.07900237 0.28312396 
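
As a quick sanity check (assuming the objects R and ml_result from the previous chunks), we can verify that the estimated loadings and uniquenesses approximately reproduce the observed correlation matrix, i.e. \(\mathbf{\Lambda}\mathbf{\Lambda}^T + \mathbf{\Psi} \approx \mathbf{R}\):

Code
# Implied correlation matrix from the fitted loadings and uniquenesses
R_implied <- ml_result$loadings %*% t(ml_result$loadings) + diag(ml_result$uniqueness)

# Largest absolute off-diagonal residual (should be close to zero)
resid_mat <- R - R_implied
round(max(abs(resid_mat[upper.tri(resid_mat)])), 4)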

Minimum Residual Factor Analysis

We repeat the same from-scratch exercise for Minimum Residual Factor Analysis. The iteration is identical: reduce the correlation matrix with the current communalities, extract loadings by eigendecomposition, and update the communalities until convergence. Because this simplified implementation uses the same update rule as the previous section, it produces the same loadings and unique variances here.

Code
# Minimum Residual Factor Analysis
minres_factor_analysis <- function(R, n_factors, max_iter = 100, tol = 1e-6) {
  p <- ncol(R)
  diag_R <- diag(R)
  communality <- diag_R

  for (i in 1:max_iter) {
    # Reduce the correlation matrix
    reduced_R <- R - diag(diag_R - communality)

    # Spectral decomposition
    eig <- eigen(reduced_R)
    loadings <- eig$vectors[, 1:n_factors] %*% diag(sqrt(eig$values[1:n_factors]))

    # Update communalities
    new_communality <- rowSums(loadings^2)

    # Check convergence
    if (max(abs(new_communality - communality)) < tol) {
      break
    }

    communality <- new_communality
  }

  uniqueness <- diag_R - communality
  return(list(loadings = loadings, uniqueness = uniqueness, communality = communality))
}

minres_result <- minres_factor_analysis(R, n_factors = 2)
cat("\nMinimum Residual Factor Loadings:\n")

Minimum Residual Factor Loadings:
Code
print(minres_result$loadings)
           [,1]       [,2]
[1,] -0.1018410 -0.9755628
[2,] -0.1355767 -0.8860696
[3,] -0.1476192 -0.7704355
[4,] -0.9339530  0.1168696
[5,] -0.9488113  0.1440653
[6,] -0.8402762  0.1039804
Code
cat("\nUnique Variances:\n")

Unique Variances:
Code
print(minres_result$uniqueness)
        V1         V2         V3         V4         V5         V6 
0.03790653 0.19649917 0.38463754 0.11407325 0.07900237 0.28312396 
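
The from-scratch loadings above are unrotated. As described in the overview, a varimax rotation can be applied to make them easier to interpret; a minimal sketch using the base stats::varimax() function (assuming the minres_result object from the previous chunk):

Code
# Apply an orthogonal (varimax) rotation to the unrotated loadings
rotated <- stats::varimax(minres_result$loadings)
round(unclass(rotated$loadings), 3)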

Factor Analysis in R

In R, Factor Analysis can be performed using the {psych} package, which provides functions for both exploratory and confirmatory factor analysis. The {lavaan} package can be used for structural equation modeling, including confirmatory factor analysis.

Install Required R Packages

The following R packages are required to run this notebook. If any of them are not installed, you can install them using the code below:

Code
packages <- c('tidyverse', 
              'plyr',
              'corrr',
              'ggcorrplot',
              'factoextra',
              'ade4',
              'psych',
              'FactoMineR',
              'lavaan',
              'scico'
         )

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Load Packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:scico"      "package:lavaan"     "package:FactoMineR"
 [4] "package:psych"      "package:ade4"       "package:factoextra"
 [7] "package:ggcorrplot" "package:corrr"      "package:plyr"      
[10] "package:lubridate"  "package:forcats"    "package:stringr"   
[13] "package:dplyr"      "package:purrr"      "package:readr"     
[16] "package:tidyr"      "package:tibble"     "package:ggplot2"   
[19] "package:tidyverse"  "package:stats"      "package:graphics"  
[22] "package:grDevices"  "package:utils"      "package:datasets"  
[25] "package:methods"    "package:base"      

Data

In this exercise we will use the following dataset for Factor Analysis. It contains heavy-metal concentrations in soils of the USA.

usa_geochemical_FA.csv

We will use the read_csv() function from the {readr} package to import the data as a tibble from my GitHub repository. The glimpse() function from the {dplyr} package can be used to get a quick overview of the data.

Code
# Load data
mf<-readr::read_csv("https://github.com/zia207/r-colab/raw/main/Data/Regression_analysis/usa_geochemical_FA.csv")
glimpse(mf)
Rows: 4,813
Columns: 28
$ A_LabID   <chr> "C-320500", "C-299963", "C-319572", "C-355112", "C-319622", …
$ SiteID    <dbl> 847, 7209, 6811, 4072, 4315, 12782, 10830, 252, 1964, 6920, …
$ StateID   <chr> "NV", "ME", "CO", "WI", "CO", "MT", "WA", "PA", "IN", "MO", …
$ Latitude  <dbl> 38.8486, 44.2417, 37.8284, 42.7492, 39.2340, 46.3279, 46.499…
$ Longitude <dbl> -117.5824, -70.4833, -107.6216, -90.2205, -106.2517, -112.90…
$ LandCover <chr> "Shrubland", "Forested Upland", "Shrubland", "Forested Uplan…
$ Al        <dbl> 7.43, 5.35, 5.82, 4.61, 6.24, 3.76, 6.99, 7.53, 9.06, 3.02, …
$ As        <dbl> 195.0, 2.7, 18.4, 7.7, 25.1, 565.0, 4.2, 10.5, 94.5, 9.0, 5.…
$ Ba        <dbl> 496, 352, 686, 610, 1030, 399, 706, 466, 414, 317, 372, 654,…
$ Cd        <dbl> 16.1, 0.5, 5.1, 0.6, 3.2, 2.6, 0.6, 0.2, 1.4, 1.4, 1.1, 1.2,…
$ Ce        <dbl> 42.0, 43.0, 67.9, 51.9, 48.9, 28.0, 67.9, 75.5, 104.0, 47.6,…
$ Co        <dbl> 11.1, 5.2, 14.4, 9.5, 6.8, 8.9, 13.3, 21.1, 56.8, 11.7, 4.8,…
$ Cr        <dbl> 42, 63, 24, 38, 9, 20, 48, 81, 217, 56, 24, 16, 59, 13, 54, …
$ Cs        <dbl> 26.0, 2.5, 12.0, 2.5, 2.5, 5.0, 5.0, 2.5, 7.0, 2.5, 2.5, 2.5…
$ Cu        <dbl> 87.9, 16.5, 141.0, 35.4, 147.0, 232.0, 34.7, 79.8, 98.7, 13.…
$ Fe        <dbl> 4.17, 1.59, 3.63, 2.26, 7.39, 13.00, 3.44, 5.29, 8.28, 2.40,…
$ Hg        <dbl> 8.24, 0.04, 0.07, 0.04, 1.32, 0.44, 0.03, 0.08, 0.32, 0.06, …
$ Li        <dbl> 128, 23, 51, 19, 11, 28, 24, 37, 71, 16, 14, 57, 29, 29, 23,…
$ Mg        <dbl> 1.62, 0.31, 0.77, 0.57, 0.34, 0.62, 0.80, 0.65, 0.39, 0.16, …
$ Mn        <dbl> 740, 635, 4040, 719, 812, 6580, 713, 879, 480, 894, 430, 233…
$ Mo        <dbl> 9.16, 1.06, 9.41, 0.87, 4.23, 70.30, 0.72, 2.10, 5.02, 1.33,…
$ Ni        <dbl> 25.4, 34.8, 9.8, 20.3, 4.0, 7.2, 21.1, 31.9, 217.0, 11.1, 9.…
$ Pb        <dbl> 2200.0, 34.0, 865.0, 49.1, 1510.0, 62.5, 83.0, 268.0, 259.0,…
$ Rb        <dbl> 110.0, 79.0, 143.0, 72.5, 89.1, 46.6, 64.7, 88.6, 69.8, 51.6…
$ Sb        <dbl> 32.30, 1.60, 6.04, 1.21, 6.26, 20.80, 0.77, 1.19, 9.84, 1.13…
$ Se        <dbl> 3.1, 0.2, 0.4, 0.5, 0.7, 0.4, 0.1, 0.9, 1.3, 0.6, 0.4, 0.1, …
$ Sn        <dbl> 3.2, 2.5, 2.1, 1.3, 66.9, 32.6, 2.0, 2.5, 82.4, 1.8, 9.7, 5.…
$ Zn        <dbl> 2130, 1440, 977, 943, 770, 689, 612, 541, 536, 518, 491, 491…

First we will create a subset of the numeric variables from the dataset:

Code
df<- mf[,7:28] 

Scale the data

In R, scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix.

Code
df_scale <- as.data.frame(scale(df, center=TRUE, scale=TRUE)) 
describe(df_scale)
   vars    n mean sd median trimmed  mad   min   max range  skew kurtosis   se
Al    1 4813    0  1   0.03    0.01 1.00 -2.15  5.08  7.24  0.04    -0.08 0.01
As    2 4813    0  1  -0.07   -0.06 0.17 -0.32 56.16 56.49 42.86  2214.48 0.01
Ba    3 4813    0  1  -0.03   -0.04 0.90 -1.78 14.97 16.76  1.89    17.11 0.01
Cd    4 4813    0  1  -0.08   -0.08 0.19 -0.28 59.91 60.18 47.05  2713.71 0.01
Ce    5 4813    0  1  -0.04   -0.06 0.80 -2.01 16.55 18.56  2.45    22.93 0.01
Co    6 4813    0  1  -0.16   -0.13 0.62 -1.11 21.37 22.48  6.02    88.32 0.01
Cr    7 4813    0  1  -0.07   -0.08 0.20 -0.41 42.69 43.10 30.67  1126.12 0.01
Cs    8 4813    0  1  -0.30   -0.19 0.00 -0.30 34.33 34.63 14.81   391.23 0.01
Cu    9 4813    0  1  -0.07   -0.06 0.12 -0.26 67.14 67.40 62.95  4218.14 0.01
Fe   10 4813    0  1  -0.14   -0.11 0.74 -1.50  8.04  9.54  1.96     7.56 0.01
Hg   11 4813    0  1  -0.12   -0.08 0.09 -0.20 47.15 47.35 35.25  1446.40 0.01
Li   12 4813    0  1  -0.12   -0.10 0.72 -1.48 20.48 21.96  3.85    47.50 0.01
Mg   13 4813    0  1  -0.20   -0.15 0.61 -0.92 19.99 20.91  5.83    75.08 0.01
Mn   14 4813    0  1  -0.23   -0.15 0.67 -1.14 11.49 12.64  3.27    21.66 0.01
Mo   15 4813    0  1  -0.13   -0.11 0.26 -0.57 39.51 40.07 21.58   672.27 0.01
Ni   16 4813    0  1  -0.09   -0.08 0.16 -0.34 42.16 42.49 31.40  1154.88 0.01
Pb   17 4813    0  1  -0.09   -0.08 0.16 -0.47 46.76 47.23 31.12  1252.04 0.01
Rb   18 4813    0  1  -0.02   -0.04 0.86 -1.91 11.38 13.30  1.06     5.90 0.01
Sb   19 4813    0  1  -0.03   -0.03 0.04 -0.09 68.89 68.98 67.97  4677.92 0.01
Se   20 4813    0  1  -0.34   -0.17 0.44 -0.63 23.49 24.12  6.70   105.31 0.01
Sn   21 4813    0  1  -0.05   -0.04 0.10 -0.26 63.38 63.64 54.72  3374.51 0.01
Zn   22 4813    0  1  -0.09   -0.09 0.57 -1.06 34.33 35.39 13.95   376.31 0.01

Correlation Analysis

We should take a look at the correlations among the heavy metals to determine whether factor analysis is appropriate. We will compute the correlation matrix with cor() and visualize it with the ggcorrplot() function from the {ggcorrplot} package.

Code
#library(corrplot)
corr_matrix <- cor(df_scale)
ggcorrplot(corr_matrix)

The correlation matrix can be interpreted as follows:

  • The closer a value is to 1, the more positively correlated the two variables are.

  • The closer a value is to -1, the more negatively correlated they are.

Exploratory Factor Analysis (EFA)

Kaiser-Meyer-Olkin (KMO)

The Kaiser-Meyer-Olkin (KMO) statistic is a measure of sampling adequacy used to assess how suitable a set of variables is for factor analysis. It takes values between 0 and 1, with higher values indicating better sampling adequacy: values of 0.5 or lower are generally considered unacceptable, values around 0.6 to 0.7 are marginal, and values above 0.8 are considered good. The statistic compares the magnitude of the correlation coefficients between pairs of variables with the magnitude of the partial correlations between those variables, controlling for all other variables in the analysis. If the correlations are much larger than the partial correlations, the KMO statistic will be high, indicating that the variables are suitable for factor analysis.

The KMO statistic is widely used in the field of psychology and social sciences to assess the quality of data for factor analysis. It is an important tool for researchers to evaluate the suitability of their data for factor analysis and to make decisions about how to proceed with their analysis.

In R, you can compute the Kaiser-Meyer-Olkin (KMO) statistic using the KMO() function from the {psych} package. Here’s an example:

Code
# compute the KMO statistic
KMO(r=cor(df_scale))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(df_scale))
Overall MSA =  0.74
MSA for each item = 
  Al   As   Ba   Cd   Ce   Co   Cr   Cs   Cu   Fe   Hg   Li   Mg   Mn   Mo   Ni 
0.79 0.54 0.80 0.87 0.84 0.82 0.71 0.80 0.84 0.77 0.78 0.83 0.84 0.82 0.57 0.66 
  Pb   Rb   Sb   Se   Sn   Zn 
0.66 0.75 0.50 0.65 0.57 0.84 

According to Kaiser’s (1974) guidelines, a suggested cutoff for determining the factorability of the sample data is KMO ≥ 0.60. The overall KMO here is 0.74, indicating that, based on this test, we can proceed with a factor analysis.

Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity is a statistical test used to determine whether a correlation matrix is significantly different from an identity matrix. It is commonly used in factor analysis to assess whether the variables in a dataset are suitable for dimensionality reduction.

The null hypothesis of Bartlett’s test is that the variables are uncorrelated in the population, while the alternative hypothesis is that they are correlated. If the test statistic is large and the p-value is small (e.g., less than 0.05), then we reject the null hypothesis and conclude that the correlation matrix is significantly different from an identity matrix, and therefore the variables are correlated.

In R, you can perform Bartlett’s test of sphericity using the cortest.bartlett() function from the {psych} package. Here’s an example:

Code
cortest.bartlett(df_scale)
$chisq
[1] 53549.84

$p.value
[1] 0

$df
[1] 231

Small values (p < 0.05) of the significance level indicate that a factor analysis may be useful with our data.

Determine Number of Factors to Extract

Since we are seeking latent structure within a set of data, we are only interested in factors that explain a substantial proportion of the variation. One way to determine the number of factors or components in a data matrix or a correlation matrix is to examine the scree plot of the successive eigenvalues; sharp breaks in the plot suggest the appropriate number of components or factors to extract. First we use the eigen() function from base R to compute the eigenvalues and eigenvectors, and then plot the successive eigenvalues for a scree test using the scree() function from the {psych} package:

Code
ev <- eigen(cor(df_scale)) # get eigenvalues
ev$values
 [1] 5.16673191 2.65225466 2.29148907 1.59301574 1.27836437 1.11969756
 [7] 0.99261922 0.98195391 0.88608815 0.78823343 0.72787642 0.62310820
[13] 0.58619098 0.51085986 0.40663713 0.35703075 0.29902045 0.25368764
[19] 0.15785633 0.13227153 0.10380504 0.09120766
Code
scree(df_scale, pc=FALSE)  # Use pc=FALSE for factor analysis

Parallel analysis is an alternative technique that compares the scree of factors of the observed data with that of a random data matrix of the same size as the original. This may be done for continuous, dichotomous, or polytomous data using Pearson, tetrachoric, or polychoric correlations. We will use fa.parallel() from the {psych} package:

Code
fa.parallel(df_scale)

Parallel analysis suggests that the number of factors =  8  and the number of components =  6 

Parallel analysis suggests extracting 8 factors (or 6 principal components), so we will ask for 8 factors in the next step.

Factor analysis using fa() method

The fa() function of the {psych} package performs exploratory factor analysis using MinRes (minimum residual) as well as the following factoring methods (fm):

  • fm="minres" will do a minimum residual solution, as will fm="uls". Both of these use a first derivative.

  • fm="ols" differs very slightly from "minres" in that it minimizes the entire residual matrix using an OLS procedure but uses the empirical first derivative. This will be slower.

  • fm="wls" will do a weighted least squares (WLS) solution.

  • fm="gls" does a generalized weighted least squares (GLS) solution.

  • fm="pa" will do the principal factor solution.

  • fm="ml" will do a maximum likelihood factor analysis.

  • fm="minchi" will minimize the sample-size-weighted chi square when treating pairwise correlations with different numbers of subjects per pair.

  • fm="minrank" will do a minimum rank factor analysis.

  • fm="alpha" will do alpha factor analysis as described in Kaiser and Coffey (1965).

We use the fa() function to perform factor analysis on the scaled data. We specify the number of factors to extract (nfactors = 8), the factoring method (fm = "pa" for principal axis factoring), and the rotation method (rotate = "varimax") to simplify the factor loadings and make them easier to interpret. Other popular rotation methods include "oblimin" and "promax".

Code
fa_result <- fa(r=df_scale,  
              nfactors = 8, 
              fm="pa", # principal axis
              rotate="varimax") 
#to show the loadings sorted by absolute value
print(fa_result, sort=TRUE)
Factor Analysis using method =  pa
Call: fa(r = df_scale, nfactors = 8, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
   item   PA1   PA6   PA3  PA4   PA2  PA7   PA5   PA8    h2       u2 com
Fe   10  0.79  0.20  0.28 0.10  0.03 0.13  0.09  0.21 0.815  0.18500 1.7
Co    6  0.77  0.54  0.10 0.05  0.00 0.08  0.00  0.00 0.914  0.08589 1.9
Mn   14  0.56  0.09  0.14 0.13  0.03 0.06  0.16 -0.13 0.409  0.59112 1.6
Cu    9  0.13  0.02  0.01 0.09  0.01 0.02  0.05  0.09 0.036  0.96436 3.1
Ni   16  0.14  0.99 -0.01 0.03  0.01 0.01  0.02  0.05 1.000 -0.00027 1.0
Cr    7  0.19  0.87 -0.01 0.03  0.00 0.02  0.01  0.00 0.789  0.21070 1.1
Mg   13  0.29  0.43  0.10 0.01  0.01 0.32 -0.06  0.41 0.547  0.45318 3.8
Rb   18 -0.04  0.02  0.81 0.07  0.00 0.43  0.05 -0.12 0.870  0.12975 1.6
Al    1  0.54  0.06  0.68 0.08 -0.01 0.18 -0.03  0.34 0.903  0.09664 2.7
Ba    3  0.15  0.00  0.67 0.06  0.04 0.06  0.00  0.16 0.508  0.49227 1.3
Ce    5  0.29 -0.02  0.56 0.04 -0.01 0.21  0.07 -0.21 0.488  0.51234 2.2
Pb   17  0.02  0.00  0.00 0.91  0.01 0.05  0.02 -0.05 0.827  0.17257 1.0
Zn   22  0.33  0.05  0.18 0.66  0.06 0.16  0.19  0.13 0.661  0.33907 2.1
Hg   11 -0.01  0.01 -0.06 0.61  0.36 0.05  0.01  0.02 0.505  0.49489 1.7
Cd    4 -0.01  0.01  0.05 0.34  0.03 0.06  0.22  0.06 0.178  0.82236 1.9
Sn   21  0.06  0.01  0.03 0.19  0.01 0.02  0.03 -0.01 0.043  0.95747 1.3
Sb   19 -0.01  0.00  0.02 0.09  0.92 0.03 -0.05 -0.01 0.861  0.13914 1.0
As    2  0.07  0.01  0.04 0.15  0.90 0.07  0.21  0.02 0.897  0.10275 1.2
Li   12  0.24  0.04  0.27 0.11  0.05 0.76  0.09  0.10 0.749  0.25121 1.6
Cs    8  0.02  0.03  0.19 0.11  0.05 0.42  0.04  0.00 0.229  0.77081 1.6
Mo   15  0.09  0.00  0.10 0.11  0.13 0.02  0.82  0.10 0.734  0.26614 1.2
Se   20  0.14  0.00 -0.05 0.20 -0.01 0.07  0.43 -0.10 0.264  0.73564 1.9

                       PA1  PA6  PA3  PA4  PA2  PA7  PA5  PA8
SS loadings           2.28 2.26 2.17 1.94 1.83 1.20 1.06 0.49
Proportion Var        0.10 0.10 0.10 0.09 0.08 0.05 0.05 0.02
Cumulative Var        0.10 0.21 0.31 0.39 0.48 0.53 0.58 0.60
Proportion Explained  0.17 0.17 0.16 0.15 0.14 0.09 0.08 0.04
Cumulative Proportion 0.17 0.34 0.51 0.65 0.79 0.88 0.96 1.00

Mean item complexity =  1.8
Test of the hypothesis that 8 factors are sufficient.

df null model =  231  with the objective function =  11.15 with Chi Square =  53549.84
df of  the model are 83  and the objective function was  0.4 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic n.obs is  4813 with the empirical chi square  510.47  with prob <  9.8e-63 
The total n.obs was  4813  with Likelihood Chi Square =  1928.04  with prob <  0 

Tucker Lewis Index of factoring reliability =  0.904
RMSEA index =  0.068  and the 90 % confidence intervals are  0.065 0.071
BIC =  1224.28
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1 PA6  PA3  PA4  PA2  PA7
Correlation of (regression) scores with factors   0.95   1 0.93 0.93 0.96 0.84
Multiple R square of scores with factors          0.90   1 0.86 0.87 0.93 0.71
Minimum correlation of possible factor scores     0.80   1 0.72 0.74 0.86 0.43
                                                   PA5  PA8
Correlation of (regression) scores with factors   0.87 0.78
Multiple R square of scores with factors          0.75 0.61
Minimum correlation of possible factor scores     0.51 0.22
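
The printed summary is long; for downstream use, the loadings can be pulled directly from the returned fa object. A minimal sketch (the communalities recomputed by hand should match the h2 column in the printed output):

Code
# Pull the loading matrix out of the fa object and recompute the communalities by hand
load_mat <- unclass(fa_result$loadings)   # plain numeric matrix of loadings
round(load_mat, 2)
round(rowSums(load_mat^2), 2)             # communalities (h2)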

Plot Factor Loading Matrices

Factor analysis or principal components analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented as a table of loadings or graphically, where all loadings with an absolute value greater than some cut point are represented as an edge (path). The fa.diagram() function from the {psych} package uses the various diagram functions to draw such a diagram.

Code
fa.diagram(fa_result)

Factor analysis using the factanal() method

The factanal() function from the {stats} package performs maximum-likelihood factor analysis on a covariance matrix or data matrix. We will apply a rotation to the factors to reduce the complexity of the factor loadings and make them easier to interpret; here we’ll use the varimax option, which maximizes the sum of the variances of the squared loadings. Then we’ll print the model to see the results (sorted so the loadings are ordered). The output may be a bit long.

Code
factanal_results <- factanal(df_scale, 
                     factors = 8, 
                     scores = "Bartlett", 
                     rotation = "varimax")
print(factanal_results, sort = TRUE)

Call:
factanal(x = df_scale, factors = 8, scores = "Bartlett", rotation = "varimax")

Uniquenesses:
   Al    As    Ba    Cd    Ce    Co    Cr    Cs    Cu    Fe    Hg    Li    Mg 
0.015 0.107 0.500 0.836 0.516 0.096 0.199 0.755 0.968 0.175 0.481 0.277 0.518 
   Mn    Mo    Ni    Pb    Rb    Sb    Se    Sn    Zn 
0.565 0.203 0.005 0.189 0.154 0.125 0.786 0.942 0.355 

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
Co  0.766   0.539   0.116                                         
Fe  0.801   0.203   0.290                   0.117           0.155 
Mn  0.576           0.127   0.114                   0.133  -0.208 
Cr  0.184   0.875                                                 
Ni  0.140   0.987                                                 
Al  0.549           0.713                   0.168           0.366 
Ba  0.141           0.673                                   0.135 
Ce  0.256           0.553                   0.223          -0.236 
Rb                  0.805                   0.407          -0.154 
Hg                          0.617   0.360                         
Pb                          0.897                                 
Zn  0.326           0.171   0.654           0.179   0.190         
As                          0.156   0.899           0.223         
Sb                                  0.929                         
Li  0.235           0.294   0.110           0.739           0.104 
Mo  0.102                   0.112   0.128           0.864         
Cd                          0.337                   0.207         
Cs                  0.206   0.103           0.429                 
Cu  0.134                                                         
Mg  0.321   0.437                           0.340           0.254 
Se  0.119                   0.200                   0.392         
Sn                          0.228                                 

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
SS loadings      2.319   2.283   2.214   1.937   1.828   1.170   1.081   0.401
Proportion Var   0.105   0.104   0.101   0.088   0.083   0.053   0.049   0.018
Cumulative Var   0.105   0.209   0.310   0.398   0.481   0.534   0.583   0.601

Test of the hypothesis that 8 factors are sufficient.
The chi square statistic is 1588.01 on 83 degrees of freedom.
The p-value is 8.27e-277 

Since the p-value is well below 0.05 (and 0.01), we reject the hypothesis of perfect fit, meaning that we could probably try a different method or a different number of factors.
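
One way to act on this is to refit the model with several numbers of factors and compare the chi-square p-values; a rough, illustrative sketch (the loop can be slow on large data):

Code
# Refit factanal with different numbers of factors and compare the test p-values
for (k in 6:9) {
  fit_k <- factanal(df_scale, factors = k, rotation = "varimax")
  cat("factors =", k, " p-value =", signif(fit_k$PVAL, 3), "\n")
}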

Extract Loadings and Plot Them

We’ll extract the results from the factor analysis model using the tidy() function from the {broom} package, convert them to long format, and create a weight matrix.

Code
fa_weight_matrix <- broom::tidy(factanal_results) %>% 
  pivot_longer(starts_with("fl"), names_to = "factor", values_to = "loading")

Plot loading

We can now plot the factor loadings using a heatmap. We’ll use the geom_tile() function from {ggplot2} to create a heatmap of the factor loadings. We’ll use the {scico} package to create a color palette for the heatmap.

Code
fa_loading_plot <- ggplot(fa_weight_matrix, aes(x = factor, y = variable, fill = loading)) + 
  geom_tile() + 
  labs(title = "FA loadings",
       x = NULL,
       y = NULL) + 
  scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) + 
  coord_fixed(ratio = 1/2)
print(fa_loading_plot)

Correlation among Bartlett’s scores

We’ll take the Bartlett scores, correlate them with each other, and visualize the resulting correlation matrix in the same way as before.

Code
factanal_results$scores %>% 
  cor() %>% 
  data.frame() %>% 
  rownames_to_column("factor_x") %>% 
  pivot_longer(cols = -factor_x, names_to = "factor_y", values_to = "correlation") %>% 
  ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) + 
  geom_tile() +
  geom_text(aes(label = round(correlation,4)), color = "grey") +
  labs(title = "Correlation between FA scores",
       x = NULL,
       y = NULL) +
  scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) + 
  coord_equal()

We can see little correlation overall, but certainly a few non-zero correlations as well.
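
To quantify this, a quick check of the largest absolute off-diagonal correlation among the Bartlett factor scores (using the scores already computed above):

Code
# Largest absolute correlation between any two different factor scores
score_cor <- cor(factanal_results$scores)
round(max(abs(score_cor[upper.tri(score_cor)])), 3)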

Confirmatory Factor Analysis (CFA)

Confirmatory Factor Analysis (CFA) is a statistical technique used to test whether a hypothesized factor structure fits the observed data. Unlike Exploratory Factor Analysis (EFA), where the structure of factors is not predefined, CFA begins with a specific hypothesis about the relationships between observed variables and underlying latent factors based on theory or prior research.

Define Models

Code
model <- '
  Factor1 =~ As +  Cd + Cr +  Pb  
  Factor2 =~ Fe + Mn +  Al+ Mg' 

Fit the CFA Model

We can fit the CFA model using the cfa() function from the {lavaan} package. The cfa() function takes the model specification and the data as input and returns an object that contains the results of the model fit.

Code
# Fit the CFA model
fit <- cfa(model, data = df_scale) 
# Summarize the results
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 38 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        17

  Number of observations                          4813

Model Test User Model:
                                                      
  Test statistic                              1411.087
  Degrees of freedom                                19
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                              8489.377
  Degrees of freedom                                28
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.835
  Tucker-Lewis Index (TLI)                       0.758

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -51091.664
  Loglikelihood unrestricted model (H1)     -50386.121
                                                      
  Akaike (AIC)                              102217.329
  Bayesian (BIC)                            102327.473
  Sample-size adjusted Bayesian (SABIC)     102273.453

Root Mean Square Error of Approximation:

  RMSEA                                          0.123
  90 Percent confidence interval - lower         0.118
  90 Percent confidence interval - upper         0.129
  P-value H_0: RMSEA <= 0.050                    0.000
  P-value H_0: RMSEA >= 0.080                    1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.095

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 =~                                                            
    As                1.000                               0.297    0.297
    Cd                1.543    0.138   11.171    0.000    0.459    0.459
    Cr                0.531    0.080    6.627    0.000    0.158    0.158
    Pb                1.857    0.173   10.722    0.000    0.552    0.552
  Factor2 =~                                                            
    Fe                1.000                               0.969    0.969
    Mn                0.508    0.015   33.663    0.000    0.492    0.492
    Al                0.792    0.015   51.685    0.000    0.768    0.768
    Mg                0.503    0.015   33.344    0.000    0.488    0.488

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 ~~                                                            
    Factor2           0.079    0.009    9.208    0.000    0.275    0.275

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .As                0.911    0.021   43.513    0.000    0.911    0.912
   .Cd                0.789    0.025   31.806    0.000    0.789    0.789
   .Cr                0.975    0.020   47.649    0.000    0.975    0.975
   .Pb                0.695    0.030   23.043    0.000    0.695    0.695
   .Fe                0.060    0.014    4.264    0.000    0.060    0.060
   .Mn                0.757    0.016   47.367    0.000    0.757    0.758
   .Al                0.410    0.012   33.588    0.000    0.410    0.410
   .Mg                0.762    0.016   47.427    0.000    0.762    0.762
    Factor1           0.088    0.013    6.910    0.000    1.000    1.000
    Factor2           0.940    0.025   37.995    0.000    1.000    1.000

Model Fit Indices

These metrics tell you how well the hypothesized factor structure fits the observed data; a sketch for extracting them from the fitted lavaan object follows this list.

  • Chi-Square Test:

    • Null hypothesis: The model fits the data perfectly.

    • A small p-value indicates a lack of fit (often ignored for large datasets).

  • Comparative Fit Index (CFI):

    • Acceptable: >0.90.

    • Excellent: >0.95.

  • Root Mean Square Error of Approximation (RMSEA):

    • Acceptable: <0.08.

    • Excellent: <0.05.

  • Standardized Root Mean Square Residual (SRMR):

    • Acceptable: <0.08.
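
These indices can be pulled from the fitted lavaan object with the fitMeasures() function; a minimal sketch:

Code
# Extract selected fit indices from the fitted lavaan model
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))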

Summary and Conclusion

In this tutorial, we have discussed the concept of Factor Analysis, a statistical method used to identify the underlying relationships between observed variables. Factor Analysis is a useful tool for reducing the dimensionality of data and uncovering latent structures that explain the observed correlations among variables. We have also demonstrated how to perform Factor Analysis in R using the {psych} and {lavaan} packages. Factor Analysis can be a powerful tool for exploring complex datasets and uncovering hidden patterns and relationships that may not be apparent from the raw data.
