Factor Analysis

Factor Analysis is used to identify a smaller number of unobserved variables, called factors, from a larger set of observed variables. However, unlike PCA, Factor Analysis explicitly models the possibility that the observed variables are correlated with each other due to shared underlying factors. Factor Analysis assumes that the observed variables are linearly related to the unobserved factors, and seeks to identify the factors that account for the observed correlations among the variables.
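Concretely, the common factor model expresses each standardized observed variable as a linear combination of a small number of common factors plus a unique error term. For variable $x_i$ and $k$ factors,

$$x_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \cdots + \lambda_{ik} F_k + \varepsilon_i,$$

where the loadings $\lambda_{ij}$ are what the analysis estimates.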

Data

In this exercise we will use the following data set, the same USA geochemical data used in the PCA exercise. It contains heavy metal concentrations in soils across the USA.

usa_geochemical_FA.csv

Code
library(tidyverse)
library(dlookr)
# define data folder
dataFolder<-"E:/Dropbox/GitHub/Data/USA/"
# Load data
mf<-read_csv(paste0(dataFolder, "usa_geochemical_FA.csv"))%>%
    glimpse()
Rows: 4,813
Columns: 28
$ A_LabID   <chr> "C-320500", "C-299963", "C-319572", "C-355112", "C-319622", …
$ SiteID    <dbl> 847, 7209, 6811, 4072, 4315, 12782, 10830, 252, 1964, 6920, …
$ StateID   <chr> "NV", "ME", "CO", "WI", "CO", "MT", "WA", "PA", "IN", "MO", …
$ Latitude  <dbl> 38.8486, 44.2417, 37.8284, 42.7492, 39.2340, 46.3279, 46.499…
$ Longitude <dbl> -117.5824, -70.4833, -107.6216, -90.2205, -106.2517, -112.90…
$ LandCover <chr> "Shrubland", "Forested Upland", "Shrubland", "Forested Uplan…
$ Al        <dbl> 7.43, 5.35, 5.82, 4.61, 6.24, 3.76, 6.99, 7.53, 9.06, 3.02, …
$ As        <dbl> 195.0, 2.7, 18.4, 7.7, 25.1, 565.0, 4.2, 10.5, 94.5, 9.0, 5.…
$ Ba        <dbl> 496, 352, 686, 610, 1030, 399, 706, 466, 414, 317, 372, 654,…
$ Cd        <dbl> 16.1, 0.5, 5.1, 0.6, 3.2, 2.6, 0.6, 0.2, 1.4, 1.4, 1.1, 1.2,…
$ Ce        <dbl> 42.0, 43.0, 67.9, 51.9, 48.9, 28.0, 67.9, 75.5, 104.0, 47.6,…
$ Co        <dbl> 11.1, 5.2, 14.4, 9.5, 6.8, 8.9, 13.3, 21.1, 56.8, 11.7, 4.8,…
$ Cr        <dbl> 42, 63, 24, 38, 9, 20, 48, 81, 217, 56, 24, 16, 59, 13, 54, …
$ Cs        <dbl> 26.0, 2.5, 12.0, 2.5, 2.5, 5.0, 5.0, 2.5, 7.0, 2.5, 2.5, 2.5…
$ Cu        <dbl> 87.9, 16.5, 141.0, 35.4, 147.0, 232.0, 34.7, 79.8, 98.7, 13.…
$ Fe        <dbl> 4.17, 1.59, 3.63, 2.26, 7.39, 13.00, 3.44, 5.29, 8.28, 2.40,…
$ Hg        <dbl> 8.24, 0.04, 0.07, 0.04, 1.32, 0.44, 0.03, 0.08, 0.32, 0.06, …
$ Li        <dbl> 128, 23, 51, 19, 11, 28, 24, 37, 71, 16, 14, 57, 29, 29, 23,…
$ Mg        <dbl> 1.62, 0.31, 0.77, 0.57, 0.34, 0.62, 0.80, 0.65, 0.39, 0.16, …
$ Mn        <dbl> 740, 635, 4040, 719, 812, 6580, 713, 879, 480, 894, 430, 233…
$ Mo        <dbl> 9.16, 1.06, 9.41, 0.87, 4.23, 70.30, 0.72, 2.10, 5.02, 1.33,…
$ Ni        <dbl> 25.4, 34.8, 9.8, 20.3, 4.0, 7.2, 21.1, 31.9, 217.0, 11.1, 9.…
$ Pb        <dbl> 2200.0, 34.0, 865.0, 49.1, 1510.0, 62.5, 83.0, 268.0, 259.0,…
$ Rb        <dbl> 110.0, 79.0, 143.0, 72.5, 89.1, 46.6, 64.7, 88.6, 69.8, 51.6…
$ Sb        <dbl> 32.30, 1.60, 6.04, 1.21, 6.26, 20.80, 0.77, 1.19, 9.84, 1.13…
$ Se        <dbl> 3.1, 0.2, 0.4, 0.5, 0.7, 0.4, 0.1, 0.9, 1.3, 0.6, 0.4, 0.1, …
$ Sn        <dbl> 3.2, 2.5, 2.1, 1.3, 66.9, 32.6, 2.0, 2.5, 82.4, 1.8, 9.7, 5.…
$ Zn        <dbl> 2130, 1440, 977, 943, 770, 689, 612, 541, 536, 518, 491, 491…

First we will create a subset of the numeric variables from the dataset:

Code
df<- mf[,7:28] 

Exploratory Factor Analysis

Scale the data

In R, scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix.

Code
df_scale <- as.data.frame(scale(df, center=TRUE, scale=TRUE)) 
describe(df_scale)
# A tibble: 22 × 26
   described_variables     n    na      mean    sd se_mean   IQR skewness
   <chr>               <int> <int>     <dbl> <dbl>   <dbl> <dbl>    <dbl>
 1 Al                   4813     0  7.45e-17     1  0.0144 1.35    0.0419
 2 As                   4813     0 -9.31e-18     1  0.0144 0.239  42.9   
 3 Ba                   4813     0 -1.49e-16     1  0.0144 1.23    1.89  
 4 Cd                   4813     0 -3.86e-17     1  0.0144 0.259  47.1   
 5 Ce                   4813     0  8.50e-17     1  0.0144 1.07    2.46  
 6 Co                   4813     0 -2.05e-17     1  0.0144 0.819   6.03  
 7 Cr                   4813     0 -4.02e-17     1  0.0144 0.269  30.7   
 8 Cs                   4813     0  7.63e-17     1  0.0144 0      14.8   
 9 Cu                   4813     0  1.60e-17     1  0.0144 0.168  63.0   
10 Fe                   4813     0 -8.14e-17     1  0.0144 0.996   1.96  
# ℹ 12 more rows
# ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
#   p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
#   p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
#   p99 <dbl>, p100 <dbl>
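As a quick sanity check (a minimal illustration using the Al column), each scaled column is simply the original column minus its mean, divided by its standard deviation:

Code
# manual z-score of Al should match the scaled column
all.equal(df_scale$Al, (df$Al - mean(df$Al)) / sd(df$Al))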

Correlation Analysis

We should take a look at the correlations among the heavy metals to determine whether factor analysis is appropriate. We will compute the correlation matrix with cor() and visualize it with the corrplot() function from the corrplot package.

install.packages("corrplot")

Code
#| fig.width: 6
#| fig.height: 6

library(corrplot)
corr_matrix <- cor(df_scale)
corrplot(corr_matrix, method="number")

Kaiser-Meyer-Olkin (KMO)

Kaiser-Meyer-Olkin (KMO) is a statistical measure used to assess the sampling adequacy of a set of variables for factor analysis. It is named after the researchers Kaiser, Meyer, and Olkin.

The KMO statistic is a value between 0 and 1, with higher values indicating better sampling adequacy. A KMO value of 0.5 or lower is considered unacceptable, while a value of 0.6 to 0.7 is marginal, and values above 0.8 are generally considered good.

The KMO statistic is calculated by comparing the magnitude of the correlation coefficients between pairs of variables to the magnitude of the partial correlations between those variables, controlling for all other variables in the analysis. If the correlation coefficients are much larger than the partial correlations, then the KMO statistic will be high, indicating that the variables are suitable for factor analysis.
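To make this concrete, here is a minimal sketch of the overall KMO computed by hand, assuming the partial correlations are taken from the inverse of the correlation matrix (the anti-image approach):

Code
R <- cor(df_scale)
P <- -cov2cor(solve(R)) # partial correlations, controlling for all other variables
diag(R) <- 0            # keep only off-diagonal elements
diag(P) <- 0
sum(R^2) / (sum(R^2) + sum(P^2)) # overall KMO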

The KMO statistic is widely used in the field of psychology and social sciences to assess the quality of data for factor analysis. It is an important tool for researchers to evaluate the suitability of their data for factor analysis and to make decisions about how to proceed with their analysis.

In R, you can compute the Kaiser-Meyer-Olkin (KMO) statistic using the psych package. Here’s an example:

install.packages("psych")

Code
library(psych)

Attaching package: 'psych'
The following object is masked from 'package:dlookr':

    describe
The following objects are masked from 'package:ggplot2':

    %+%, alpha
Code
# compute the KMO statistic
KMO(r=cor(df_scale))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(df_scale))
Overall MSA =  0.74
MSA for each item = 
  Al   As   Ba   Cd   Ce   Co   Cr   Cs   Cu   Fe   Hg   Li   Mg   Mn   Mo   Ni 
0.79 0.54 0.80 0.87 0.84 0.82 0.71 0.80 0.84 0.77 0.78 0.83 0.84 0.82 0.57 0.66 
  Pb   Rb   Sb   Se   Sn   Zn 
0.66 0.75 0.50 0.65 0.57 0.84 

According to Kaiser’s (1974) guidelines, a suggested cutoff for determining the factorability of the sample data is KMO ≥ 0.60. The overall KMO here is 0.74, indicating that, based on this test, we can probably conduct a factor analysis.

Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity is a statistical test used to determine whether a correlation matrix is significantly different from an identity matrix. It is commonly used in factor analysis to assess whether the variables in a dataset are suitable for dimensionality reduction.

The null hypothesis of Bartlett’s test is that the variables are uncorrelated in the population, while the alternative hypothesis is that they are correlated. If the test statistic is large and the p-value is small (e.g., less than 0.05), then we reject the null hypothesis and conclude that the correlation matrix is significantly different from an identity matrix, and therefore the variables are correlated.
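For reference, here is a minimal sketch of the test statistic computed by hand from the determinant of the correlation matrix, using Bartlett’s correction factor:

Code
R <- cor(df_scale)
n <- nrow(df_scale); p <- ncol(df_scale)
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R)) # Bartlett's chi-square
df_b  <- p * (p - 1) / 2                          # degrees of freedom
pchisq(chisq, df_b, lower.tail = FALSE)           # p-value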

In R, you can perform Bartlett’s test of sphericity using the psych package. Here’s an example:

Code
cortest.bartlett(df_scale)
R was not square, finding R from data
$chisq
[1] 53549.84

$p.value
[1] 0

$df
[1] 231

Small values of the significance level (p < 0.05) indicate that a factor analysis may be useful with our data.

Determine Number of Factors to Extract

Since we are seeking latent structure within a set of data, we will only be interested in factors that explain a substantial proportion of variation within the data.

One way to determine the number of factors or components in a data matrix or a correlation matrix is to examine the “scree” plot of the successive eigenvalues. Sharp breaks in the plot suggest the appropriate number of components or factors to extract. First we use eigen() to compute the eigenvalues and eigenvectors, then plot the successive eigenvalues for a scree test using the scree() function:

Code
ev <- eigen(cor(df_scale)) # get eigenvalues
ev$values
 [1] 5.16673191 2.65225466 2.29148907 1.59301574 1.27836437 1.11969756
 [7] 0.99261922 0.98195391 0.88608815 0.78823343 0.72787642 0.62310820
[13] 0.58619098 0.51085986 0.40663713 0.35703075 0.29902045 0.25368764
[19] 0.15785633 0.13227153 0.10380504 0.09120766
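A quick complementary check is Kaiser’s criterion, which retains only factors with eigenvalues greater than 1:

Code
sum(ev$values > 1) # number of eigenvalues above 1 (Kaiser's criterion)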
Code
#| fig.width: 5
#| fig.height: 5
scree(df_scale, pc=FALSE)  # Use pc=FALSE for factor analysis

Parallel analysis is an alternative technique that compares the scree of factors of the observed data with that of a random data matrix of the same size as the original. This may be done for continuous, dichotomous, or polytomous data using Pearson, tetrachoric, or polychoric correlations. We will use fa.parallel() from the psych package:

Code
#| fig.width: 5
#| fig.height: 5
 
fa.parallel(df_scale)

Parallel analysis suggests that the number of factors =  8  and the number of components =  6 

Factor analysis using fa method

The fa() function of the psych package performs exploratory factor analysis using MinRes (minimum residual) as well as the following factoring methods (fm):

fm=“minres” will do a minimum residual as will fm=“uls”. Both of these use a first derivative.

fm=“ols” differs very slightly from “minres” in that it minimizes the entire residual matrix using an OLS procedure but uses the empirical first derivative. This will be slower.

fm=“wls” will do a weighted least squares (WLS) solution,

fm=“gls” does a generalized weighted least squares (GLS),

fm=“pa” will do the principal factor solution,

fm=“ml” will do a maximum likelihood factor analysis.

fm=“minchi” will minimize the sample size weighted chi square when treating pairwise correlations with different number of subjects per pair.

fm=“minrank” will do a minimum rank factor analysis.

fm=“alpha” will do alpha factor analysis as described in Kaiser and Coffey (1965).

Code
fa_result <- fa(r=df_scale,  
              nfactors = 8, 
              fm="pa", # principal axis
              rotate="varimax") 

In the above code, we performed a factor analysis with 8 factors and used the varimax rotation method to obtain simpler, easier-to-interpret factors. The “nfactors” parameter specifies the number of factors we want to extract from the data; you can set this value based on the diagnostics above (scree plot, parallel analysis) or on theoretical considerations. The “rotate” parameter specifies the rotation method to be used. Other popular rotation methods include “oblimin” and “promax”.
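If you suspect the factors themselves are correlated, an oblique rotation may be more appropriate. A hedged example with oblimin (this requires the GPArotation package; the Phi element of the result holds the estimated factor correlations):

Code
# oblique rotation allows the factors to correlate
fa_oblique <- fa(r = df_scale, nfactors = 8, fm = "pa", rotate = "oblimin")
round(fa_oblique$Phi, 2) # correlations among the oblique factors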

Code
#to show the loadings sorted by absolute value
print(fa_result, sort=TRUE)
Factor Analysis using method =  pa
Call: fa(r = df_scale, nfactors = 8, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
   item   PA1   PA6   PA3  PA4   PA2  PA7   PA5   PA8    h2       u2 com
Fe   10  0.79  0.20  0.28 0.10  0.03 0.13  0.09  0.21 0.815  0.18500 1.7
Co    6  0.77  0.54  0.10 0.05  0.00 0.08  0.00  0.00 0.914  0.08589 1.9
Mn   14  0.56  0.09  0.14 0.13  0.03 0.06  0.16 -0.13 0.409  0.59112 1.6
Cu    9  0.13  0.02  0.01 0.09  0.01 0.02  0.05  0.09 0.036  0.96436 3.1
Ni   16  0.14  0.99 -0.01 0.03  0.01 0.01  0.02  0.05 1.000 -0.00027 1.0
Cr    7  0.19  0.87 -0.01 0.03  0.00 0.02  0.01  0.00 0.789  0.21070 1.1
Mg   13  0.29  0.43  0.10 0.01  0.01 0.32 -0.06  0.41 0.547  0.45318 3.8
Rb   18 -0.04  0.02  0.81 0.07  0.00 0.43  0.05 -0.12 0.870  0.12975 1.6
Al    1  0.54  0.06  0.68 0.08 -0.01 0.18 -0.03  0.34 0.903  0.09664 2.7
Ba    3  0.15  0.00  0.67 0.06  0.04 0.06  0.00  0.16 0.508  0.49227 1.3
Ce    5  0.29 -0.02  0.56 0.04 -0.01 0.21  0.07 -0.21 0.488  0.51234 2.2
Pb   17  0.02  0.00  0.00 0.91  0.01 0.05  0.02 -0.05 0.827  0.17257 1.0
Zn   22  0.33  0.05  0.18 0.66  0.06 0.16  0.19  0.13 0.661  0.33907 2.1
Hg   11 -0.01  0.01 -0.06 0.61  0.36 0.05  0.01  0.02 0.505  0.49489 1.7
Cd    4 -0.01  0.01  0.05 0.34  0.03 0.06  0.22  0.06 0.178  0.82236 1.9
Sn   21  0.06  0.01  0.03 0.19  0.01 0.02  0.03 -0.01 0.043  0.95747 1.3
Sb   19 -0.01  0.00  0.02 0.09  0.92 0.03 -0.05 -0.01 0.861  0.13914 1.0
As    2  0.07  0.01  0.04 0.15  0.90 0.07  0.21  0.02 0.897  0.10275 1.2
Li   12  0.24  0.04  0.27 0.11  0.05 0.76  0.09  0.10 0.749  0.25121 1.6
Cs    8  0.02  0.03  0.19 0.11  0.05 0.42  0.04  0.00 0.229  0.77081 1.6
Mo   15  0.09  0.00  0.10 0.11  0.13 0.02  0.82  0.10 0.734  0.26614 1.2
Se   20  0.14  0.00 -0.05 0.20 -0.01 0.07  0.43 -0.10 0.264  0.73564 1.9

                       PA1  PA6  PA3  PA4  PA2  PA7  PA5  PA8
SS loadings           2.28 2.26 2.17 1.94 1.83 1.20 1.06 0.49
Proportion Var        0.10 0.10 0.10 0.09 0.08 0.05 0.05 0.02
Cumulative Var        0.10 0.21 0.31 0.39 0.48 0.53 0.58 0.60
Proportion Explained  0.17 0.17 0.16 0.15 0.14 0.09 0.08 0.04
Cumulative Proportion 0.17 0.34 0.51 0.65 0.79 0.88 0.96 1.00

Mean item complexity =  1.8
Test of the hypothesis that 8 factors are sufficient.

df null model =  231  with the objective function =  11.15 with Chi Square =  53549.84
df of  the model are 83  and the objective function was  0.4 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic n.obs is  4813 with the empirical chi square  510.47  with prob <  9.8e-63 
The total n.obs was  4813  with Likelihood Chi Square =  1928.04  with prob <  0 

Tucker Lewis Index of factoring reliability =  0.904
RMSEA index =  0.068  and the 90 % confidence intervals are  0.065 0.071
BIC =  1224.28
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1 PA6  PA3  PA4  PA2  PA7
Correlation of (regression) scores with factors   0.95   1 0.93 0.93 0.96 0.84
Multiple R square of scores with factors          0.90   1 0.86 0.87 0.93 0.71
Minimum correlation of possible factor scores     0.80   1 0.72 0.74 0.86 0.43
                                                   PA5  PA8
Correlation of (regression) scores with factors   0.87 0.78
Multiple R square of scores with factors          0.75 0.61
Minimum correlation of possible factor scores     0.51 0.22

Graph Factor Loading Matrices

Factor analysis or principal components analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented as a table of loadings or graphically, where all loadings with an absolute value > some cut point are represented as an edge (path). fa.diagram uses the various diagram functions to draw the diagram.

Code
#| fig.width: 5
#| fig.height: 5

fa.diagram(fa_result)
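The cut point and layout can be adjusted. For example (a hedged variation), raising cut hides weaker loadings, and simple = FALSE draws every loading above the cut rather than only the largest one per variable:

Code
fa.diagram(fa_result, cut = 0.4, simple = FALSE)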

Factor analysis using the factanal method

The base R function factanal() performs maximum-likelihood factor analysis on a covariance matrix or data matrix. We will apply a “rotation” to the factors to reduce the complexity of the factor loadings and make them easier to interpret; here we’ll use the varimax option, which maximizes the sum of the variances of the squared loadings. Then we’ll print the model to see the results (sorted so the loadings are ordered). The output may be a bit long.
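As an aside (a minimal sketch), the same varimax rotation can also be applied manually to an unrotated solution with stats::varimax():

Code
# fit without rotation, then rotate the loadings by hand
raw_fit <- factanal(df_scale, factors = 8, rotation = "none")
varimax(loadings(raw_fit))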

Code
factanal_results <- factanal(df_scale, 
                     factors = 8, 
                     scores = "Bartlett", 
                     rotation = "varimax")
print(factanal_results, sort = TRUE)

Call:
factanal(x = df_scale, factors = 8, scores = "Bartlett", rotation = "varimax")

Uniquenesses:
   Al    As    Ba    Cd    Ce    Co    Cr    Cs    Cu    Fe    Hg    Li    Mg 
0.015 0.107 0.500 0.836 0.516 0.096 0.199 0.755 0.968 0.175 0.481 0.277 0.518 
   Mn    Mo    Ni    Pb    Rb    Sb    Se    Sn    Zn 
0.565 0.203 0.005 0.189 0.154 0.125 0.786 0.942 0.355 

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
Co  0.766   0.539   0.116                                         
Fe  0.801   0.203   0.290                   0.117           0.155 
Mn  0.576           0.127   0.114                   0.133  -0.208 
Cr  0.184   0.875                                                 
Ni  0.140   0.987                                                 
Al  0.549           0.713                   0.168           0.366 
Ba  0.141           0.673                                   0.135 
Ce  0.256           0.553                   0.223          -0.236 
Rb                  0.805                   0.407          -0.154 
Hg                          0.617   0.360                         
Pb                          0.897                                 
Zn  0.326           0.171   0.654           0.179   0.190         
As                          0.156   0.899           0.223         
Sb                                  0.929                         
Li  0.235           0.294   0.110           0.739           0.104 
Mo  0.102                   0.112   0.128           0.864         
Cd                          0.337                   0.207         
Cs                  0.206   0.103           0.429                 
Cu  0.134                                                         
Mg  0.321   0.437                           0.340           0.254 
Se  0.119                   0.200                   0.392         
Sn                          0.228                                 

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
SS loadings      2.319   2.283   2.214   1.937   1.828   1.170   1.081   0.401
Proportion Var   0.105   0.104   0.101   0.088   0.083   0.053   0.049   0.018
Cumulative Var   0.105   0.209   0.310   0.398   0.481   0.534   0.583   0.601

Test of the hypothesis that 8 factors are sufficient.
The chi square statistic is 1588.01 on 83 degrees of freedom.
The p-value is 8.27e-277 

Since the p-value is below 0.05 (indeed below 0.01), we can reject the hypothesis of perfect fit, meaning that we could probably try a different method or a different number of factors.
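A quick, hedged way to explore alternatives is to loop over several factor counts and compare the fit p-values (factanal() stores the p-value in the PVAL element):

Code
for (k in 6:10) {
  fit <- factanal(df_scale, factors = k)
  cat(k, "factors: p-value =", format.pval(fit$PVAL), "\n")
}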

Extract Loadings and Plot Them

We’ll get the results from the factor analysis model using the tidy() function from the broom package, convert them to long format, and create a weight matrix.

Code
fa_weight_matrix <- broom::tidy(factanal_results) %>% 
  pivot_longer(starts_with("fl"), names_to = "factor", values_to = "loading")
Code
#| fig.width: 6
#| fig.height: 5

fa_loading_plot <- ggplot(fa_weight_matrix, aes(x = factor, y = variable, fill = loading)) + 
  geom_tile() + 
  labs(title = "FA loadings",
       x = NULL,
       y = NULL) + 
  scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) + 
  coord_fixed(ratio = 1/2)

print(fa_loading_plot)

Correlation among Bartlett’s scores

We’ll take the Bartlett scores, correlate them with each other, and create a correlation matrix as before.

Code
factanal_results$scores %>% 
  cor() %>% 
  data.frame() %>% 
  rownames_to_column("factor_x") %>% 
  pivot_longer(cols = -factor_x, names_to = "factor_y", values_to = "correlation") %>% 
  ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) + 
  geom_tile() +
  geom_text(aes(label = round(correlation,4)), color = "white") +
  labs(title = "Correlation between FA scores",
       x = NULL,
       y = NULL) +
  scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) + 
  coord_equal()

We can see that the factor scores are largely uncorrelated, though a few small non-zero correlations remain.

Exercise

  1. Create an R Markdown document (named homework_10.Rmd) in this project and complete all the tasks using the data shown below.

  2. Submit all code and output as an HTML document (homework_10.html) before next week’s class.

Required R-Package

tidyverse, factoextra, psych

Data

  1. bd_arsenic_data_FA

Download the data and save it in your project directory. Use read_csv() to load the data into your R session. For example:

df <- read_csv("bd_arsenic_data_FA.csv")

Tasks

  1. Scale the data

  2. Do correlation analysis and plot the correlation matrix

  3. Do Kaiser-Meyer-Olkin (KMO) test

  4. Determine Number of Factors to Extract

  5. Do Factor analysis using both fa() and factanal() methods

  6. Extract loading from factanal() method and plot them

Further Reading

  1. Factor analysis

  2. Factor analysis in R

  3. A Basic Comparison Between Factor Analysis, PCA, and ICA

  4. factoextra package: https://rpkgs.datanovia.com/factoextra/