Data Exploration with {skimr} in R

This tutorial shows how to use the {skimr} package for exploratory data analysis (EDA) in R. {skimr} summarizes every column of a data frame in a single call: it reports missing-value counts, summary statistics suited to each column type, and inline text histograms for numeric columns. This makes it easier to spot skewed distributions, outliers, and gaps in your data before further analysis.

Introduction

{skimr} provides a frictionless approach to summary statistics that conforms to the principle of least surprise. It displays summary statistics that the user can skim quickly to understand their data. It handles different data types and returns a skim_df object that can be included in a pipeline or displayed nicely for the human reader.
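Because the result is a data frame underneath, it can be passed on to other functions. The following is a minimal sketch, not part of the original tutorial: it uses the built-in iris data set instead of the tutorial's data, and it assumes the type.statistic column naming (for example numeric.mean) used by skimr version 2 and later.

```r
library(skimr)
library(dplyr)

# iris ships with R, so nothing needs to be downloaded
skimmed <- skim(iris)

# The result carries the skim_df class on top of a tibble
inherits(skimmed, "skim_df")

# Convert to a plain tibble, then use ordinary dplyr verbs
numeric_stats <- skimmed |>
  as_tibble() |>
  filter(skim_type == "numeric") |>
  select(skim_variable, numeric.mean, numeric.sd)

numeric_stats
```

Converting with as_tibble() first is a cautious choice here: it drops the special printing, so later verbs behave exactly as they would on any other tibble.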

Check and Install Required Packages

Code
# Packages required for this tutorial
packages <- c("tidyverse", "skimr")

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Verify Installation

Code
# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))
Installed packages:
tidyverse     skimr 
     TRUE      TRUE 

Load Packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Check Loaded Packages

Code
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])
Successfully loaded packages:
 [1] "package:skimr"     "package:lubridate" "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "package:stats"    
[13] "package:graphics"  "package:grDevices" "package:utils"    
[16] "package:datasets"  "package:methods"   "package:base"     

Data

The data set used in this exercise can be downloaded from my Dropbox or from my GitHub account.

We will use the read_csv() function from the readr package to import the data as a tibble.

Code
mf <- read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/gp_soil_data_na.csv")

Getting Started with skim()

The skim() function is the main function of the skimr package. It summarizes a data frame column by column, with statistics chosen to suit each column's type.

The output is grouped by data type (numeric, character, factor, etc.) and shows:

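  • Count of missing values and the complete rate

  • Mean, sd, and percentiles from the minimum (p0) to the maximum (p100) for numeric columns

  • Inline text histograms for numeric columns, printed in the console

  • Unique counts for character and factor columns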
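These pieces of the summary can also be read off the result programmatically. The sketch below uses a small made-up data frame (the names toy, soc, and land are hypothetical, chosen only for illustration) and assumes the skimr version 2 column names such as factor.n_unique.

```r
library(skimr)

# Hypothetical toy data: one numeric column with a gap, one factor column
toy <- data.frame(
  soc  = c(2.1, NA, 5.4, 8.7, 3.3),
  land = factor(c("Forest", "Shrubland", "Forest", "Herbaceous", "Forest"))
)

toy_skim <- skim(toy)

# Look rows up by variable name rather than by position
soc_row  <- toy_skim$skim_variable == "soc"
land_row <- toy_skim$skim_variable == "land"

toy_skim$n_missing[soc_row]         # number of missing values in soc
toy_skim$complete_rate[soc_row]     # share of non-missing values in soc
toy_skim$factor.n_unique[land_row]  # number of distinct levels in land
```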

Code
mf |>
  dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI) |>
  skimr::skim()
Data summary
Name dplyr::select(mf, NLCD, S…
Number of rows 471
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NLCD 0 1 6 18 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SOC 4 0.99 6.35 5.05 0.41 2.77 4.97 8.71 30.47 ▇▃▁▁▁
DEM 0 1.00 1631.11 767.69 258.65 1175.33 1592.89 2234.26 3618.02 ▅▇▇▅▁
MAP 0 1.00 499.37 206.94 193.91 352.77 432.63 590.43 1128.11 ▆▇▂▂▁
MAT 0 1.00 8.89 4.10 -0.59 5.88 9.17 12.44 16.87 ▃▅▇▇▅
NDVI 0 1.00 0.44 0.16 0.14 0.31 0.42 0.56 0.80 ▆▇▆▅▃

Skim only numeric columns

Code
# Keep the six variables of interest
df <- mf |> dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI)
# Skim only the numeric columns
skim(df[, sapply(df, is.numeric)])
Data summary
Name df[, sapply(df, is.numeri…
Number of rows 471
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SOC 4 0.99 6.35 5.05 0.41 2.77 4.97 8.71 30.47 ▇▃▁▁▁
DEM 0 1.00 1631.11 767.69 258.65 1175.33 1592.89 2234.26 3618.02 ▅▇▇▅▁
MAP 0 1.00 499.37 206.94 193.91 352.77 432.63 590.43 1128.11 ▆▇▂▂▁
MAT 0 1.00 8.89 4.10 -0.59 5.88 9.17 12.44 16.87 ▃▅▇▇▅
NDVI 0 1.00 0.44 0.16 0.14 0.31 0.42 0.56 0.80 ▆▇▆▅▃
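There are other ways to reach the same numeric-only summary. The sketch below shows two of them on the built-in iris data set (an assumption made so the example runs without the tutorial's file): selecting numeric columns with dplyr before skimming, and skimming everything and then extracting the numeric subtable with skimr's yank().

```r
library(skimr)
library(dplyr)

# Option 1: select the numeric columns first, then skim
numeric_cols <- iris |> select(where(is.numeric))
skim(numeric_cols)

# Option 2: skim all columns, then pull out the numeric subtable
numeric_part <- iris |>
  skim() |>
  yank("numeric")

numeric_part
```

The second option is convenient when the full summary has already been computed and only one type is needed for display.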

Skim individual variables or subsets

Code
# A single vector can be skimmed directly; its row is labelled "data" in the output
skim(df$SOC)
Data summary
Name df$SOC
Number of rows 471
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 4 0.99 6.35 5.05 0.41 2.77 4.97 8.71 30.47 ▇▃▁▁▁
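skim() also accepts column selections after the data argument, which keeps the variable name in the output instead of the generic "data" label. A minimal sketch on the built-in iris data set (used here as a stand-in for the tutorial's data):

```r
library(skimr)
library(dplyr)  # attached so that starts_with() is available

# Name one column directly
one_var <- skim(iris, Sepal.Length)

# Or use a selection helper to pick a subset of columns
petal_vars <- skim(iris, starts_with("Petal"))

one_var$skim_variable
petal_vars$skim_variable
```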

Grouped Data Summaries

Code
# Group by NLCD and summarize
df |> 
  group_by(NLCD) |> 
  skimr::skim() 
Data summary
Name group_by(df, NLCD)
Number of rows 471
Number of columns 6
_______________________
Column type frequency:
numeric 5
________________________
Group variables NLCD

Variable type: numeric

skim_variable NLCD n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SOC Forest 0 1.00 10.43 6.80 1.33 4.93 8.97 15.30 30.47 ▇▅▃▂▁
SOC Herbaceous 1 0.99 5.48 3.93 0.41 2.62 4.61 6.97 18.81 ▇▆▂▁▁
SOC Planted/Cultivated 0 1.00 6.70 3.60 0.46 4.00 6.23 9.42 16.34 ▅▇▆▃▁
SOC Shrubland 3 0.98 4.13 3.74 0.45 1.35 3.00 5.66 19.10 ▇▂▁▁▁
DEM Forest 0 1.00 2567.36 335.76 1623.32 2326.66 2535.11 2769.29 3471.05 ▁▆▇▃▂
DEM Herbaceous 0 1.00 1362.70 589.24 277.66 1203.03 1355.11 1627.98 3618.02 ▃▇▃▁▁
DEM Planted/Cultivated 0 1.00 804.34 489.41 258.65 404.91 732.54 1038.99 2324.50 ▇▅▃▁▁
DEM Shrubland 0 1.00 1889.98 432.52 1013.55 1558.02 1961.23 2183.34 2723.78 ▃▆▆▇▃
MAP Forest 0 1.00 593.09 156.64 330.95 469.25 563.04 684.75 1121.27 ▅▇▃▂▁
MAP Herbaceous 0 1.00 472.54 188.98 205.06 365.02 414.31 469.75 1128.11 ▇▇▁▂▁
MAP Planted/Cultivated 0 1.00 646.73 232.53 193.91 465.05 587.50 827.85 1126.82 ▂▇▃▅▂
MAP Shrubland 0 1.00 353.53 108.73 201.51 288.59 337.54 391.26 1109.41 ▇▃▁▁▁
MAT Forest 0 1.00 4.72 3.04 -0.34 2.23 4.63 6.89 14.78 ▇▇▇▂▁
MAT Herbaceous 0 1.00 10.08 2.86 -0.59 7.69 10.07 12.34 15.46 ▁▁▇▇▆
MAT Planted/Cultivated 0 1.00 11.84 2.06 1.45 11.07 12.44 12.94 14.60 ▁▁▁▅▇
MAT Shrubland 0 1.00 8.28 4.57 0.38 4.71 6.98 12.45 16.87 ▅▇▅▂▆
NDVI Forest 0 1.00 0.57 0.12 0.28 0.53 0.58 0.65 0.78 ▃▁▇▇▅
NDVI Herbaceous 0 1.00 0.40 0.13 0.16 0.31 0.38 0.44 0.73 ▃▇▅▂▂
NDVI Planted/Cultivated 0 1.00 0.53 0.12 0.32 0.45 0.51 0.59 0.80 ▅▇▆▂▃
NDVI Shrubland 0 1.00 0.31 0.13 0.14 0.21 0.27 0.38 0.69 ▇▅▂▂▁
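A grouped summary can be long, so it often helps to narrow it down to one variable and compare groups. The following sketch does this on the built-in iris data set, grouped by Species; the column name numeric.mean assumes skimr version 2 or later.

```r
library(skimr)
library(dplyr)

grouped <- iris |>
  group_by(Species) |>
  skim()

# Keep one variable and rank the groups by their mean
sepal_means <- grouped |>
  as_tibble() |>
  filter(skim_variable == "Sepal.Length") |>
  select(Species, numeric.mean) |>
  arrange(desc(numeric.mean))

sepal_means
```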

Customizing Skim Output

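skim_with() creates a customized version of skim(). You supply the summary functions for each data type (numeric, factor, character, and so on) as sfl() lists, and append = FALSE replaces the default statistics for that type instead of adding to them. Custom functions receive each column as is, so missing values are not removed automatically: in the output below, SOC shows NA for both statistics because median() and mad() are called without na.rm = TRUE on a column with four missing values.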

Code
my_skim <- skim_with(numeric = sfl(median, mad), append = FALSE)
my_skim(df)
Data summary
Name df
Number of rows 471
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NLCD 0 1 6 18 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate median mad
SOC 4 0.99 NA NA
DEM 0 1.00 1592.89 874.74
MAP 0 1.00 432.63 152.23
MAT 0 1.00 9.17 4.88
NDVI 0 1.00 0.42 0.19
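One way to avoid NA results for columns with missing values is to pass na.rm = TRUE through a formula inside sfl(). The sketch below is an illustration on a made-up data frame (toy and x are hypothetical names), not a rerun of the tutorial's data.

```r
library(skimr)

# Custom skimmer whose statistics ignore missing values
robust_skim <- skim_with(
  numeric = sfl(
    median = ~ median(., na.rm = TRUE),
    mad    = ~ mad(., na.rm = TRUE)
  ),
  append = FALSE
)

# Hypothetical toy data with one missing value
toy <- data.frame(x = c(1, 2, NA, 4, 100))

robust_result <- robust_skim(toy)
robust_result$numeric.median
robust_result$numeric.mad
```

The names given inside sfl() become the column suffixes, so the results appear as numeric.median and numeric.mad.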

Handling Different Data Types

Code
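# Create a small tibble with a Date column and a random numeric column
# (rnorm() is not seeded here, so the numeric summary changes on each run)
data <- tibble(
  date = seq(as.Date("2023-01-01"), by = "month", length.out = 6),
  value = rnorm(6)
)

# Skim the Date and numeric columns together
skim(data)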
Data summary
Name data
Number of rows 6
Number of columns 2
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2023-01-01 2023-06-01 2023-03-16 6

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
value 0 1 -1.03 0.79 -1.76 -1.6 -1.27 -0.65 0.28 ▇▇▁▃▃
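The same applies to other column types: each one gets its own block of statistics. As a small sketch on made-up data (the names mixed, flag, grp, and when are hypothetical), the skim_type column records which summary each variable received.

```r
library(skimr)

# Hypothetical data with a logical, a factor, and a Date column
mixed <- data.frame(
  flag = c(TRUE, FALSE, TRUE, NA),
  grp  = factor(c("a", "b", "a", "a")),
  when = as.Date("2023-01-01") + 0:3
)

mixed_skim <- skim(mixed)

# One row per variable, labelled with the type-specific summary it received
data.frame(
  variable = mixed_skim$skim_variable,
  type     = mixed_skim$skim_type
)
```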

Summary and Conclusion

This tutorial showed how to use the {skimr} package for exploratory data analysis (EDA) in R. We skimmed a whole data frame, restricted the summary to numeric columns or a single variable, summarized by group, customized the statistics with skim_with(), and skimmed a Date column. You should now be able to use {skimr} on your own datasets to check their structure, missing values, and distributions before further analysis.
