Data Exploration with {skimr} in R

This tutorial shows how to use the {skimr} package for exploratory data analysis (EDA) in R. {skimr} gives a compact overview of a dataset so that you can quickly understand its structure and characteristics. For each variable it reports summary statistics and, for numeric columns, a small inline histogram, which makes it easier to spot missing values, skewed distributions, and other potential issues in your data.

Introduction

{skimr} provides a frictionless approach to summary statistics that conforms to the principle of least surprise. It displays summary statistics that the user can skim quickly to understand their data. It handles different data types and returns a skim_df object that can be included in a pipeline or displayed nicely for the human reader.
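Because the result is a data frame, it can be processed like one. The following is a minimal sketch, using the built-in iris dataset rather than the tutorial data, of passing a skim() result into a dplyr pipeline. It assumes the skimr convention of prefixing statistic columns with their type (for example numeric.mean); the object names skimmed and numeric_stats are illustrative only.

```r
library(skimr)
library(dplyr)

# Skim the built-in iris data (illustrative; not the tutorial's soil data)
skimmed <- skim(iris)

# The result carries the skim_df class on top of a regular data frame
inherits(skimmed, "skim_df")

# Convert to a plain tibble and keep the mean and sd of each numeric column
numeric_stats <- skimmed |>
  tibble::as_tibble() |>
  filter(skim_type == "numeric") |>
  select(skim_variable, numeric.mean, numeric.sd)

numeric_stats
```

Converting to a tibble first is a design choice for this sketch: it makes clear that the later steps operate on ordinary columns rather than on the formatted summary.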

Check and Install Required Packages

Code

# Packages required for this tutorial
packages <- c("tidyverse", "skimr")

# Install any packages that are not already installed
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) install.packages(new_packages)

Verify Installation

Code

# Verify installation
cat("Installed packages:\n")

Installed packages:

Code

print(sapply(packages, requireNamespace, quietly = TRUE))

tidyverse     skimr 
     TRUE      TRUE

Load Packages

Code

# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Check Loaded Packages

Code

# Check loaded packages
cat("Successfully loaded packages:\n")

Successfully loaded packages:

Code

print(search()[grepl("package:", search())])

 [1] "package:skimr"     "package:lubridate" "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "package:stats"    
[13] "package:graphics"  "package:grDevices" "package:utils"    
[16] "package:datasets"  "package:methods"   "package:base"

Data

The data set used in this exercise can be downloaded from my Dropbox or from my GitHub account.

We will use the read_csv() function from the readr package to import the data as a tibble.

Code

mf <- read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/gp_soil_data_na.csv")

Getting Started with `skim()`

skim() is the main function provided by the skimr package. It generates a summary of the dataset, including key statistics for each variable.

The output is grouped by data type (numeric, character, factor, etc.) and shows:

Count of missing values and the complete rate
Mean, sd, min, max, and percentiles (for numeric columns)
Inline histograms printed in the console (for numeric columns)
Unique counts (for character and factor columns)

Code

mf |>
  dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI) |>
  skimr::skim()

Data summary
Name	dplyr::select(mf, NLCD, S…
Number of rows	471
Number of columns	6
_______________________
Column type frequency:
character	1
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
NLCD	0	1	6	18	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SOC	4	0.99	6.35	5.05	0.41	2.77	4.97	8.71	30.47	▇▃▁▁▁
DEM	0	1.00	1631.11	767.69	258.65	1175.33	1592.89	2234.26	3618.02	▅▇▇▅▁
MAP	0	1.00	499.37	206.94	193.91	352.77	432.63	590.43	1128.11	▆▇▂▂▁
MAT	0	1.00	8.89	4.10	-0.59	5.88	9.17	12.44	16.87	▃▅▇▇▅
NDVI	0	1.00	0.44	0.16	0.14	0.31	0.42	0.56	0.80	▆▇▆▅▃

Skim only numeric columns

Code

# Keep the columns of interest, then skim only the numeric ones
df <- mf |> dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI)
skim(df[, sapply(df, is.numeric)])

Data summary
Name	df[, sapply(df, is.numeri…
Number of rows	471
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SOC	4	0.99	6.35	5.05	0.41	2.77	4.97	8.71	30.47	▇▃▁▁▁
DEM	0	1.00	1631.11	767.69	258.65	1175.33	1592.89	2234.26	3618.02	▅▇▇▅▁
MAP	0	1.00	499.37	206.94	193.91	352.77	432.63	590.43	1128.11	▆▇▂▂▁
MAT	0	1.00	8.89	4.10	-0.59	5.88	9.17	12.44	16.87	▃▅▇▇▅
NDVI	0	1.00	0.44	0.16	0.14	0.31	0.42	0.56	0.80	▆▇▆▅▃
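An alternative sketch for the same task, shown on the built-in iris data rather than the tutorial data: skim() accepts column selections after the data argument, so numeric columns can be picked with the tidyselect helper where(is.numeric) instead of base subsetting. yank() is another option; it extracts one type from an existing summary. The object names num_summary and num_only are illustrative.

```r
library(skimr)
library(dplyr)

# Select numeric columns inside skim() with a tidyselect helper
num_summary <- skim(iris, where(is.numeric))

# Or skim everything once, then extract only the numeric part
num_only <- skim(iris) |> yank("numeric")

num_summary
```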

Skim individual variables or subsets

Code

skim(df$SOC)

Data summary
Name	df$SOC
Number of rows	471
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	4	0.99	6.35	5.05	0.41	2.77	4.97	8.71	30.47	▇▃▁▁▁
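Skimming a bare vector labels the variable as data, as in the output above. A minimal sketch, on a small made-up data frame (toy and its values are hypothetical), of naming the column inside skim() so that the variable name is kept:

```r
library(skimr)

# Hypothetical stand-in for the tutorial data
toy <- data.frame(
  SOC = c(1.2, 3.4, NA, 5.6),
  DEM = c(300, 1200, 1800, 2500)
)

# Passing the column name after the data keeps it in skim_variable
soc_summary <- skim(toy, SOC)
soc_summary$skim_variable   # "SOC"

# Several columns can be listed the same way
skim(toy, SOC, DEM)
```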

Grouped Data Summaries

Code

# Group by NLCD and summarize
df |> 
  group_by(NLCD) |> 
  skimr::skim()

Data summary
Name	group_by(df, NLCD)
Number of rows	471
Number of columns	6
_______________________
Column type frequency:
numeric	5
________________________
Group variables	NLCD

Variable type: numeric

skim_variable	NLCD	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SOC	Forest	0	1.00	10.43	6.80	1.33	4.93	8.97	15.30	30.47	▇▅▃▂▁
SOC	Herbaceous	1	0.99	5.48	3.93	0.41	2.62	4.61	6.97	18.81	▇▆▂▁▁
SOC	Planted/Cultivated	0	1.00	6.70	3.60	0.46	4.00	6.23	9.42	16.34	▅▇▆▃▁
SOC	Shrubland	3	0.98	4.13	3.74	0.45	1.35	3.00	5.66	19.10	▇▂▁▁▁
DEM	Forest	0	1.00	2567.36	335.76	1623.32	2326.66	2535.11	2769.29	3471.05	▁▆▇▃▂
DEM	Herbaceous	0	1.00	1362.70	589.24	277.66	1203.03	1355.11	1627.98	3618.02	▃▇▃▁▁
DEM	Planted/Cultivated	0	1.00	804.34	489.41	258.65	404.91	732.54	1038.99	2324.50	▇▅▃▁▁
DEM	Shrubland	0	1.00	1889.98	432.52	1013.55	1558.02	1961.23	2183.34	2723.78	▃▆▆▇▃
MAP	Forest	0	1.00	593.09	156.64	330.95	469.25	563.04	684.75	1121.27	▅▇▃▂▁
MAP	Herbaceous	0	1.00	472.54	188.98	205.06	365.02	414.31	469.75	1128.11	▇▇▁▂▁
MAP	Planted/Cultivated	0	1.00	646.73	232.53	193.91	465.05	587.50	827.85	1126.82	▂▇▃▅▂
MAP	Shrubland	0	1.00	353.53	108.73	201.51	288.59	337.54	391.26	1109.41	▇▃▁▁▁
MAT	Forest	0	1.00	4.72	3.04	-0.34	2.23	4.63	6.89	14.78	▇▇▇▂▁
MAT	Herbaceous	0	1.00	10.08	2.86	-0.59	7.69	10.07	12.34	15.46	▁▁▇▇▆
MAT	Planted/Cultivated	0	1.00	11.84	2.06	1.45	11.07	12.44	12.94	14.60	▁▁▁▅▇
MAT	Shrubland	0	1.00	8.28	4.57	0.38	4.71	6.98	12.45	16.87	▅▇▅▂▆
NDVI	Forest	0	1.00	0.57	0.12	0.28	0.53	0.58	0.65	0.78	▃▁▇▇▅
NDVI	Herbaceous	0	1.00	0.40	0.13	0.16	0.31	0.38	0.44	0.73	▃▇▅▂▂
NDVI	Planted/Cultivated	0	1.00	0.53	0.12	0.32	0.45	0.51	0.59	0.80	▅▇▆▂▃
NDVI	Shrubland	0	1.00	0.31	0.13	0.14	0.21	0.27	0.38	0.69	▇▅▂▂▁
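Grouped results can also be post-processed. A hedged sketch on the built-in iris data, assuming the grouping column is carried into the skim() result alongside skim_variable, that compares one statistic across groups (group_means is an illustrative name):

```r
library(skimr)
library(dplyr)

# Skim by group, then compare the mean of one variable across groups
group_means <- iris |>
  group_by(Species) |>
  skim() |>
  tibble::as_tibble() |>
  filter(skim_variable == "Sepal.Length") |>
  select(Species, skim_variable, numeric.mean) |>
  arrange(desc(numeric.mean))

group_means
```

The same pattern should apply to the tutorial data by grouping on NLCD and filtering on a variable such as SOC.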

Customizing Skim Output

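skim_with() lets you customize the summary statistics displayed by skim(). It returns a new skimming function, and you specify which summary functions to use for each data type (numeric, factor, character, and so on) by wrapping them in sfl(). With append = FALSE the listed functions replace the default statistics for that type; with the default append = TRUE they are added to them.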

Code

my_skim <- skim_with(numeric = sfl(median, mad), append = FALSE)
my_skim(df)

Data summary
Name	df
Number of rows	471
Number of columns	6
_______________________
Column type frequency:
character	1
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
NLCD	0	1	6	18	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	median	mad
SOC	4	0.99	NA	NA
DEM	0	1.00	1592.89	874.74
MAP	0	1.00	432.63	152.23
MAT	0	1.00	9.17	4.88
NDVI	0	1.00	0.42	0.19
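The NA values for SOC arise because median() and mad() are called without na.rm = TRUE, and SOC contains missing values. A minimal sketch of one way to handle this, using the formula syntax of sfl() on a small made-up data frame (toy and my_skim_na are hypothetical names):

```r
library(skimr)

# Hypothetical numeric column with one missing value
toy <- data.frame(SOC = c(1, 2, 3, NA, 5))

# Formula-style entries let each summary function drop missing values
my_skim_na <- skim_with(
  numeric = sfl(
    median = ~ median(., na.rm = TRUE),
    mad    = ~ mad(., na.rm = TRUE)
  ),
  append = FALSE
)

res <- my_skim_na(toy)
res$numeric.median   # 2.5
```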

Handling Different Data Types

Code

library(lubridate)

# Create a date column
data <- tibble(
  date = seq(as.Date("2023-01-01"), by = "month", length.out = 6),
  value = rnorm(6)
)

# Skim dates
skim(data)

Data summary
Name	data
Number of rows	6
Number of columns	2
_______________________
Column type frequency:
Date	1
numeric	1
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2023-01-01	2023-06-01	2023-03-16	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
value	0	1	-1.03	0.79	-1.76	-1.6	-1.27	-0.65	0.28	▇▇▁▃▃
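Other column types get their own summaries as well. A minimal sketch on a made-up data frame (land_cover and wet are hypothetical columns) showing that factor and logical columns are each handled by a type-specific set of statistics:

```r
library(skimr)

# Hypothetical example data with a factor and a logical column
toy <- data.frame(
  land_cover = factor(c("Forest", "Shrubland", "Forest", "Herbaceous")),
  wet = c(TRUE, FALSE, TRUE, TRUE)
)

typed <- skim(toy)

# Each row records which type-specific skimmer was applied
typed[, c("skim_variable", "skim_type")]

# Type-specific statistics are NA for columns of another type
typed$factor.n_unique[typed$skim_variable == "land_cover"]  # 3
typed$logical.mean[typed$skim_variable == "wet"]            # 0.75
```

Converting a character column such as NLCD to a factor before skimming would give it the factor summary instead of the character one.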

Summary and Conclusion

This tutorial showed how to use the {skimr} package for exploratory data analysis (EDA) in R: summarizing a whole data frame with skim(), restricting the summary to numeric columns or a single variable, summarizing by group, customizing the reported statistics with skim_with(), and skimming other data types such as dates. You should now be able to use {skimr} to perform EDA on your own datasets and to identify patterns and potential issues, such as missing values, before further analysis.
