This tutorial will guide you through the process of using the {skimr} package for exploratory data analysis (EDA) in R. The {skimr} package is a powerful tool that provides a comprehensive overview of your dataset, allowing you to quickly understand its structure and characteristics. It generates summary statistics and visualizations, making it easier to identify patterns, trends, and potential issues in your data.
{skimr} provides a frictionless approach to summary statistics that conforms to the principle of least surprise. It displays summary statistics that the user can skim quickly to understand their data. It handles different data types and returns a skim_df object that can be included in a pipeline or displayed nicely for the human reader.

# Packages required for this tutorial
packages <- c('tidyverse', 'skimr')

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
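After installation, the packages still need to be attached. The chunk that produced the message below is not preserved in this text, so the following is only a minimal sketch of one way to do it; the `packages` vector is repeated so the block runs on its own.

```r
# Packages used in this tutorial (same vector as defined earlier)
packages <- c("tidyverse", "skimr")

# Attach each package; character.only = TRUE lets library() accept a string
invisible(lapply(packages, library, character.only = TRUE))

# Report which packages are now attached
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])
```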
Successfully loaded packages:
[1] "package:skimr" "package:lubridate" "package:forcats"
[4] "package:stringr" "package:dplyr" "package:purrr"
[7] "package:readr" "package:tidyr" "package:tibble"
[10] "package:ggplot2" "package:tidyverse" "package:stats"
[13] "package:graphics" "package:grDevices" "package:utils"
[16] "package:datasets" "package:methods" "package:base"
All datasets used in this exercise can be downloaded from here. We will use the read_csv() function of the {readr} package to import the data as a tibble.
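As a rough sketch of the import step: the download link is not preserved here, so the example writes a tiny placeholder CSV to a temporary file and reads it back with `read_csv()`. The file contents, the `ID` column, and the object names `mf`, `df`, and `csv_path` are assumptions chosen to mirror the column names that appear in the summaries later in this tutorial. In practice, point `csv_path` at the downloaded file.

```r
library(readr)
library(dplyr)

# Placeholder data standing in for the downloaded file (values are made up)
toy <- data.frame(
  ID   = 1:4,
  NLCD = c("Forest", "Shrubland", "Herbaceous", "Planted/Cultivated"),
  SOC  = c(10.4, NA, 5.5, 6.7),
  DEM  = c(2567, 1890, 1363, 804),
  MAP  = c(593, 354, 473, 647),
  MAT  = c(4.7, 8.3, 10.1, 11.8),
  NDVI = c(0.57, 0.31, 0.40, 0.53)
)
csv_path <- tempfile(fileext = ".csv")  # hypothetical path
write_csv(toy, csv_path)

# read_csv() returns a tibble and guesses the column types
mf <- read_csv(csv_path, show_col_types = FALSE)

# Keep the land-cover class and the five numeric variables
df <- select(mf, NLCD, SOC, DEM, MAP, MAT, NDVI)
glimpse(df)
```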
skim()
The skim() function is the main function provided by the {skimr} package. It generates a summary of the dataset, including key statistics for each variable.
You’ll get output grouped by data type (numeric, factor, etc.), showing:
- Count of missing values
- Mean, sd, min, max, and percentiles
- Histograms (in console!)
- Unique counts (for factors)
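A basic call might look like the sketch below. It uses a small made-up data frame with some of the tutorial's column names (an assumption, so that the block runs on its own), so its numbers will differ from the summary that follows.

```r
library(skimr)

# Made-up stand-in for the tutorial data
set.seed(42)
df <- data.frame(
  NLCD = sample(c("Forest", "Herbaceous", "Shrubland"), 30, replace = TRUE),
  SOC  = c(NA, runif(29, 0.4, 30)),
  NDVI = runif(30, 0.1, 0.8)
)

# skim() summarizes every column, grouped by data type
res <- skim(df)
print(res)

# The result is also a data frame with one row per variable
res$skim_variable
res$n_missing
```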
| Name | dplyr::select(mf, NLCD, S… |
| Number of rows | 471 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| NLCD | 0 | 1 | 6 | 18 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SOC | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
| DEM | 0 | 1.00 | 1631.11 | 767.69 | 258.65 | 1175.33 | 1592.89 | 2234.26 | 3618.02 | ▅▇▇▅▁ |
| MAP | 0 | 1.00 | 499.37 | 206.94 | 193.91 | 352.77 | 432.63 | 590.43 | 1128.11 | ▆▇▂▂▁ |
| MAT | 0 | 1.00 | 8.89 | 4.10 | -0.59 | 5.88 | 9.17 | 12.44 | 16.87 | ▃▅▇▇▅ |
| NDVI | 0 | 1.00 | 0.44 | 0.16 | 0.14 | 0.31 | 0.42 | 0.56 | 0.80 | ▆▇▆▅▃ |
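The next summary is labelled `df[, sapply(df, is.numeri…`, which suggests the data were subset to numeric columns before skimming. A hedged sketch of that pattern, again on made-up data rather than the tutorial dataset:

```r
library(skimr)

# Made-up stand-in data with one character and two numeric columns
set.seed(1)
df <- data.frame(
  NLCD = rep(c("Forest", "Shrubland"), each = 10),
  SOC  = runif(20, 0.4, 30),
  DEM  = runif(20, 250, 3600)
)

# sapply() returns a logical vector marking the numeric columns
num_cols <- sapply(df, is.numeric)

# drop = FALSE keeps a data frame even if only one column is numeric
res_num <- skim(df[, num_cols, drop = FALSE])
res_num
```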
| Name | df[, sapply(df, is.numeri… |
| Number of rows | 471 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SOC | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
| DEM | 0 | 1.00 | 1631.11 | 767.69 | 258.65 | 1175.33 | 1592.89 | 2234.26 | 3618.02 | ▅▇▇▅▁ |
| MAP | 0 | 1.00 | 499.37 | 206.94 | 193.91 | 352.77 | 432.63 | 590.43 | 1128.11 | ▆▇▂▂▁ |
| MAT | 0 | 1.00 | 8.89 | 4.10 | -0.59 | 5.88 | 9.17 | 12.44 | 16.87 | ▃▅▇▇▅ |
| NDVI | 0 | 1.00 | 0.44 | 0.16 | 0.14 | 0.31 | 0.42 | 0.56 | 0.80 | ▆▇▆▅▃ |
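The label `df$SOC` on the next summary indicates that a single column was passed to skim(). A small illustration with made-up values shows why the variable then appears under the name `data`:

```r
library(skimr)

# Made-up values with one missing observation
soc <- c(NA, 2.8, 5.0, 8.7, 30.5, 0.4)

# A bare vector is summarized as a single variable named "data"
res_vec <- skim(soc)
res_vec
```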
| Name | df$SOC |
| Number of rows | 471 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
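The summary below is labelled `group_by(df, NLCD)` and lists NLCD as a group variable: skim() respects {dplyr} groups and reports each statistic once per group. A minimal sketch on made-up data (only the column names are taken from the tutorial):

```r
library(skimr)
library(dplyr)

# Made-up stand-in data with two land-cover classes
set.seed(7)
df <- data.frame(
  NLCD = rep(c("Forest", "Shrubland"), each = 15),
  SOC  = runif(30, 0.4, 30),
  NDVI = runif(30, 0.1, 0.8)
)

# Grouping first gives one summary row per variable and group
res_grp <- skim(group_by(df, NLCD))
res_grp
```

With two numeric variables and two groups, the result has four rows, and the grouping column appears next to `skim_variable`.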
| Name | group_by(df, NLCD) |
| Number of rows | 471 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 5 |
| ________________________ | |
| Group variables | NLCD |
Variable type: numeric
| skim_variable | NLCD | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SOC | Forest | 0 | 1.00 | 10.43 | 6.80 | 1.33 | 4.93 | 8.97 | 15.30 | 30.47 | ▇▅▃▂▁ |
| SOC | Herbaceous | 1 | 0.99 | 5.48 | 3.93 | 0.41 | 2.62 | 4.61 | 6.97 | 18.81 | ▇▆▂▁▁ |
| SOC | Planted/Cultivated | 0 | 1.00 | 6.70 | 3.60 | 0.46 | 4.00 | 6.23 | 9.42 | 16.34 | ▅▇▆▃▁ |
| SOC | Shrubland | 3 | 0.98 | 4.13 | 3.74 | 0.45 | 1.35 | 3.00 | 5.66 | 19.10 | ▇▂▁▁▁ |
| DEM | Forest | 0 | 1.00 | 2567.36 | 335.76 | 1623.32 | 2326.66 | 2535.11 | 2769.29 | 3471.05 | ▁▆▇▃▂ |
| DEM | Herbaceous | 0 | 1.00 | 1362.70 | 589.24 | 277.66 | 1203.03 | 1355.11 | 1627.98 | 3618.02 | ▃▇▃▁▁ |
| DEM | Planted/Cultivated | 0 | 1.00 | 804.34 | 489.41 | 258.65 | 404.91 | 732.54 | 1038.99 | 2324.50 | ▇▅▃▁▁ |
| DEM | Shrubland | 0 | 1.00 | 1889.98 | 432.52 | 1013.55 | 1558.02 | 1961.23 | 2183.34 | 2723.78 | ▃▆▆▇▃ |
| MAP | Forest | 0 | 1.00 | 593.09 | 156.64 | 330.95 | 469.25 | 563.04 | 684.75 | 1121.27 | ▅▇▃▂▁ |
| MAP | Herbaceous | 0 | 1.00 | 472.54 | 188.98 | 205.06 | 365.02 | 414.31 | 469.75 | 1128.11 | ▇▇▁▂▁ |
| MAP | Planted/Cultivated | 0 | 1.00 | 646.73 | 232.53 | 193.91 | 465.05 | 587.50 | 827.85 | 1126.82 | ▂▇▃▅▂ |
| MAP | Shrubland | 0 | 1.00 | 353.53 | 108.73 | 201.51 | 288.59 | 337.54 | 391.26 | 1109.41 | ▇▃▁▁▁ |
| MAT | Forest | 0 | 1.00 | 4.72 | 3.04 | -0.34 | 2.23 | 4.63 | 6.89 | 14.78 | ▇▇▇▂▁ |
| MAT | Herbaceous | 0 | 1.00 | 10.08 | 2.86 | -0.59 | 7.69 | 10.07 | 12.34 | 15.46 | ▁▁▇▇▆ |
| MAT | Planted/Cultivated | 0 | 1.00 | 11.84 | 2.06 | 1.45 | 11.07 | 12.44 | 12.94 | 14.60 | ▁▁▁▅▇ |
| MAT | Shrubland | 0 | 1.00 | 8.28 | 4.57 | 0.38 | 4.71 | 6.98 | 12.45 | 16.87 | ▅▇▅▂▆ |
| NDVI | Forest | 0 | 1.00 | 0.57 | 0.12 | 0.28 | 0.53 | 0.58 | 0.65 | 0.78 | ▃▁▇▇▅ |
| NDVI | Herbaceous | 0 | 1.00 | 0.40 | 0.13 | 0.16 | 0.31 | 0.38 | 0.44 | 0.73 | ▃▇▅▂▂ |
| NDVI | Planted/Cultivated | 0 | 1.00 | 0.53 | 0.12 | 0.32 | 0.45 | 0.51 | 0.59 | 0.80 | ▅▇▆▂▃ |
| NDVI | Shrubland | 0 | 1.00 | 0.31 | 0.13 | 0.14 | 0.21 | 0.27 | 0.38 | 0.69 | ▇▅▂▂▁ |
skim_with() lets you build a customized version of skim() with your own summary statistics. You can specify which functions to use for each data type, such as numeric, factor, and character.
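The output below reports only `median` and `mad` for numeric columns, which is consistent with a call along the lines of the following sketch. The exact call is not shown in the source, so this is an assumption, and the data are made up. SOC shows NA in the output, presumably because `median()` and `mad()` were used without `na.rm = TRUE`; the second half of the sketch shows one way to avoid that.

```r
library(skimr)

# Made-up data: SOC has one missing value
set.seed(3)
df <- data.frame(
  SOC = c(NA, runif(19, 0.4, 30)),
  DEM = runif(20, 250, 3600)
)

# Replace the default numeric summaries with median and MAD.
# append = FALSE drops the defaults (mean, sd, percentiles, hist).
my_skim <- skim_with(
  numeric = sfl(median = median, mad = mad),
  append = FALSE
)
res_default_na <- my_skim(df)  # SOC is NA: median() and mad() keep NA by default

# Formula shorthand lets you pass na.rm = TRUE
my_skim_narm <- skim_with(
  numeric = sfl(
    median = ~ median(., na.rm = TRUE),
    mad    = ~ mad(., na.rm = TRUE)
  ),
  append = FALSE
)
res_narm <- my_skim_narm(df)
res_narm
```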
| Name | df |
| Number of rows | 471 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| NLCD | 0 | 1 | 6 | 18 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | median | mad |
|---|---|---|---|---|
| SOC | 4 | 0.99 | NA | NA |
| DEM | 0 | 1.00 | 1592.89 | 874.74 |
| MAP | 0 | 1.00 | 432.63 | 152.23 |
| MAT | 0 | 1.00 | 9.17 | 4.88 |
| NDVI | 0 | 1.00 | 0.42 | 0.19 |
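skim() also handles Date columns. The final summary appears to come from a six-row data frame named `data` with a monthly `date` column and a random `value` column. The sketch below is a guess at how such a data frame could be built; the seed is arbitrary, so the `value` statistics will not match the output shown.

```r
library(skimr)

set.seed(123)  # arbitrary seed; the tutorial's seed is not shown
data <- data.frame(
  date  = seq(as.Date("2023-01-01"), by = "month", length.out = 6),
  value = rnorm(6)
)

# Date columns get their own block: min, max, median, n_unique
res_date <- skim(data)
res_date
```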
| Name | data |
| Number of rows | 6 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2023-01-01 | 2023-06-01 | 2023-03-16 | 6 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| value | 0 | 1 | -0.04 | 1.12 | -1.79 | -0.53 | 0.17 | 0.43 | 1.48 | ▃▃▃▇▃ |
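Because skim() returns a skim_df, which behaves like a data frame, its result can be passed on to {dplyr} verbs. The following sketch, on made-up data, filters the summary down to variables with missing values and uses `yank()` to extract the numeric block; the object names are illustrative only.

```r
library(skimr)
library(dplyr)

# Made-up data: SOC has two missing values
set.seed(11)
df <- data.frame(
  NLCD = rep(c("Forest", "Shrubland"), each = 10),
  SOC  = c(NA, NA, runif(18, 0.4, 30)),
  DEM  = runif(20, 250, 3600)
)

# Keep only variables that have missing values
with_missing <- skim(df) %>%
  filter(n_missing > 0) %>%
  select(skim_type, skim_variable, n_missing, complete_rate)
with_missing

# yank() extracts the summary for one data type
num_summary <- yank(skim(df), "numeric")
num_summary
```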
This tutorial guided you through using the {skimr} package for exploratory data analysis (EDA) in R. {skimr} provides a compact overview of a dataset, with summary statistics and inline histograms that make it easier to identify patterns, trends, and potential issues such as missing values. You should now be able to use {skimr} to perform EDA on your own datasets and gain insight into their structure and characteristics.