packages <- c('tidyverse',
              'skimr'
)
This tutorial guides you through using the {skimr} package for exploratory data analysis (EDA) in R. {skimr} gives a compact overview of a dataset so that you can quickly understand its structure and characteristics. It reports summary statistics and inline histograms for each variable, making it easier to spot patterns, missing values, and potential issues in your data.
{skimr} provides a frictionless approach to summary statistics that conforms to the principle of least surprise. It displays summary statistics that the user can skim quickly to understand their data. It handles different data types and returns a skim_df object that can be included in a pipeline or displayed nicely for the human reader.
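Because the result of skim() is a data frame, it can be passed on to ordinary dplyr verbs. The following is a minimal sketch of that idea, assuming the built-in iris data as a stand-in for your own data; the column names skim_type, skim_variable, and numeric.mean are the ones skimr uses in its returned object.

```r
library(dplyr)
library(skimr)

# Skim the built-in iris data (a stand-in, not the tutorial's data set),
# keep only the rows describing numeric variables, and pull out their means
numeric_means <- skim(iris) |>
  dplyr::filter(skim_type == "numeric") |>
  dplyr::select(skim_variable, numeric.mean)

numeric_means
```

The same pattern works for any statistic skim() reports, since each one is stored as a regular column prefixed with its variable type.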
#| warning: false
#| error: false
# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
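The listing that follows reports which packages are attached to the session. The loading code itself is not shown here, so the block below is only a sketch of a step that would produce a listing of that form, assuming the same two packages are already installed.

```r
# Same package vector as defined earlier in the tutorial
packages <- c("tidyverse", "skimr")

# Attach each package quietly; character.only = TRUE lets library()
# accept the package name as a string
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

# Report everything that is now attached to the search path
cat("Successfully loaded packages:\n")
print(search()[grepl("^package:", search())])
```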
Successfully loaded packages:
[1] "package:skimr" "package:lubridate" "package:forcats"
[4] "package:stringr" "package:dplyr" "package:purrr"
[7] "package:readr" "package:tidyr" "package:tibble"
[10] "package:ggplot2" "package:tidyverse" "package:stats"
[13] "package:graphics" "package:grDevices" "package:utils"
[16] "package:datasets" "package:methods" "package:base"
The data set used in this exercise can be downloaded from my Dropbox or from my GitHub account.
We will use the read_csv() function from the readr package to import the data as a tibble.
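As a rough illustration of the import step, the sketch below reads a CSV file with read_csv(). The download link is not reproduced here, so a tiny temporary file with made-up values stands in for the real data; only the column names NLCD and SOC are taken from the summaries shown later, and csv_path is a hypothetical name.

```r
library(readr)

# Hypothetical stand-in file: two rows of made-up values written to a
# temporary path, so the example runs without the tutorial's download link
csv_path <- tempfile(fileext = ".csv")
writeLines(c("NLCD,SOC", "Forest,1.5", "Shrubland,"), csv_path)

# read_csv() returns a tibble; the empty SOC field is read as NA
mf <- read_csv(csv_path, show_col_types = FALSE)
mf
```

With the real file, replace csv_path with the path or URL of the downloaded CSV.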
skim()
The skim() function is the main function provided by the skimr package. It generates a summary of the dataset, including key statistics for each variable.
You’ll get output grouped by data type (numeric, factor, etc.), showing:
Count of missing values
Mean, sd, min, max, and percentiles
Histograms (in console!)
Unique counts (for factors)
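The first summary below is named after a dplyr::select() call on mf, which suggests that a subset of columns was piped into skim() with the native pipe. A minimal sketch of a call of that shape follows, with the built-in iris data standing in for mf so that the block runs on its own; the selected columns are therefore iris columns, not the tutorial's.

```r
library(dplyr)
library(skimr)

# iris stands in for the tutorial's mf data frame (an assumption for
# illustration); Species is a factor here, whereas NLCD is character
mf <- iris

# Select a few columns and summarize them; with the native pipe, the
# summary's Name field shows the select() call itself
overview <- mf |>
  dplyr::select(Species, Sepal.Length, Petal.Length) |>
  skim()

overview
```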
Name | dplyr::select(mf, NLCD, S… |
Number of rows | 471 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
NLCD | 0 | 1 | 6 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SOC | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
DEM | 0 | 1.00 | 1631.11 | 767.69 | 258.65 | 1175.33 | 1592.89 | 2234.26 | 3618.02 | ▅▇▇▅▁ |
MAP | 0 | 1.00 | 499.37 | 206.94 | 193.91 | 352.77 | 432.63 | 590.43 | 1128.11 | ▆▇▂▂▁ |
MAT | 0 | 1.00 | 8.89 | 4.10 | -0.59 | 5.88 | 9.17 | 12.44 | 16.87 | ▃▅▇▇▅ |
NDVI | 0 | 1.00 | 0.44 | 0.16 | 0.14 | 0.31 | 0.42 | 0.56 | 0.80 | ▆▇▆▅▃ |
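The next summary covers only the numeric columns; its Name field begins with df[, sapply(df, is.numeri. A call of that form can be sketched as below, again assuming iris as a stand-in for df.

```r
library(skimr)

# iris stands in for the tutorial's df (an assumption for illustration)
df <- iris

# sapply(df, is.numeric) gives one TRUE/FALSE per column; indexing with it
# keeps the numeric columns only, which are then passed to skim()
numeric_summary <- skim(df[, sapply(df, is.numeric)])

numeric_summary
```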
Name | df[, sapply(df, is.numeri… |
Number of rows | 471 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SOC | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
DEM | 0 | 1.00 | 1631.11 | 767.69 | 258.65 | 1175.33 | 1592.89 | 2234.26 | 3618.02 | ▅▇▇▅▁ |
MAP | 0 | 1.00 | 499.37 | 206.94 | 193.91 | 352.77 | 432.63 | 590.43 | 1128.11 | ▆▇▂▂▁ |
MAT | 0 | 1.00 | 8.89 | 4.10 | -0.59 | 5.88 | 9.17 | 12.44 | 16.87 | ▃▅▇▇▅ |
NDVI | 0 | 1.00 | 0.44 | 0.16 | 0.14 | 0.31 | 0.42 | 0.56 | 0.80 | ▆▇▆▅▃ |
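skim() also accepts a single vector, as in the df$SOC summary that follows. In that case the variable is reported under the generic name data, because a bare vector carries no column name. The sketch below shows this with iris as a stand-in for df, and also shows the alternative of naming the column inside skim(), which keeps the original name.

```r
library(skimr)

# iris stands in for the tutorial's df (an assumption for illustration)
df <- iris

# Passing a bare vector: the variable is labelled "data" in the result
vector_summary <- skim(df$Sepal.Length)

# Passing the data frame plus a column name: the label is kept
column_summary <- skim(df, Sepal.Length)

vector_summary$skim_variable
column_summary$skim_variable
```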
Name | df$SOC |
Number of rows | 471 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 4 | 0.99 | 6.35 | 5.05 | 0.41 | 2.77 | 4.97 | 8.71 | 30.47 | ▇▃▁▁▁ |
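skim() respects dplyr groups: when the input is grouped, each variable is summarized once per group, as in the NLCD-grouped summary that follows. A minimal sketch, assuming iris as a stand-in for df and Species as the grouping variable in place of NLCD:

```r
library(dplyr)
library(skimr)

# iris stands in for the tutorial's df (an assumption for illustration)
df <- iris

# Group first, then skim: one row per variable per group
grouped_summary <- df |>
  group_by(Species) |>
  skim()

grouped_summary
```

The grouping column appears next to skim_variable in the result, so the grouped summary can be filtered or plotted like any other data frame.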
Name | group_by(df, NLCD) |
Number of rows | 471 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
numeric | 5 |
________________________ | |
Group variables | NLCD |
Variable type: numeric
skim_variable | NLCD | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
SOC | Forest | 0 | 1.00 | 10.43 | 6.80 | 1.33 | 4.93 | 8.97 | 15.30 | 30.47 | ▇▅▃▂▁ |
SOC | Herbaceous | 1 | 0.99 | 5.48 | 3.93 | 0.41 | 2.62 | 4.61 | 6.97 | 18.81 | ▇▆▂▁▁ |
SOC | Planted/Cultivated | 0 | 1.00 | 6.70 | 3.60 | 0.46 | 4.00 | 6.23 | 9.42 | 16.34 | ▅▇▆▃▁ |
SOC | Shrubland | 3 | 0.98 | 4.13 | 3.74 | 0.45 | 1.35 | 3.00 | 5.66 | 19.10 | ▇▂▁▁▁ |
DEM | Forest | 0 | 1.00 | 2567.36 | 335.76 | 1623.32 | 2326.66 | 2535.11 | 2769.29 | 3471.05 | ▁▆▇▃▂ |
DEM | Herbaceous | 0 | 1.00 | 1362.70 | 589.24 | 277.66 | 1203.03 | 1355.11 | 1627.98 | 3618.02 | ▃▇▃▁▁ |
DEM | Planted/Cultivated | 0 | 1.00 | 804.34 | 489.41 | 258.65 | 404.91 | 732.54 | 1038.99 | 2324.50 | ▇▅▃▁▁ |
DEM | Shrubland | 0 | 1.00 | 1889.98 | 432.52 | 1013.55 | 1558.02 | 1961.23 | 2183.34 | 2723.78 | ▃▆▆▇▃ |
MAP | Forest | 0 | 1.00 | 593.09 | 156.64 | 330.95 | 469.25 | 563.04 | 684.75 | 1121.27 | ▅▇▃▂▁ |
MAP | Herbaceous | 0 | 1.00 | 472.54 | 188.98 | 205.06 | 365.02 | 414.31 | 469.75 | 1128.11 | ▇▇▁▂▁ |
MAP | Planted/Cultivated | 0 | 1.00 | 646.73 | 232.53 | 193.91 | 465.05 | 587.50 | 827.85 | 1126.82 | ▂▇▃▅▂ |
MAP | Shrubland | 0 | 1.00 | 353.53 | 108.73 | 201.51 | 288.59 | 337.54 | 391.26 | 1109.41 | ▇▃▁▁▁ |
MAT | Forest | 0 | 1.00 | 4.72 | 3.04 | -0.34 | 2.23 | 4.63 | 6.89 | 14.78 | ▇▇▇▂▁ |
MAT | Herbaceous | 0 | 1.00 | 10.08 | 2.86 | -0.59 | 7.69 | 10.07 | 12.34 | 15.46 | ▁▁▇▇▆ |
MAT | Planted/Cultivated | 0 | 1.00 | 11.84 | 2.06 | 1.45 | 11.07 | 12.44 | 12.94 | 14.60 | ▁▁▁▅▇ |
MAT | Shrubland | 0 | 1.00 | 8.28 | 4.57 | 0.38 | 4.71 | 6.98 | 12.45 | 16.87 | ▅▇▅▂▆ |
NDVI | Forest | 0 | 1.00 | 0.57 | 0.12 | 0.28 | 0.53 | 0.58 | 0.65 | 0.78 | ▃▁▇▇▅ |
NDVI | Herbaceous | 0 | 1.00 | 0.40 | 0.13 | 0.16 | 0.31 | 0.38 | 0.44 | 0.73 | ▃▇▅▂▂ |
NDVI | Planted/Cultivated | 0 | 1.00 | 0.53 | 0.12 | 0.32 | 0.45 | 0.51 | 0.59 | 0.80 | ▅▇▆▂▃ |
NDVI | Shrubland | 0 | 1.00 | 0.31 | 0.13 | 0.14 | 0.21 | 0.27 | 0.38 | 0.69 | ▇▅▂▂▁ |
skim_with() allows you to customize the summary statistics displayed by skim(). You can specify which functions to use for numeric, factor, and character data types.
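The customized summary in this section reports only median and mad for numeric columns, and SOC shows NA for both. That is what median() and mad() return when a column contains missing values and na.rm is not set. The sketch below reproduces this pattern with the built-in airquality data, which also contains missing values, and then shows one way to avoid the NA results. The names my_skim and my_skim_narm are hypothetical.

```r
library(skimr)

# Replace the default numeric statistics with median and mad;
# append = FALSE drops the defaults (mean, sd, percentiles, hist)
my_skim <- skim_with(
  numeric = sfl(median = median, mad = mad),
  append = FALSE
)
custom_summary <- my_skim(airquality)   # Ozone has NAs, so its median is NA

# Same idea, but with na.rm = TRUE so missing values are ignored
my_skim_narm <- skim_with(
  numeric = sfl(
    median = function(x) median(x, na.rm = TRUE),
    mad    = function(x) mad(x, na.rm = TRUE)
  ),
  append = FALSE
)
robust_summary <- my_skim_narm(airquality)

custom_summary
robust_summary
```

skim_with() returns a new function, so the customized skimmer can be reused on any data frame in the same way as skim().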
Name | df |
Number of rows | 471 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
NLCD | 0 | 1 | 6 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | median | mad |
---|---|---|---|---|
SOC | 4 | 0.99 | NA | NA |
DEM | 0 | 1.00 | 1592.89 | 874.74 |
MAP | 0 | 1.00 | 432.63 | 152.23 |
MAT | 0 | 1.00 | 9.17 | 4.88 |
NDVI | 0 | 1.00 | 0.42 | 0.19 |
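The last summary covers a six-row data frame named data with one Date column and one numeric column, which shows that skim() has a dedicated set of statistics for dates. The code that built it is not shown, so the block below is a sketch inferred from the reported minimum, maximum, and number of unique dates; the random values and the seed are assumptions, so the numeric row will not match the output exactly.

```r
library(skimr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Six monthly dates from 2023-01-01 to 2023-06-01 plus random values
data <- data.frame(
  date  = seq(as.Date("2023-01-01"), by = "month", length.out = 6),
  value = rnorm(6)
)

# Date columns get min, max, median and n_unique instead of mean and sd
date_summary <- skim(data)

date_summary
```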
Name | data |
Number of rows | 6 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2023-01-01 | 2023-06-01 | 2023-03-16 | 6 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
value | 0 | 1 | -1.03 | 0.79 | -1.76 | -1.6 | -1.27 | -0.65 | 0.28 | ▇▇▁▃▃ |
This tutorial walked through using the {skimr} package for exploratory data analysis (EDA) in R: summarizing a whole data frame with skim(), restricting the summary to selected or numeric columns, summarizing by group, and customizing the reported statistics with skim_with(). You should now be able to use {skimr} to perform EDA on your own datasets and gain insight into their structure, missing values, and distributions.