Data Exploration with {dlookr} in R

This notebook is a tutorial on how to use the {dlookr} package for exploratory data analysis (EDA) in R. The {dlookr} package provides a comprehensive set of tools for diagnosing, exploring, and transforming data. It includes functions for data quality diagnosis, descriptive statistics, correlation analysis, and data transformation.

Introduction

The {dlookr} is a collection of tools that support data diagnosis, exploration, and transformation. Data diagnostics provides information and visualization of missing values and outliers and unique and negative values to help you understand the distribution and quality of your data. Data exploration provides information and visualization of the descriptive statistics of univariate variables, normality tests and outliers, correlation of two variables, and relationship between target variable and predictor. Data transformation supports binning for categorizing continuous variables, imputates missing values and outliers, resolving skewness. And it creates automated reports that support these three tasks.

Features:

Diagnose data quality.
Find appropriate scenarios to pursuit the follow-up analysis through data exploration and understanding.
Derive new variables or perform variable transformations.
Automatically generate reports for the above three tasks.
Supports quality diagnosis and EDA of table of DBMS

Usage

Data quality diagnosis for data.frame, tbl_df, and table of DBMS
Exploratory Data Analysis for data.frame, tbl_df, and table of DBMS
Data Transformation
Data diagnosis and EDA for table of DBMS

Installation

install.packages(c("nloptr", "lme4", "jomo", "mitml", 'mice', 'devtools'))
devtools::install_github("choonghyunryu/dlookr")

Check and Install Required Packages

Code

packages <- c('tidyverse', 
         'dlookr',
         'flextable'
         )

#| warning: false
#| error: false

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Verify Installation

Code

# Verify installation
cat("Installed packages:\n")

Installed packages:

Code

print(sapply(packages, requireNamespace, quietly = TRUE))

Registered S3 methods overwritten by 'dlookr':
  method          from  
  plot.transform  scales
  print.transform scales

tidyverse    dlookr flextable 
     TRUE      TRUE      TRUE

Load Packages

Code

# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Check Loaded Packages

Code

# Check loaded packages
cat("Successfully loaded packages:\n")

Successfully loaded packages:

Code

print(search()[grepl("package:", search())])

 [1] "package:flextable" "package:dlookr"    "package:lubridate"
 [4] "package:forcats"   "package:stringr"   "package:dplyr"    
 [7] "package:purrr"     "package:readr"     "package:tidyr"    
[10] "package:tibble"    "package:ggplot2"   "package:tidyverse"
[13] "package:stats"     "package:graphics"  "package:grDevices"
[16] "package:utils"     "package:datasets"  "package:methods"  
[19] "package:base"

Data

The data set use in this exercise can be downloaded from my Dropbox or from my Github account.

We will use read_csv() function of readr package to import data as a tidy data.

Code

mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/gp_soil_data_na.csv")
as_df<-read.csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/rice_arsenic_data.csv", header=TRUE)

Data Quality Diagnosis

General diagnosis of all variables

Data Quality Diagnosis is the first step before any statistical analysis. We use diagnose() function of dlookr package to do general General diagnosis of all variables.

The variables of the tbl_df object returned by diagnose () are as follows:

variables : variable names
types : the data type of the variables
missing_count : number of missing values
missing_percent : percentage of missing values
unique_count : number of unique values
unique_rate : rate of unique value. unique_count / number of observation

Code

#library (flextable)
dlookr::diagnose(mf) |>  
  flextable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
ID	numeric	0	0.0000000	471	1.000000000
FIPS	numeric	0	0.0000000	172	0.365180467
STATE_ID	numeric	0	0.0000000	4	0.008492569
STATE	character	0	0.0000000	4	0.008492569
COUNTY	character	0	0.0000000	161	0.341825902
Longitude	numeric	0	0.0000000	471	1.000000000
Latitude	numeric	0	0.0000000	471	1.000000000
SOC	numeric	4	0.8492569	457	0.970276008
DEM	numeric	0	0.0000000	464	0.985138004
Aspect	numeric	0	0.0000000	464	0.985138004
Slope	numeric	0	0.0000000	464	0.985138004
TPI	numeric	0	0.0000000	464	0.985138004
KFactor	numeric	0	0.0000000	386	0.819532909
MAP	numeric	0	0.0000000	464	0.985138004
MAT	numeric	0	0.0000000	463	0.983014862
NDVI	numeric	0	0.0000000	464	0.985138004
SiltClay	numeric	0	0.0000000	462	0.980891720
NLCD	character	0	0.0000000	4	0.008492569
FRG	character	0	0.0000000	6	0.012738854

Missing Value(NA) : Variables with many missing values, i.e. those with a missing_percent close to 100, should be excluded from the analysis.

Unique value : Variables with a unique value (unique_count = 1) are considered to be excluded from data analysis. And if the data type is not numeric (integer, numeric) and the number of unique values is equal to the number of observations (unique_rate = 1), then the variable is likely to be an identifier. Therefore, this variable is also not suitable for the analysis model.

Diagnosis of Numeric Variables

We may use diagnose_numeric(), diagnoses numeric(continuous and discrete) variables in a data frame returns more diagnostic information such as:

min : minimum value
Q1 : 1/4 quartile, 25th percentile
mean : arithmetic mean
median : median, 50th percentile
Q3 : 3/4 quartile, 75th percentile
max : maximum value
zero : number of observations with a value of 0
minus : number of observations with negative numbers
outlier : number of outliers

Code

# First select  numerical columns
mf |> 
  dplyr::select(SOC, DEM, Slope, Aspect, TPI, KFactor, MAP, MAT, NDVI, SiltClay)  |> 
# then diagnose them
  dlookr::diagnose_numeric()  |> 
  flextable()

variables	min	Q1	mean	median	Q3	max	minus	outlier
SOC	0.4080000	2.7695000	6.3507623126	4.97100000	8.7135000	30.4730000	0	20
DEM	258.6488037	1,175.3313595	1,631.1063060667	1,592.89318800	2,234.2648930	3,618.0241700	0	0
Slope	0.6492527	1.4506671	4.8267398902	2.72667742	7.1070788	26.1041622	0	20
Aspect	86.8945694	148.8052292	165.4676589153	164.07072450	179.0842895	255.8335266	0	8
TPI	-26.7086506	-0.8160543	-0.0006690991	-0.04758827	0.8490718	16.7062569	241	85
KFactor	0.0500000	0.1933357	0.2558965090	0.28000000	0.3200000	0.4300000	0	0
MAP	193.9132233	352.7745056	499.3729530231	432.63040160	590.4269104	1,128.1145020	0	17
MAT	-0.5910638	5.8800533	8.8855210982	9.17283535	12.4442859	16.8742866	6	0
NDVI	0.1424335	0.3053468	0.4354311144	0.41568252	0.5559025	0.7969922	0	0
SiltClay	9.1619568	42.7587299	53.6779156034	52.11276627	62.8508625	89.8344116	0	3

Diagnosis of Categorical Variables

diagnose_category() diagnoses the categorical(factor, ordered, character) variables of a data frame. The usage is similar to diagnose() but returns more diagnostic information such as:

variables : variable names
levels: level names
N : number of observation
freq : number of observation at the levels
ratio : percentage of observation at the levels
rank : rank of occupancy ratio of levels

Code

mf  |>  
# Select categorical variables
  dplyr::select(STATE, NLCD,FRG)  |> 
# then diagnose them
  dlookr::diagnose_category() |>  
  flextable()

variables	levels	N	freq	ratio	rank
STATE	Colorado	471	136	28.874735	1
STATE	Wyoming	471	120	25.477707	2
STATE	New Mexico	471	109	23.142251	3
STATE	Kansas	471	106	22.505308	4
NLCD	Herbaceous	471	151	32.059448	1
NLCD	Shrubland	471	130	27.600849	2
NLCD	Planted/Cultivated	471	97	20.594480	3
NLCD	Forest	471	93	19.745223	4
FRG	Fire Regime Group II	471	252	53.503185	1
FRG	Fire Regime Group III	471	100	21.231423	2
FRG	Fire Regime Group IV	471	75	15.923567	3
FRG	Fire Regime Group I	471	19	4.033970	4
FRG	Fire Regime Group V	471	18	3.821656	5
FRG	Indeterminate FRG	471	7	1.486200	6

Diagnosing Outliers

diagnose_outlier() diagnoses the outliers of the numeric (continuous and discrete) variables of the data frame.

outliers_cnt : number of outliers
outliers_ratio : percent of outliers
outliers_mean : arithmetic average of outliers
with_mean : arithmetic average of with outliers
without_mean : arithmetic average of without outliers

The diagnose_outlier() produces outlier information for diagnosing the quality of the numerical data.

Code

mf  |>  
  dlookr::diagnose_outlier(SOC, DEM, SOC, Slope, 
                           Aspect, TPI, KFactor, MAP, MAT, NDVI, SiltClay)

# A tibble: 10 × 6
   variables outliers_cnt outliers_ratio outliers_mean   with_mean without_mean
   <chr>            <int>          <dbl>         <dbl>       <dbl>        <dbl>
 1 SOC                 20          4.25         21.1      6.35           5.69  
 2 DEM                  0          0           NaN     1631.          1631.    
 3 Slope               20          4.25         18.9      4.83           4.20  
 4 Aspect               8          1.70        224.     165.           164.    
 5 TPI                 85         18.0           0.291   -0.000669      -0.0649
 6 KFactor              0          0           NaN        0.256          0.256 
 7 MAP                 17          3.61       1049.     499.           479.    
 8 MAT                  0          0           NaN        8.89           8.89  
 9 NDVI                 0          0           NaN        0.435          0.435 
10 SiltClay             3          0.637         9.71    53.7           54.0

Visualization of Outliers

plot_outlier() visualizes outliers of numerical variables(continuous and discrete) of data.frame. Usage is the same diagnose().

The plot derived from the numerical data diagnosis is as follows.

With outliers box plot
Without outliers box plot
With outliers histogram
Without outliers histogram

The following example uses plot_outlier() after diagnose_outlier(), and filter and select functions with dplyr packages to visualize this with an outlier ratio of 0.5% or higher.

Code

mf  |> 
  dlookr::plot_outlier(dlookr::diagnose_outlier(mf,SOC) |> 
                 dplyr::filter(outliers_ratio >= 0.5)  |>  
                 dplyr::select(variables)  |>  
                 unlist())

Daignosis Normality

Normality Test

normality() function of dlookr performs a normality test on multiple numerical data. Shapiro-Wilk normality test is performed. When the number of observations is greater than 5000, it is tested after extracting 5000 samples by random simple sampling.

The variables of tbl_df object returned by normality() are as follows.

statistic : Statistics of the Shapiro-Wilk test
p_value : p-value of the Shapiro-Wilk test
sample : Number of sample observations performed Shapiro-Wilk test

Code

mf  |>  
  dplyr::select(SOC, DEM, MAP, MAT, NDVI)  |> 
  dlookr::normality()  |> 
  # sort variables that do not follow a normal distribution in order of p_value:
  dplyr::filter(p_value <= 0.01)  |>  
  dplyr::arrange(abs(p_value))  |> 
  flextable()

vars	statistic	p_value	sample
SOC	0.8723726	0.0000000000000000003914388	471
MAP	0.8970027	0.0000000000000000287526353	471
NDVI	0.9698609	0.0000000291124642567419318	471
DEM	0.9731601	0.0000001328862497056715322	471
MAT	0.9732952	0.0000001417229160828437308	471

The normality() function supports the group_by() function syntax in the dplyr package.

Code

mf %>% 
  dplyr::group_by(NLCD)  |> 
  dlookr::normality(SOC)  |> 
  dplyr:: arrange(desc(p_value))  |> 
  flextable()

variable	NLCD	statistic	p_value	sample
SOC	Planted/Cultivated	0.9693901	0.02290858185622282	97
SOC	Forest	0.9264632	0.00005866058197987	93
SOC	Herbaceous	0.8892600	0.00000000342072181	151
SOC	Shrubland	0.8207045	0.00000000003809702	130

Visualization of Normality

We may also use plot_normality() function of dlookr package to visualizes the normality of numeric data. The information that plot_normality() visualizes is as follows.

Histogram of original data
Q-Q plot of original data
histogram of log transformed data
Histogram of square root transformed data

Code

mf |>  dlookr::plot_normality(SOC)

Descriptive Statistics

The describe() function from dloookr package computes descriptive statistics for numerical data. The descriptive statistics help determine the distribution of numerical variables.

The variables of the tbl_df object returned by describe() are as follows.

n : number of observations excluding missing values
na : number of missing values
mean : arithmetic average
sd : standard deviation
se_mean : standard error mean. sd/sqrt(n)
IQR : interquartile range (Q3-Q1)
skewness : skewness
kurtosis : kurtosis
p25 : Q1. 25% percentile
p50 : median. 50% percentile
p75 : Q3. 75% percentile
p01, p05, p10, p20, p30 : 1%, 5%, 20%, 30% percentiles
p40, p60, p70, p80 : 40%, 60%, 70%, 80% percentiles
p90, p95, p99, p100 : 90%, 95%, 99%, 100% percentiles

Code

# First select  numerical columns
des.stata<-mf  |>  
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |> 
 # then descrive them
  dlookr::describe()
flextable(des.stata)

described_variables	n	na	mean	sd	se_mean	IQR	skewness	kurtosis	p00	p01	p05	p10	p20	p25	p30	p40	p50	p60	p70	p75	p80	p90	p95	p99	p100
SOC	467	4	6.3507623	5.0454091	0.233473691	5.9440000	1.46472837	2.4271923	0.4080000	0.4909400	0.9637000	1.2902000	2.3294000	2.7695000	3.1114000	3.9906000	4.9710000	6.1266000	7.5030000	8.7135000	10.0522000	13.3830000	16.5219000	22.1247600	30.4730000
DEM	471	0	1,631.1063061	767.6923254	35.373395140	1,058.9335335	-0.02350235	-0.8039161	258.6488037	288.9806610	353.7696533	441.8609924	925.0328369	1,175.3313595	1,271.5794680	1,400.3239750	1,592.8931880	1,876.8692630	2,164.8010250	2,234.2648930	2,334.0241700	2,620.1455080	2,797.0867920	3,157.0538086	3,618.0241700
MAP	471	0	499.3729530	206.9359198	9.535103866	237.6524048	1.08253930	0.4698226	193.9132233	205.0283020	261.5091248	290.6307068	340.8447266	352.7745056	371.9525452	404.3808899	432.6304016	471.3896484	557.4978027	590.4269104	663.0267944	835.6693726	927.7701416	1,102.3618041	1,128.1145020
MAT	471	0	8.8855211	4.0981336	0.188832030	6.5642326	-0.27522458	-0.8236567	-0.5910638	-0.1158469	1.6258606	2.9455154	5.0193229	5.8800533	6.8264499	7.5748353	9.1728353	10.5665998	11.8026342	12.4442859	12.7931766	13.9399471	14.6389923	16.2291578	16.8742866
NDVI	471	0	0.4354311	0.1620239	0.007465669	0.2505557	0.23375088	-0.9180418	0.1424335	0.1631678	0.1920738	0.2215059	0.2756113	0.3053468	0.3317074	0.3769788	0.4156825	0.4773465	0.5348566	0.5559025	0.5853671	0.6759516	0.7216224	0.7601239	0.7969922

The describe() function supports the group_by() function syntax of the dplyr package. Following function calculate descriptive testatrices of SOC and NDVI of different NLCD

Code

mf %>%
  group_by(NLCD) |> 
  dlookr::describe(SOC, NDVI) |> 
  flextable()

described_variables	NLCD	n	na	mean	sd	se_mean	IQR	skewness	kurtosis	p00	p01	p05	p10	p20	p25	p30	p40	p50	p60	p70	p75	p80	p90	p95	p99	p100
NDVI	Forest	93	0	0.5705648	0.1155016	0.01197696	0.1165678	-0.6719584	0.1157272	0.2830779	0.2856635	0.3427034	0.3610759	0.5045352	0.5326702	0.5390721	0.5576768	0.5759758	0.6085463	0.6290170	0.6492380	0.6708930	0.7010803	0.7353695	0.7711932	0.7814745
NDVI	Herbaceous	151	0	0.4003131	0.1307054	0.01063666	0.1257634	0.9764992	0.4084170	0.1648289	0.1785699	0.2555612	0.2651424	0.2944161	0.3124843	0.3308861	0.3497368	0.3769788	0.3934043	0.4193905	0.4382476	0.4771240	0.6033114	0.6916735	0.7305025	0.7337248
NDVI	Planted/Cultivated	97	0	0.5332255	0.1213052	0.01231668	0.1373269	0.5177150	-0.6469923	0.3249635	0.3262391	0.3740784	0.3937608	0.4236221	0.4498498	0.4584951	0.4879352	0.5132312	0.5297822	0.5670107	0.5871767	0.6677204	0.7226615	0.7493749	0.7969922	0.7969922
NDVI	Shrubland	130	0	0.3065798	0.1295559	0.01136280	0.1688918	1.1160872	0.4843057	0.1424335	0.1501656	0.1663934	0.1871671	0.2014691	0.2101985	0.2158819	0.2352937	0.2691359	0.2868186	0.3441049	0.3790904	0.4145865	0.5239458	0.5461648	0.6790463	0.6939532
SOC	Forest	93	0	10.4308817	6.8021471	0.70534979	10.3720000	0.7778065	-0.1552659	1.3330000	1.3578400	2.3656000	3.3538000	4.4686000	4.9310000	5.1440000	6.5884000	8.9740000	11.1936000	13.7130000	15.3030000	16.6020000	20.8730000	22.2096000	28.1831200	30.4730000
SOC	Herbaceous	150	1	5.4769667	3.9250913	0.32048236	4.3425000	1.2775572	1.4077233	0.4080000	0.5794500	1.0895000	1.4265000	2.2820000	2.6240000	3.0726000	3.5692000	4.6090000	5.2000000	6.3490000	6.9665000	8.2500000	11.0768000	13.3133500	17.5073200	18.8140000
SOC	Planted/Cultivated	97	0	6.6967216	3.5983014	0.36535215	5.4170000	0.5350421	-0.2587066	0.4620000	0.7192800	1.6380000	2.4642000	3.6094000	4.0020000	4.3260000	5.3820000	6.2300000	7.2166000	8.0810000	9.4190000	10.1170000	11.3962000	13.1846000	15.7638400	16.3360000
SOC	Shrubland	127	3	4.1307638	3.7448591	0.33230251	4.3105000	1.7061821	2.9893653	0.4460000	0.4746400	0.6162000	0.8046000	1.1344000	1.3500000	1.6174000	2.4386000	2.9960000	3.7474000	4.9050000	5.6605000	6.2840000	9.0786000	12.2261000	15.8518000	19.0990000

Correlation Analysis

correlate() calculates the correlation coefficient of all combinations of several numerical variables as follows:

Code

# First select  numerical columns
mf  |>  
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |> 
# then diagnose them
  dlookr::correlate() |>  
  flextable()

var1	var2	coef_corr
DEM	SOC	0.16668949
MAP	SOC	0.49886194
MAT	SOC	-0.35802586
NDVI	SOC	0.58704521
SOC	DEM	0.16668949
MAP	DEM	-0.30672789
MAT	DEM	-0.80753239
NDVI	DEM	-0.06733319
SOC	MAP	0.49886194
DEM	MAP	-0.30672789
MAT	MAP	0.06032649
NDVI	MAP	0.80528269
SOC	MAT	-0.35802586
DEM	MAT	-0.80753239
MAP	MAT	0.06032649
NDVI	MAT	-0.20967449
SOC	NDVI	0.58704521
DEM	NDVI	-0.06733319
MAP	NDVI	0.80528269
MAT	NDVI	-0.20967449

The correlate() also supports the group_by() function syntax in the dplyr package.

Code

mf %>% 
  group_by(NLCD)  |> 
  dplyr::select(SOC, DEM, MAP, MAT, NDVI)  |> 
# then diagnose them
  dlookr::correlate() |> 
  flextable()

NLCD	var1	var2	coef_corr
Forest	DEM	SOC	0.30298598
Forest	MAP	SOC	0.48134776
Forest	MAT	SOC	-0.46258320
Forest	NDVI	SOC	0.39140504
Forest	SOC	DEM	0.30298598
Forest	MAP	DEM	0.41052474
Forest	MAT	DEM	-0.71735792
Forest	NDVI	DEM	0.39660074
Forest	SOC	MAP	0.48134776
Forest	DEM	MAP	0.41052474
Forest	MAT	MAP	-0.62794270
Forest	NDVI	MAP	0.63598331
Forest	SOC	MAT	-0.46258320
Forest	DEM	MAT	-0.71735792
Forest	MAP	MAT	-0.62794270
Forest	NDVI	MAT	-0.48149700
Forest	SOC	NDVI	0.39140504
Forest	DEM	NDVI	0.39660074
Forest	MAP	NDVI	0.63598331
Forest	MAT	NDVI	-0.48149700
Herbaceous	DEM	SOC	-0.22121181
Herbaceous	MAP	SOC	0.48276882
Herbaceous	MAT	SOC	-0.10104174
Herbaceous	NDVI	SOC	0.52751680
Herbaceous	SOC	DEM	-0.22121181
Herbaceous	MAP	DEM	-0.57025818
Herbaceous	MAT	DEM	-0.66962468
Herbaceous	NDVI	DEM	-0.59258641
Herbaceous	SOC	MAP	0.48276882
Herbaceous	DEM	MAP	-0.57025818
Herbaceous	MAT	MAP	0.35544711
Herbaceous	NDVI	MAP	0.87783385
Herbaceous	SOC	MAT	-0.10104174
Herbaceous	DEM	MAT	-0.66962468
Herbaceous	MAP	MAT	0.35544711
Herbaceous	NDVI	MAT	0.18936050
Herbaceous	SOC	NDVI	0.52751680
Herbaceous	DEM	NDVI	-0.59258641
Herbaceous	MAP	NDVI	0.87783385
Herbaceous	MAT	NDVI	0.18936050
Planted/Cultivated	DEM	SOC	-0.31880666
Planted/Cultivated	MAP	SOC	0.42971838
Planted/Cultivated	MAT	SOC	0.06276231
Planted/Cultivated	NDVI	SOC	0.47472055
Planted/Cultivated	SOC	DEM	-0.31880666
Planted/Cultivated	MAP	DEM	-0.88745536
Planted/Cultivated	MAT	DEM	-0.83305530
Planted/Cultivated	NDVI	DEM	-0.52616146
Planted/Cultivated	SOC	MAP	0.42971838
Planted/Cultivated	DEM	MAP	-0.88745536
Planted/Cultivated	MAT	MAP	0.62195665
Planted/Cultivated	NDVI	MAP	0.71441683
Planted/Cultivated	SOC	MAT	0.06276231
Planted/Cultivated	DEM	MAT	-0.83305530
Planted/Cultivated	MAP	MAT	0.62195665
Planted/Cultivated	NDVI	MAT	0.17053661
Planted/Cultivated	SOC	NDVI	0.47472055
Planted/Cultivated	DEM	NDVI	-0.52616146
Planted/Cultivated	MAP	NDVI	0.71441683
Planted/Cultivated	MAT	NDVI	0.17053661
Shrubland	DEM	SOC	0.39720120
Shrubland	MAP	SOC	0.44532165
Shrubland	MAT	SOC	-0.45936785
Shrubland	NDVI	SOC	0.64602818
Shrubland	SOC	DEM	0.39720120
Shrubland	MAP	DEM	0.29834522
Shrubland	MAT	DEM	-0.81913433
Shrubland	NDVI	DEM	0.48180780
Shrubland	SOC	MAP	0.44532165
Shrubland	DEM	MAP	0.29834522
Shrubland	MAT	MAP	-0.28844730
Shrubland	NDVI	MAP	0.70639285
Shrubland	SOC	MAT	-0.45936785
Shrubland	DEM	MAT	-0.81913433
Shrubland	MAP	MAT	-0.28844730
Shrubland	NDVI	MAT	-0.57829256
Shrubland	SOC	NDVI	0.64602818
Shrubland	DEM	NDVI	0.48180780
Shrubland	MAP	NDVI	0.70639285
Shrubland	MAT	NDVI	-0.57829256

Visualization of the Correlation Matrix

plot.correlate() visualizes the correlation matrix.

Code

mf |>  
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |> 
  # then diagnose them
  dlookr::correlate() |>  
  plot()

The plot.correlate() function also supports the group_by() function syntax in the dplyr package.

Code

mf |>  
  group_by(NLCD) %>%
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |> 
# then diagnose them
  dlookr::correlate() |>  
  plot()

EDA based on target variable

To perform EDA based on the target variable, you must create a target_by class object. target_by() creates a target_by class with an object inheriting data.frame.. target_by() is similar to group_by() in dplyr which creates grouped_df.

EDA when target variable is categorical variable

The following is an example of specifying VAR as the target variable in as_df data.frame.:

Code

target.cat<-target_by(as_df, VAR)

Cases where predictors are numeric variable:

relate() shows the relationship between the target variable and the predictor. The following example shows the relationship between GAs and the target variable VAR. The predictor GAs is a numeric variable. In this case, the descriptive statistics are shown for each level of the target variable.

Code

# If the variable of interest is a numerical variable
cat_num <- relate(target.cat, GAs)
cat_num

# A tibble: 8 × 27
  described_variables VAR           n    na  mean    sd se_mean   IQR skewness
  <chr>               <chr>     <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>
1 GAs                 BR01         20     0  1.41 0.485  0.108  0.671   0.387 
2 GAs                 BR06         20     0  1.27 0.484  0.108  0.603   0.619 
3 GAs                 BR28         20     0  1.35 0.516  0.115  0.737   0.278 
4 GAs                 BR35         20     0  1.37 0.454  0.101  0.763   0.194 
5 GAs                 BR36         20     0  1.37 0.526  0.118  0.860   0.476 
6 GAs                 Jefferson    20     0  1.32 0.478  0.107  0.620   0.624 
7 GAs                 Kaybonnet    20     0  1.43 0.546  0.122  0.895   0.0240
8 GAs                 total       140     0  1.36 0.491  0.0415 0.798   0.345 
# ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
#   p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
#   p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
#   p99 <dbl>, p100 <dbl>

plot() visualizes the relate class object created by relate() as the relationship between the target and predictor variables. The relationship between GAs and VAR visualized by a density plot.

Code

plot(cat_num)

Cases where predictors are categorical variable:

The following example shows the relationship between TREAT and the target variable VAR. The predictor variable TREAT is categorical. This case illustrates the contingency table of two variables. The summary() function performs an independence test on the contingency table.

Code

cat_cat <- relate(target.cat, TREAT)
summary(cat_cat)

Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
Number of cases in table: 140 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 0, df = 6, p-value = 1

plot() visualizes the relationship between the target variable and the predictor. A mosaics plot represents the relationship between VAR and TREAT.

Code

plot(cat_cat)

EDA when target variable is numerical variable

When the numeric variable GY is the target variable, we examine the relationship between the target variable and the predictor.

Code

target.num <- target_by(as_df, GY)

Cases where predictors are numeric variable:

The following example shows the relationship between GAs and the target variable GY. The predictor variable GAs is numeric. In this case, it shows the result of a simple linear model of the target ~ predictor formula. The ‘summary()’ function expresses the details of the model.

Code

num_num <- relate(target.num, GAs)
summary(num_num)


Call:
lm(formula = formula_str, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.308  -5.749  -0.707   6.801  34.242 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.983      2.697  18.901  < 2e-16 ***
GAs          -16.410      1.866  -8.792 5.23e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.8 on 138 degrees of freedom
Multiple R-squared:  0.3591,    Adjusted R-squared:  0.3544 
F-statistic: 77.31 on 1 and 138 DF,  p-value: 5.235e-15

plot() visualizes the relationship between the target and predictor variables. The relationship between GY and GAs is visualized with a scatter plot. The figure on the left shows the scatter plot of GY and GAs and the confidence interval of the regression line and regression line. The figure on the right shows the relationship between the original data and the predicted values of the linear model as a scatter plot. If there is a linear relationship between the two variables, the scatter plot of the observations converges on the red diagonal line.

Code

plot(num_num)

Cases where predictors are categorical variables:

The following example shows the relationship between VAR and the target The predictor VAR is a categorical variable and displays the result of a one-way ANOVA of the target ~ predictor relationship. The results are expressed in terms of ANOVA. The summary() function shows the regression coefficients for each level of the predictor. In other words, it shows detailed information about the simple regression analysis of the target ~ predictor relationship.

Code

#num_cat <- relate(target.num, VAR)
#summary(num_cat)

Data Transformation

dlookr imputes missing values and outliers and resolves skewed data. It also provides the ability to bin continuous variables as categorical variables.

Here is a list of the data conversion functions and functions provided by dlookr:

find_na() finds a variable that contains the missing values variable, and imputate_na() imputes the missing values.
find_outliers() finds a variable that contains the outliers, and imputate_outlier() imputes the outlier.
summary.imputation() and plot.imputation() provide information and visualization of the imputed variables.
find_skewness() finds the variables of the skewed data, and transform() performs the resolving of the skewed data.
transform() also performs standardization of numeric variables.
summary.transform() and plot.transform() provide information and visualization of transformed variables.
binning() and binning_by() convert binational data into categorical data.
print.bins() and summary.bins() show and summarize the binning results.
plot.bins() and plot.optimal_bins() provide visualization of the binning result.
transformation_report() performs the data transform and reports the result.

Imputation of missing values

imputes the missing value with `imputate_na()`

imputate_na() imputes the missing value contained in the variable. The predictor with missing values supports both numeric and categorical variables and supports the following method.

predictor is numerical variable
- “mean”: arithmetic mean
- “median”: median
- “mode”: mode
- “knn”: K-nearest neighbors
  - target variable must be specified
- “rpart”: Recursive Partitioning and Regression Trees
  - target variable must be specified
- “mice”: Multivariate Imputation by Chained Equations
  - target variable must be specified
  - random seed must be set
predictor is categorical variable
- “mode”: mode
- “rpart”: Recursive Partitioning and Regression Trees
  - target variable must be specified
- “mice”: Multivariate Imputation by Chained Equations
  - target variable must be specified
  - random seed must be set

In the following example, imputate_na() imputes the missing value of SOC, a numeric variable of mf dataframe, using the “rpart” method. summary() summarizes missing value imputation information, and plot() visualizes missing information.

Code

soc <- imputate_na(mf, SOC, STATE, method = "rpart")
summary(soc)

* Impute missing values based on Recursive Partitioning and Regression Trees
 - method : rpart

* Information of Imputation (before vs after)
                    Original    Imputation 
described_variables "value"     "value"    
n                   "467"       "471"      
na                  "4"         "0"        
mean                "6.350762"  "6.327249" 
sd                  "5.045409"  "5.031445" 
se_mean             "0.2334737" "0.2318367"
IQR                 "5.9440"    "5.8595"   
skewness            "1.464728"  "1.476371" 
kurtosis            "2.427192"  "2.469712" 
p00                 "0.408"     "0.408"    
p01                 "0.49094"   "0.49130"  
p05                 "0.9637"    "0.9735"   
p10                 "1.2902"    "1.2930"   
p20                 "2.3294"    "2.3350"   
p25                 "2.7695"    "2.7775"   
p30                 "3.1114"    "3.0940"   
p40                 "3.9906"    "3.9740"   
p50                 "4.971"     "4.953"    
p60                 "6.1266"    "6.0980"   
p70                 "7.503"     "7.438"    
p75                 "8.7135"    "8.6370"   
p80                 "10.0522"   " 9.9570"  
p90                 "13.383"    "13.351"   
p95                 "16.5219"   "16.5185"  
p99                 "22.12476"  "22.06820" 
p100                "30.473"    "30.473"

Code

# viz of imputation
plot(soc)

Standardization and Resolving Skewness

Introduction to the use of `transform()`

transform() performs data transformation. Only numeric variables are supported, and the following methods are provided.

Standardization
- “zscore”: z-score transformation. (x - mu) / sigma
- “minmax”: minmax transformation. (x - min) / (max - min)
Resolving Skewness
- “log”: log transformation. log(x)
- “log+1”: log transformation. log(x + 1). Used for values that contain 0.
- “sqrt”: square root transformation.
- “1/x”: 1 / x transformation
- “x^2”: x square transformation
- “x^3”: x^3 square transformation

Standardization with transform()

Use the methods zscore and minmax to perform standardization.

Code

mf |>  
  dplyr::mutate(soc_minmax = transform(SOC, method = "minmax"),
  ndvi_minmax = transform(NDVI, method = "minmax")) |>  
  select(soc_minmax, ndvi_minmax) |> 
  boxplot()

Resolving Skewness data with `transform()`

find_skewness() searches for variables with skewed data. This function finds data skewed by search conditions and calculates skewness.

Code

dlookr::find_skewness(mf)

[1]  6  8 10 11 12 13 14

Code

# compute the skewness
find_skewness(mf, value = TRUE)

       ID      FIPS  STATE_ID Longitude  Latitude       SOC       DEM    Aspect 
   -0.008     0.326     0.327     0.750    -0.112     1.460    -0.023     0.533 
    Slope       TPI   KFactor       MAP       MAT      NDVI  SiltClay 
    1.627    -1.084    -0.540     1.079    -0.274     0.233     0.110

skewness of SOC is 1.46. This means that the distribution of data is inclined to the left. So, for normal distribution, use transform() to convert to the “log” method as follows. summary() summarizes transformation information, and plot() visualizes transformation information.

Code

soc_log = transform(mf$SOC, method = "log")
summary(soc_log)

* Resolving Skewness with log

* Information of Transformation (before vs after)
            Original Transformation
n        467.0000000   467.00000000
na         4.0000000     4.00000000
mean       6.3507623     1.51955339
sd         5.0454091     0.87144092
se_mean    0.2334737     0.04032548
IQR        5.9440000     1.14617777
skewness   1.4647284    -0.45555425
kurtosis   2.4271923    -0.16754924
p00        0.4080000    -0.89648810
p01        0.4909400    -0.71147121
p05        0.9637000    -0.03724314
p10        1.2902000     0.25479371
p20        2.3294000     0.84561000
p25        2.7695000     1.01866665
p30        3.1114000     1.13507226
p40        3.9906000     1.38393888
p50        4.9710000     1.60362103
p60        6.1266000     1.81263983
p70        7.5030000     2.01530262
p75        8.7135000     2.16484442
p80       10.0522000     2.30778025
p90       13.3830000     2.59398096
p95       16.5219000     2.80468666
p99       22.1247600     3.09624501
p100      30.4730000     3.41684105

Code

# viz of transformation
plot(soc_log)

Binning

Binning of individual variables using `binning()`

binning() transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported.

“quantile”: categorize using quantile to include the same frequencies
“equal”: categorize to have equal length segments
“pretty”: categorized into moderately good segments
“kmeans”: categorization using K-means clustering
“bclust”: categorization using bagged clustering technique

Here are some examples of how to bin SOC using binning().:

Code

# Binning the SOC variable. default type argument is "quantile"
bin <- binning(mf$SOC)
# Print bins class object
bin

binned type: quantile
number of bins: 10
x
   [0.408,1.286467]   (1.286467,2.3232]   (2.3232,3.109267] (3.109267,3.988067] 
                 47                  46                  47                  47 
   (3.988067,4.971]      (4.971,6.1274]      (6.1274,7.507]      (7.507,10.076] 
                 47                  46                  47                  48 
  (10.076,13.42567]   (13.42567,30.473]                <NA> 
                 45                  47                   4

Code

# Summarize bins class object
summary(bin)

                levels freq        rate
1     [0.408,1.286467]   47 0.099787686
2    (1.286467,2.3232]   46 0.097664544
3    (2.3232,3.109267]   47 0.099787686
4  (3.109267,3.988067]   47 0.099787686
5     (3.988067,4.971]   47 0.099787686
6       (4.971,6.1274]   46 0.097664544
7       (6.1274,7.507]   47 0.099787686
8       (7.507,10.076]   48 0.101910828
9    (10.076,13.42567]   45 0.095541401
10   (13.42567,30.473]   47 0.099787686
11                <NA>    4 0.008492569

Code

plot(bin)

Summary and Conclusions

This tutorial explores exploratory data analysis (EDA) using the R-package ‘dlookr.’ You’ll learn about the package’s features, such as missing data analysis and outlier detection, and functions that facilitate the generation of insightful reports and visualizations. Additionally, you’ll learn how ‘dlookr’ contributes to a comprehensive understanding of data quality. Remember to consider exploring its additional functionalities, practice with diverse datasets, and adapt the techniques to the unique characteristics of your data.

Data Exploration with {dlookr} in R

Introduction

Usage

Installation

Check and Install Required Packages

Verify Installation

Load Packages

Check Loaded Packages

Data

Data Quality Diagnosis

General diagnosis of all variables

Diagnosis of Numeric Variables

Diagnosis of Categorical Variables

Diagnosing Outliers

Visualization of Outliers

Daignosis Normality

Normality Test

Visualization of Normality

Descriptive Statistics

Correlation Analysis

Visualization of the Correlation Matrix

EDA based on target variable

EDA when target variable is categorical variable

Cases where predictors are numeric variable:

Cases where predictors are categorical variable:

EDA when target variable is numerical variable

Cases where predictors are numeric variable:

Cases where predictors are categorical variables:

Data Transformation

Imputation of missing values

imputes the missing value with imputate_na()

Standardization and Resolving Skewness

Introduction to the use of transform()

Standardization with transform()

Resolving Skewness data with transform()

Binning

Binning of individual variables using binning()

Summary and Conclusions

References

imputes the missing value with `imputate_na()`

Introduction to the use of `transform()`

Resolving Skewness data with `transform()`

Binning of individual variables using `binning()`