This tutorial will provide a comprehensive understanding of data exploration and visualization in R. It will cover the fundamental concepts, techniques, and tools used in this field. You will learn how to import data into R, clean and preprocess it, and create visualizations that allow you to explore and understand your data meaningfully. Additionally, you will discover how to manipulate and transform data, conduct statistical analyses, and interpret your results. Whether you are a beginner or an experienced data analyst, this tutorial will equip you with the skills and knowledge necessary to explore and visualize your data using R effectively.
Check and Install Required Packages
In this exercise we will use the following R packages (a short script to check, install, and load them follows the list):
tidyverse: The tidyverse is a collection of R packages designed for data science.
data.table: data.table provides a high-performance version of base R's data.frame.
plyr: plyr is a set of tools for a common set of problems.
flextable: The flextable package provides a framework for easily creating tables for reporting and publications.
gtsummary: Provides an elegant and flexible way to create publication-ready analytical and summary tables.
report: report's primary goal is to bridge the gap between R's output and the formatted results contained in your manuscript.
nortest: Five omnibus tests for testing the composite hypothesis of normality.
outliers: A collection of tests commonly used for identifying outliers.
moments: Functions to calculate moments, Pearson's kurtosis, Geary's kurtosis and skewness, plus tests related to them (Anscombe-Glynn, D'Agostino, Bonett-Seier).
ggside: The package ggside was designed to enable users to add metadata to their ggplots with ease.
RColorBrewer: Provides color schemes for maps (and other graphics).
patchwork: The goal of patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic.
gridExtra: Provides user-level functions to work with grid graphics, notably to arrange multiple grid-based plots (such as ggplots) on a page.
rstatix: Provides a simple and intuitive pipe-friendly framework, coherent with the tidyverse design philosophy, for performing basic statistical tests, including t-test, Wilcoxon test, ANOVA, Kruskal-Wallis and correlation analyses.
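The snippet below is a minimal sketch for checking, installing, and loading these packages in one step (the package-name vector simply mirrors the list above):
Code
# Required packages (as listed above)
packages <- c("tidyverse", "data.table", "plyr", "flextable", "gtsummary",
              "report", "nortest", "outliers", "moments", "ggside",
              "RColorBrewer", "patchwork", "gridExtra", "rstatix")
# Install any packages that are not already installed
new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
# Load all packages
invisible(lapply(packages, library, character.only = TRUE))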
In R, the str() function is a useful tool for examining the structure of an object. It provides a compact and informative description of the object’s internal structure, including its type, length, dimensions, and contents.
Code
str(mf.na)
spc_tbl_ [471 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:471] 1 2 3 4 5 6 7 8 9 10 ...
$ FIPS : num [1:471] 56041 56023 56039 56039 56029 ...
$ STATE_ID : num [1:471] 56 56 56 56 56 56 56 56 56 56 ...
$ STATE : chr [1:471] "Wyoming" "Wyoming" "Wyoming" "Wyoming" ...
$ COUNTY : chr [1:471] "Uinta County" "Lincoln County" "Teton County" "Teton County" ...
$ Longitude: num [1:471] -111 -111 -111 -111 -111 ...
$ Latitude : num [1:471] 41.1 42.9 44.5 44.4 44.8 ...
$ SOC : num [1:471] 15.8 15.9 18.1 10.7 10.5 ...
$ DEM : num [1:471] 2229 1889 2423 2484 2396 ...
$ Aspect : num [1:471] 159 157 169 198 201 ...
$ Slope : num [1:471] 5.67 8.91 4.77 7.12 7.95 ...
$ TPI : num [1:471] -0.0857 4.5591 2.6059 5.1469 3.7557 ...
$ KFactor : num [1:471] 0.32 0.261 0.216 0.182 0.126 ...
$ MAP : num [1:471] 468 536 860 869 803 ...
$ MAT : num [1:471] 4.595 3.86 0.886 0.471 0.759 ...
$ NDVI : num [1:471] 0.414 0.694 0.547 0.619 0.584 ...
$ SiltClay : num [1:471] 64.8 72 57.2 55 51.2 ...
$ NLCD : chr [1:471] "Shrubland" "Shrubland" "Forest" "Forest" ...
$ FRG : chr [1:471] "Fire Regime Group IV" "Fire Regime Group IV" "Fire Regime Group V" "Fire Regime Group V" ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. FIPS = col_double(),
.. STATE_ID = col_double(),
.. STATE = col_character(),
.. COUNTY = col_character(),
.. Longitude = col_double(),
.. Latitude = col_double(),
.. SOC = col_double(),
.. DEM = col_double(),
.. Aspect = col_double(),
.. Slope = col_double(),
.. TPI = col_double(),
.. KFactor = col_double(),
.. MAP = col_double(),
.. MAT = col_double(),
.. NDVI = col_double(),
.. SiltClay = col_double(),
.. NLCD = col_character(),
.. FRG = col_character()
.. )
- attr(*, "problems")=<externalptr>
Missing values
Missing values in a dataset can significantly impact the accuracy and reliability of data analysis. Therefore, dealing with missing values is a crucial and challenging task in data analysis.
In R, the is.na() function is used to test for missing values (NA) in an object. It returns a logical vector indicating which elements of the object are missing.
Code
sum(is.na(mf.na))
[1] 4
We can also return the column names containing missing observations.
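For example, base R's colSums() combined with is.na() lists the columns that contain missing values (a minimal sketch using the mf.na data frame loaded above):
Code
# Number of missing values per column
colSums(is.na(mf.na))
# Names of the columns that contain at least one missing value
names(mf.na)[colSums(is.na(mf.na)) > 0]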
Several methods are available to handle missing values in data, including imputation methods such as mean imputation, median imputation, mode imputation, and regression imputation. Another approach is removing the rows or columns containing missing values, known as deletion methods. The choice of method depends on the specific dataset and the analysis goals. It’s essential to consider each approach carefully, as each method has advantages and disadvantages. Furthermore, the method chosen should be able to handle missing values without introducing bias or losing essential information. Therefore, assessing the quality of imputed values and the impact of missing values on the analysis outcome is essential.
Here are some common strategies for handling missing data:
Delete the rows or columns with missing data: This is the simplest approach, but it can result in the loss of valuable information if there are many missing values.
Imputation: Imputation involves filling in the missing values with estimated values based on other data points. Standard imputation techniques include mean imputation, median imputation, and regression imputation.
Multiple imputation: This involves creating several imputed datasets and analyzing each separately, then combining the results to obtain a final estimate.
Modeling: This involves building a predictive model that can estimate the missing values based on other variables in the dataset.
Domain-specific knowledge: Sometimes, domain-specific knowledge can guide how missing values should be handled. In this exercise, we only show how to remove missing values from a data frame. Later we will show you how to impute these missing values with machine learning.
Remove missing values
We can remove missing values from a data frame using the drop_na() function from the tidyr package. This function removes any rows that contain missing values (NA values) in any column. A shorthand alternative is to simply use na.omit() to omit all rows containing missing values.
Here’s an example to remove missing values:
Code
mf <- mf.na |> tidyr::drop_na()
sum(is.na(mf))
[1] 0
Impute missing values
An easy way to impute missing values is to replace them with the mean or median of the observed values, as sketched below.
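A minimal sketch of mean (or median) imputation with dplyr, using the SOC column purely for illustration (any column with missing values could be substituted):
Code
library(dplyr)
# Replace missing SOC values with the mean of the observed values;
# use median(SOC, na.rm = TRUE) instead for median imputation
mf.imputed <- mf.na |>
  mutate(SOC = ifelse(is.na(SOC), mean(SOC, na.rm = TRUE), SOC))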
Visual inspection of data distribution is an essential step in exploratory data analysis (EDA). It involves creating visualizations to understand the underlying patterns, central tendencies, and spread of the data. Several common techniques for visually inspecting data distribution include:
Histogram
Kernel density Plots
Quantile-Quantile Plots (qq-plot)
Histograms
A histogram is a graphical representation of the distribution of numerical data. They are commonly used in statistics to show the frequency distribution of a data set for identifying patterns, outliers, and skewness in the data. Histograms are also helpful in visualizing the shape of a distribution, such as whether it is symmetric or skewed, and whether it has one or more peaks.
In R, the hist() function is used to create histograms of numerical data. It takes one or more numeric vectors as input and displays a graphical representation of the distribution of the data in the form of a histogram.
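For example, a minimal sketch of a base R histogram of soil SOC:
Code
# Histogram of soil SOC
hist(mf$SOC,
     # plot title
     main = "Histogram of Soil SOC",
     # x-axis title
     xlab = "Soil SOC (mg/g)",
     # y-axis title
     ylab = "Frequency")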
We can also create histograms using the ggplot2 package, which is a powerful data visualization tool based on The Grammar of Graphics.
ggplot2 allows you to create complex and customizable graphics using a layered approach, which involves building a plot by adding different layers of visual elements. It is part of the tidyverse collection of packages.
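A corresponding ggplot2 sketch with geom_histogram() (the bin width of 1 is an arbitrary choice for illustration):
Code
ggplot(mf, aes(SOC)) +
  geom_histogram(binwidth = 1, fill = "gray", colour = "black") +
  # axis titles
  xlab("SOC (mg/g)") +
  ylab("Frequency") +
  # plot title
  ggtitle("Histogram of Soil SOC") +
  theme(plot.title = element_text(hjust = 0.5))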
Kernel Density Plots
A kernel density plot (or kernel density estimation plot) is a non-parametric way of visualizing the probability density function of a variable. It represents a smoothed version of a histogram, where the density estimate is obtained by convolving the data with a kernel function. The kernel density plot is a helpful tool for exploring data distribution, identifying outliers, and comparing distributions.
In practice, kernel density estimation involves selecting a smoothing parameter (known as the bandwidth) that determines the width of the kernel function. The choice of bandwidth can significantly impact the resulting density estimate, and various methods have been developed to select the optimal bandwidth for a given dataset.
To create a density plot in R, you can use the density() function to estimate the density of a variable, and then plot it using the plot() function. If you want to use ggplot2, you can create a density plot with the geom_density() function, like this:
# estimate the density
p <- density(mf$SOC, na.rm = TRUE)
# plot density
plot(p,
     # plot title
     main = "Kernel Density of Soil SOC",
     # x-axis title
     xlab = "Soil SOC (mg/g)",
     # y-axis title
     ylab = "Density")
Code
ggplot(mf, aes(SOC)) +
  geom_density() +
  # x-axis title
  xlab("SOC (mg/g)") +
  # y-axis title
  ylab("Density") +
  # plot title
  ggtitle("Kernel Density of Soil SOC") +
  theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5))
QQ-Plot
A Q-Q plot (quantile-quantile plot) is a graphical method used to compare two probability distributions by plotting their quantiles against each other. The two distributions being compared are first sorted in ascending order to create a Q-Q plot. Then, the corresponding quantile in the other distribution is calculated for each data point in one distribution. The resulting pairs of quantiles are plotted on a scatter plot, with one distribution’s quantiles on the x-axis and the other on the y-axis. If the two distributions being compared are identical, the points will fall along a straight line.
Note
Probability distribution is a function that describes the likelihood of obtaining different outcomes in a random event. It gives a list of all possible outcomes and their corresponding probabilities. The probabilities assigned to each outcome must add up to 1, as one of the possible outcomes will always occur.
In R, you can create a Q-Q plot using the two functions:
qqnorm() is a generic function the default method of which produces a normal QQ plot of the values in y.
qqline() adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
We can create a QQ plot using the ggplot() and stat_qq() functions and then use stat_qq_line() function to add a line indicating the expected values under the assumption of a normal distribution.
# draw normal QQ plot
qqnorm(mf$SOC,
       # plot title
       main = "Q-Q plot of Soil SOC from a Normal Distribution",
       # x-axis title
       xlab = "Theoretical Quantiles",
       # y-axis title
       ylab = "Sample Quantiles")
# Add reference or theoretical line
qqline(mf$SOC)
Code
ggplot(mf, aes(sample = SOC)) +
  stat_qq() +
  stat_qq_line() +
  # x-axis title
  xlab("Theoretical Quantiles") +
  # y-axis title
  ylab("Sample Quantiles") +
  # plot title
  ggtitle("Q-Q plot of Soil SOC from a Normal Distribution") +
  theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5),
    # x-axis title font size
    axis.title.x = element_text(size = 14),
    # y-axis title font size
    axis.title.y = element_text(size = 14))
Normality Test
In statistics and data analysis, normality tests are commonly used to determine whether a given data sample is derived from a normally distributed population. A normal distribution is a bell-shaped curve that represents a set of data in which the majority of the observations fall near the mean, with fewer observations at the tails of the distribution. Normality tests provide us with evidence for or against the assumption that the data is normally distributed, but they do not prove normality. Therefore, it is crucial to visually inspect the data and consider the context in which it was collected before assuming normality, even if the data passes a normality test. By doing so, we can ensure that we have a comprehensive understanding of the data and avoid any misleading conclusions.
Statistical tests for normality include:
Shapiro-Wilk test: The Shapiro-Wilk test is a commonly used test to check for normality. The test calculates a W statistic. If the p-value associated with the test is greater than a chosen significance level (usually 0.05), it suggests that the data is normally distributed.
Anderson-Darling test: The Anderson-Darling test is another test that can be used to check for normality. The test calculates an A2 statistic, and if the p-value associated with the test is greater than a chosen significance level, it suggests that the data is normally distributed.
Kolmogorov-Smirnov test: The Kolmogorov-Smirnov (KS) test is a statistical test used to determine if two datasets have the same distribution. The test compares the empirical cumulative distribution functions (ECDFs) of the two datasets and calculates the maximum vertical distance between them. The KS test statistic is this maximum distance, and the p-value is calculated based on the distribution of the test statistic.
The shapiro.test() function can be used to perform the Shapiro-Wilk test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value.
Code
shapiro.test(mf$SOC)
Shapiro-Wilk normality test
data: mf$SOC
W = 0.87237, p-value < 2.2e-16
Since the p-value is lower than 0.05, we can conclude that the data is not normally distributed.
The ad.test() function from the nortest package can be used to perform the Anderson-Darling test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value. It is important to note that the Anderson-Darling test is also sensitive to sample size, and may not be accurate for small sample sizes. Additionally, the test may not be appropriate for non-normal distributions with heavy tails or skewness. Therefore, it is recommended to use the test in conjunction with visual inspection of the data to determine if it follows a normal distribution.
install.packages("nortest")
Code
#library(nortest)
nortest::ad.test(mf$SOC)
Anderson-Darling normality test
data: mf$SOC
A = 16.22, p-value < 2.2e-16
Since the p-value is lower than 0.05, we can conclude that the data is not normally distributed.
The ks.test() function can be used to perform the Kolmogorov-Smirnov test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value. pnorm specifies the cumulative distribution function to use as the reference distribution for the test.
Code
ks.test(mf$SOC, "pnorm")
Asymptotic one-sample Kolmogorov-Smirnov test
data: mf$SOC
D = 0.81199, p-value < 2.2e-16
alternative hypothesis: two-sided
The test result shows that the maximum distance between the empirical CDF of the data and the standard normal CDF is about 0.81, and the p-value is very small, so we can reject the null hypothesis that the sample is drawn from a standard normal distribution.
Skewness and Kurtosis
Skewness is a measure of the asymmetry of a probability distribution. In other words, it measures how much a distribution deviates from symmetry or normal distribution. A positive skewness indicates that the distribution has a longer right tail, while a negative skewness indicates that the distribution has a longer left tail. A skewness of zero indicates a perfectly symmetric distribution.
Kurtosis measures how much a distribution deviates from a normal distribution in terms of the concentration of scores around the mean and in the tails. A distribution with higher kurtosis than the normal distribution is more peaked with heavier tails, while lower kurtosis indicates a flatter distribution with lighter tails. A normal distribution has a kurtosis of 3 (an excess kurtosis of zero), which is the scale reported by the kurtosis() function in the moments package.
We will use the skewness() and kurtosis() functions from the moments package in R:
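A minimal sketch (the reported values depend on the data):
Code
library(moments)
# Skewness of soil SOC
moments::skewness(mf$SOC, na.rm = TRUE)
# Kurtosis of soil SOC (a normal distribution has a kurtosis of about 3)
moments::kurtosis(mf$SOC, na.rm = TRUE)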
The high positive skewness value indicates that the distribution has a long right tail, which means that some sites have unusually high soil SOC values.
Likewise, the high kurtosis value indicates that soil SOC is not normally distributed.
Data Transformation for Normality
Data transformation is a statistical technique that involves modifying the structure of a dataset to improve its usefulness for analysis. This method involves adjusting the scale or shape of the data distribution to make it more appropriate for statistical analysis. The goal of data transformation is often to create a more normal distribution of the data, which can make it easier to apply many common statistical models and methods. By transforming data, researchers can gain a better understanding of their data and make more accurate inferences about the underlying population.
Here are some common data transformations used to achieve normality:
Logarithmic transformation: If the data are skewed to the right, taking the data’s logarithm can help reduce the skewness and make the distribution more normal.
\(log10(x)\) for positively skewed data,
\(log10(max(x+1) - x)\) for negatively skewed data
Square root transformation: Similar to logarithmic transformation, taking the square root of the data can help reduce the distribution’s skewness.
\(sqrt(x)\) for positively skewed data,
\(sqrt(max(x+1) - x)\) for negatively skewed data
Inverse transformation: If the data are skewed to the left, taking the inverse of the data (1/x) can help to reduce the skewness and make the distribution more normal.
Box-Cox transformation: The Box-Cox transformation is a method of data transformation that is used to stabilize the variance of a dataset and make it more normally distributed. The Box-Cox transformation involves applying a power transformation to the data, which is determined by a parameter λ.
The Box-Cox transformation is defined by the power transformation:
\[ y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases} \]
Here, \(y\) represents the original data, and \(\lambda\) is a parameter that controls the transformation. The Box-Cox transformation aims to find the value of \(\lambda\) that maximizes the normality and homoscedasticity of the transformed data.
When choosing a data transformation method, it is important to consider the type of data, the purpose of the analysis, and the assumptions of the statistical method being used. It is also important to check the normality of the transformed data using visual inspection (histogram) and statistical tests, such as the Shapiro-Wilk test or the Anderson-Darling test.
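The sketch below applies the logarithmic and square-root transformations to soil SOC and then runs the Shapiro-Wilk test on the square-root-transformed values; the log_SOC and sqrt_SOC columns created here also appear in the summary output later in this tutorial:
Code
# Logarithmic transformation
mf$log_SOC <- log10(mf$SOC)
# Square root transformation
mf$sqrt_SOC <- sqrt(mf$SOC)
# histogram of the square-root-transformed values
hist(mf$sqrt_SOC,
     main = "Histogram of Square Root of Soil SOC",
     xlab = "Square Root of Soil SOC",
     ylab = "Frequency")
# Shapiro-Wilk test
shapiro.test(mf$sqrt_SOC)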
Shapiro-Wilk normality test
data: mf$sqrt_SOC
W = 0.97494, p-value = 3.505e-07
Code
# Inverse transformation
mf$inv_SOC <- (1 / mf$SOC)
# histogram
hist(mf$inv_SOC,
     # plot title
     main = "Histogram of Inverse of Soil SOC",
     # x-axis title
     xlab = "Inverse of Soil SOC",
     # y-axis title
     ylab = "Frequency")
Code
# Shapiro-Wilk test
shapiro.test(mf$inv_SOC)
Shapiro-Wilk normality test
data: mf$inv_SOC
W = 0.66851, p-value < 2.2e-16
First, we have to calculate the appropriate transformation parameter using the powerTransform() function of the car package and then use this parameter to transform the data with the bcPower() function.
install.packages("car")
Code
#library(car)
# Get appropriate power - lambda
power <- car::powerTransform(mf$SOC)
power$lambda
mf$SOC
0.2072527
Code
# power transformation
mf$bc_SOC <- car::bcPower(mf$SOC, power$lambda)
# histogram
hist(mf$bc_SOC,
     # plot title
     main = "Histogram of BoxCox Soil SOC",
     # x-axis title
     xlab = "BoxCox of Soil SOC",
     # y-axis title
     ylab = "Frequency")
Code
# Shapiro-Wilk test
shapiro.test(mf$bc_SOC)
Shapiro-Wilk normality test
data: mf$bc_SOC
W = 0.99393, p-value = 0.05912
Outliers
Outliers are data points that have values significantly different from the other data points in a dataset. They can be caused by various factors, such as measurement errors, experimental errors, or rare extreme values. Outliers can have a significant impact on the analysis of a dataset, as they can skew the results and lead to inaccurate conclusions.
There are various methods for detecting outliers, such as statistical methods, visual analysis, or machine learning algorithms. However, outlier detection is not always straightforward, and different methods may identify different outliers. Therefore, it’s crucial to understand the nature of the data and the context in which it was collected to choose the most appropriate method for outlier detection. Moreover, it’s essential to determine whether the outliers are genuine data points or errors. In some cases, outliers may be valid and provide valuable insights into the underlying patterns or phenomena. In other cases, outliers may be artifacts of the data collection process or measurement errors and need to be removed or corrected.
In summary, outlier detection is an important step in data analysis, but it requires careful consideration and judgement to ensure that the results are reliable and accurate.
In this exercise we will discuss some common methods of detecting outliers, but we will not remove them. Here are some common methods for identifying outliers:
Visual inspection with Boxplot
Plotting the data and visually inspecting it can often reveal outliers. A box plot, also known as a box-and-whisker plot, is a useful visualization for this purpose. It is a graphical representation of a dataset that displays its median, quartiles, and any potential outliers. Boxplots help identify a dataset's distribution, spread, and potential outliers, and they are often used in exploratory data analysis to gain insight into the characteristics of the data. Here's how to read a boxplot:
• The box represents the interquartile range (IQR), which is the range between the 25th and 75th percentiles of the data. The length of the box represents the range of the middle 50% of the data.
• The line inside the box represents the median of the data, which is the middle value when the data is sorted.
• The whiskers extend from the box and represent the range of the data outside the IQR. They can be calculated in different ways depending on the method used to draw the boxplot. Some common methods include:
Tukey’s method: The whiskers extend to the most extreme data point within 1.5 times the IQR of the box.
Min/max whiskers: The whiskers extend to the minimum and maximum values in the data that are not considered outliers.
5-number summary: The whiskers extend to the minimum and maximum values in the data, excluding outliers.
• Outliers are data points that fall outside the whiskers of the boxplot. They are represented as individual points on the plot.
ggplot2 can also create boxplots with the ggplot() and geom_boxplot() functions. You can customize the plot further by modifying labels, colors, and themes, and by adding more layers to the plot using ggplot2 functions and additional aesthetic mappings (aes()).
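A minimal boxplot sketch of soil SOC by land-cover class:
Code
ggplot(mf, aes(x = NLCD, y = SOC)) +
  geom_boxplot(fill = "gray") +
  # y-axis title, x-axis title left blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # plot title
  ggtitle("Boxplot of Soil SOC by NLCD Class") +
  coord_flip() +
  theme(plot.title = element_text(hjust = 0.5))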
Now check how many observations are considered outliers:
Code
length(boxplot.stats(mf$SOC)$out)
[1] 20
Box-Jitter plot
A box-jitter plot is a type of data visualization that combines elements of both a boxplot and a jitter plot. It is useful for displaying both the distribution and the individual data points of a dataset.
The jitter component of the plot is used to display the individual data points. Jittering refers to adding a small amount of random noise to the position of each data point along the x-axis, to prevent overlapping and to give a better sense of the density of the data.
Overall, the box-jitter plot provides a useful summary of the distribution of a dataset, while also allowing viewers to see the individual data points and any potential outliers. It is particularly useful for displaying data with a large number of observations or for comparing multiple distributions side by side.
We will use the ggplot2 package to create box-jitter plots to explore the variability of soil SOC among the land-cover classes. We will create a custom color palette using the colorRampPalette() function from the RColorBrewer package.
install.packages("RColorBrewer")
Code
# Create a custom color palette
library(RColorBrewer)
rgb.palette <- colorRampPalette(c("red", "yellow", "green", "blue"), space = "rgb")
# Create plot
ggplot(mf, aes(y = SOC, x = NLCD)) +
  geom_point(aes(colour = MAT), size = I(1.7),
             position = position_jitter(width = 0.05, height = 0.05)) +
  geom_boxplot(fill = NA, outlier.colour = NA) +
  labs(title = "") +
  # Change figure orientation
  coord_flip() +
  # add custom color palette and legend title
  scale_colour_gradientn(name = "MAT", colours = rgb.palette(10)) +
  theme(legend.text = element_text(size = 10),
        legend.title = element_text(size = 12)) +
  # add y-axis title and leave x-axis title blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # add plot title
  ggtitle("Variability of Soil SOC in relation to MAT in four NLCD") +
  # customize plot themes
  theme(axis.line = element_line(colour = "black"),
        # plot title position at center
        plot.title = element_text(hjust = 0.5),
        # axis title font size
        axis.title.x = element_text(size = 14),
        # X and Y axis text font size
        axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5, colour = 'black'),
        axis.text.x = element_text(size = 12))
Tukey’s method
Another common method for identifying outliers is Tukey’s method, which is based on the concept of the IQR. Divide the data set into Quartiles:
Q1 (the 1st quartile): 25% of the data are less than or equal to this value
Q3 (the 3rd quartile): 25% of the data are greater than or equal to this value
IQR (the interquartile range): the distance between Q3 – Q1, it contains the middle 50% of the data.
Outliers are then defined as any values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR:
detect_outliers <- function(x){
  ## Check if data is numeric in nature
  if(!(class(x) %in% c("numeric", "integer"))) {
    stop("Data provided must be of integer/numeric type")
  }
  ## Calculate lower limit
  lower.limit <- as.numeric(quantile(x)[2] - IQR(x) * 1.5)
  ## Calculate upper limit
  upper.limit <- as.numeric(quantile(x)[4] + IQR(x) * 1.5)
  ## Retrieve index of elements which are outliers
  lower.index <- which(x < lower.limit)
  upper.index <- which(x > upper.limit)
  ## print results
  cat(" Lower Limit ", lower.limit, "\n",
      "Upper Limit ", upper.limit, "\n",
      "Lower range outliers ", x[lower.index], "\n",
      "Upper range outliers ", x[upper.index])
}
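For example, applying the function to soil SOC prints the Tukey limits and any values outside them (a usage sketch; the exact output depends on the data):
Code
# Apply Tukey's method to soil SOC
detect_outliers(mf$SOC)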
The Z-score method is another common statistical approach for identifying outliers. In R, the outlier() function from the outliers package returns the value with the largest difference from the sample mean, which can be considered a potential outlier.
install.packages("outliers")
Code
#library(outliers)
# Identify outliers using the Z-score method
outlier(mf$SOC, opposite = F)
[1] 30.473
Descriptive Statistics
Descriptive statistics are a set of methods used to summarize and describe the characteristics of a dataset. Typically, they include measures of central tendency, such as mean, median, and mode; measures of dispersion, such as variance, standard deviation, and range; and measures of shape, such as skewness and kurtosis. These statistics help to understand data distribution and make meaningful inferences about it.
The summary() function in base R produces summary statistics for a data frame. For numeric columns it reports the minimum, first quartile, median, mean, third quartile, and maximum; for character columns it reports the length, class, and mode.
The basic syntax for using the summary() function is as follows:
Code
summary(mf)
ID FIPS STATE_ID STATE
Min. : 1.0 Min. : 8001 Min. : 8.0 Length:467
1st Qu.:120.5 1st Qu.: 8109 1st Qu.: 8.0 Class :character
Median :238.0 Median :20193 Median :20.0 Mode :character
Mean :237.9 Mean :29151 Mean :29.1
3rd Qu.:356.5 3rd Qu.:56001 3rd Qu.:56.0
Max. :473.0 Max. :56045 Max. :56.0
COUNTY Longitude Latitude SOC
Length:467 Min. :-111.01 Min. :31.51 Min. : 0.408
Class :character 1st Qu.:-107.54 1st Qu.:37.19 1st Qu.: 2.769
Mode :character Median :-105.31 Median :38.75 Median : 4.971
Mean :-104.48 Mean :38.86 Mean : 6.351
3rd Qu.:-102.59 3rd Qu.:41.04 3rd Qu.: 8.713
Max. : -94.92 Max. :44.99 Max. :30.473
DEM Aspect Slope TPI
Min. : 258.6 Min. : 86.89 Min. : 0.6493 Min. :-26.708651
1st Qu.:1175.3 1st Qu.:148.79 1st Qu.: 1.4507 1st Qu.: -0.804917
Median :1592.9 Median :164.04 Median : 2.7267 Median : -0.042385
Mean :1632.0 Mean :165.34 Mean : 4.8399 Mean : 0.009365
3rd Qu.:2238.3 3rd Qu.:178.81 3rd Qu.: 7.1353 3rd Qu.: 0.862883
Max. :3618.0 Max. :255.83 Max. :26.1042 Max. : 16.706257
KFactor MAP MAT NDVI
Min. :0.0500 Min. : 193.9 Min. :-0.5911 Min. :0.1424
1st Qu.:0.1933 1st Qu.: 354.2 1st Qu.: 5.8697 1st Qu.:0.3083
Median :0.2800 Median : 433.8 Median : 9.1728 Median :0.4172
Mean :0.2558 Mean : 500.9 Mean : 8.8795 Mean :0.4368
3rd Qu.:0.3200 3rd Qu.: 590.7 3rd Qu.:12.4443 3rd Qu.:0.5566
Max. :0.4300 Max. :1128.1 Max. :16.8743 Max. :0.7970
SiltClay NLCD FRG log_SOC
Min. : 9.162 Length:467 Length:467 Min. :-0.3893
1st Qu.:42.933 Class :character Class :character 1st Qu.: 0.4424
Median :52.194 Mode :character Mode :character Median : 0.6964
Mean :53.813 Mean : 0.6599
3rd Qu.:62.878 3rd Qu.: 0.9402
Max. :89.834 Max. : 1.4839
sqrt_SOC inv_SOC bc_SOC
Min. :0.6387 Min. :0.03282 Min. :-0.8181
1st Qu.:1.6642 1st Qu.:0.11477 1st Qu.: 1.1342
Median :2.2296 Median :0.20117 Median : 1.9023
Mean :2.3353 Mean :0.33227 Mean : 1.8915
3rd Qu.:2.9518 3rd Qu.:0.36108 3rd Qu.: 2.7321
Max. :5.5202 Max. :2.45098 Max. : 4.9709
Summary table with {dplyr}
In order to present a comprehensive overview of important statistics, we will be utilizing a variety of functions from different packages. Specifically, we will be using the select() and summarise_all() functions from the {dplyr} package, as well as the pivot_longer() function from the {tidyr} package. This will allow us to efficiently organize and summarize the data in a way that is both informative and easily digestible.
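A minimal sketch of this pipeline; the choice of variables and statistics here is purely illustrative:
Code
library(dplyr)
library(tidyr)
mf |>
  # select a few numeric variables
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |>
  # compute several statistics for every selected column
  dplyr::summarise_all(list(mean = mean, median = median, sd = sd, min = min, max = max)) |>
  # reshape into a long, readable table
  tidyr::pivot_longer(everything(),
                      names_to = c("variable", "statistic"),
                      names_sep = "_",
                      values_to = "value")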
The {gtsummary} package is a powerful tool that enables effortless creation of analytical and summary tables using the R programming language. It boasts a high degree of flexibility and elegance, making it a popular choice among data analysts and researchers alike. The package provides an array of sensible defaults for summarizing a variety of data sets, regression models, and statistical analyses, which can be easily customized to suit specific needs. With its intuitive syntax and comprehensive documentation, {gtsummary} is a user-friendly option for generating publication-ready tables in a wide range of research fields.
To create a summary table, the tbl_summary() function of {gtsummary} takes a data frame as input and returns a publication-ready summary table, as sketched below.
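A minimal sketch; the variables selected and the statistics requested are illustrative choices:
Code
library(gtsummary)
mf |>
  dplyr::select(SOC, DEM, MAP, MAT, NDVI, NLCD) |>
  gtsummary::tbl_summary(
    # summary statistics by land-cover class
    by = NLCD,
    # report mean (SD) for continuous variables
    statistic = all_continuous() ~ "{mean} ({sd})")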
You can use the get_summary_stats() function from {rstatix} to return summary statistics in a data frame format. This can be helpful for performing subsequent operations or plotting on the numbers. See the Simple statistical tests page for more details on the {rstatix} package and its functions.
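A sketch that produces the data frame shown below, assuming the same five numeric variables are selected and the "common" set of statistics is requested:
Code
library(rstatix)
mf |>
  # select a few numeric variables
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |>
  # common summary statistics (n, min, max, median, iqr, mean, sd, se, ci)
  rstatix::get_summary_stats(type = "common")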
# A tibble: 5 × 10
variable n min max median iqr mean sd se ci
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SOC 467 0.408 30.5 4.97 5.94e+0 6.35e+0 5.04 0.233 0.459
2 DEM 467 259. 3618. 1593. 1.06e+3 1.63e+3 770. 35.6 70.0
3 MAP 467 194. 1128. 434. 2.37e+2 5.01e+2 207. 9.58 18.8
4 MAT 467 -0.591 16.9 9.17 6.58e+0 8.88e+0 4.10 0.19 0.373
5 NDVI 467 0.142 0.797 0.417 2.48e-1 4.37e-1 0.162 0.007 0.015
Summary statistics by group with {plyr}
Before creating bar plots, we are going to calculate summary statistics of soil SOC grouped by land cover. We will use the ddply() function from the {plyr} package.
Code
# Standard error
SE <- function(x){
  sd(x) / sqrt(length(x))
}
# Get summary statistics
summarise_soc <- plyr::ddply(mf, ~NLCD, summarise,
                             Mean   = round(mean(SOC), 3),
                             Median = round(median(SOC), 3),
                             Min    = round(min(SOC), 3),
                             Max    = round(max(SOC), 3),
                             SD     = round(sd(SOC), 3),
                             SE     = round(SE(SOC), 3))
# Load library
library(flextable)
# Create a table
flextable::flextable(summarise_soc, theme_fun = theme_booktabs)
NLCD                   Mean   Median    Min      Max      SD     SE
Forest               10.431    8.974   1.333   30.473   6.802  0.705
Herbaceous            5.477    4.609   0.408   18.814   3.925  0.320
Planted/Cultivated    6.697    6.230   0.462   16.336   3.598  0.365
Shrubland             4.131    2.996   0.446   19.099   3.745  0.332
Barplot
Code
#| warning: false
#| error: false
#| fig.width: 5
#| fig.height: 4.5
ggplot(summarise_soc, aes(x = NLCD, y = Mean)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.5, fill = "gray") +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = .2,
                position = position_dodge(.9)) +
  # add y-axis title and leave x-axis title blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # add plot title
  ggtitle("Mean ± SE of Soil SOC") +
  coord_flip() +
  # customize plot themes
  theme(axis.line = element_line(colour = "gray"),
        # plot title position at center
        plot.title = element_text(hjust = 0.5),
        # axis title font size
        axis.title.x = element_text(size = 14),
        # X and Y axis text font size
        axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5, colour = 'black'),
        axis.text.x = element_text(size = 12))
T-test
A t-test is a statistical hypothesis test that is used to determine if there is a significant difference between the means of two groups or samples. It is a commonly used parametric test assuming the data follows a normal distribution.
One-sample t-test
The one-sample t-test is used to determine if a sample mean is significantly different from a known population mean.
\[ t = \frac{\bar{X} - \mu_0}{\frac{s}{\sqrt{n}}} \]
\(t\) is the t-statistic.
\(\bar{X}\) is the sample mean.
\(\mu_0\) is the hypothesized population mean.
\(s\) is the sample standard deviation.
\(n\) is the number of observations in the sample.
The t.test() function below is used to determine if a sample mean is significantly different from a known population mean.
Code
set.seed(42) # Setting seed for reproducibility
# sample of 50 observations
sample_data <- sample(as_df$GAs, 50)
sample_mean <- mean(sample_data)
# population mean (here assumed to be the mean of the full GAs column)
population_mean <- mean(as_df$GAs)

# One-sample t-test (testing if the sample mean is significantly different from population_mean)
t_test_result <- t.test(sample_data, mu = population_mean)
print(t_test_result)
One Sample t-test
data: sample_data
t = -0.033051, df = 49, p-value = 0.9738
alternative hypothesis: true mean is not equal to 1.36003
95 percent confidence interval:
1.214169 1.501171
sample estimates:
mean of x
1.35767
In this example, a one-sample t-test was performed to determine if the mean of the sample is significantly different from the true (population) mean. The t.test() function computes the test statistic and the p-value and provides the information needed to decide whether to reject the null hypothesis. Since the p-value is greater than the level of significance (α = 0.05), we fail to reject the null hypothesis, meaning the sample mean and the population mean are not significantly different.
Two-sample t-test
A two-sample t-test is a statistical test used to determine if two sets of data are significantly different from each other. It is typically used when comparing the means of two groups of data to determine if they come from populations with different means. The two-sample t-test is based on the assumption that the two populations are normally distributed and have equal variances. The test compares the means of the two samples and calculates a t-value based on the difference between the means and the standard error of the means.
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
\(t\) is the t-statistic.
\(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two independent samples.
\(s_1\) and \(s_2\) are the standard deviations of the two independent samples.
\(n_1\) and \(n_2\) are the sizes of the two independent samples.
The null hypothesis of the test is that the means of the two populations are equal, while the alternative hypothesis is that the means are not equal. If the calculated t-value is large enough and the associated p-value is below a chosen significance level (e.g. 0.05), then the null hypothesis is rejected in favor of the alternative hypothesis, indicating that there is a significant difference between the means of the two populations.
There are several variations of the two-sample t-test, including the independent samples t-test and the paired samples t-test, which are used in different situations depending on the nature of the data being compared.
In this example, a t-test will be performed to determine if the mean rice yield in low- and high-arsenic soil is significantly different.
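The call below is a sketch of the test that produces the output that follows. It assumes Low and High are subsets of the rice data for low- and high-arsenic soil and that GY is grain yield; note that the output reflects a paired, one-sided test rather than an independent two-sample test:
Code
# Paired, one-sided t-test: is mean yield in low-arsenic soil greater than in high-arsenic soil?
t_test_result <- t.test(Low$GY, High$GY,
                        paired = TRUE,
                        alternative = "greater")
print(t_test_result)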
Paired t-test
data: Low$GY and High$GY
t = 12.472, df = 69, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
16.96835 Inf
sample estimates:
mean difference
19.5866
Since the p-value is lower than the level of significance (α = 0.05), we reject the null hypothesis, meaning the yields in low- and high-arsenic soil are not equal.
Summary and Conclusion
The present tutorial offers a comprehensive guide to data exploration and visualization using R, providing invaluable insights into the techniques and tools that facilitate a deeper understanding of datasets. It highlights the significance of exploratory data analysis (EDA) and subsequently explores various R packages for creating impactful visualizations.
The tutorial covers fundamental exploratory data analysis techniques, including summary statistics, distribution visualizations, and correlation analysis. It delves into R’s built-in functions and packages like ggplot2 for generating a wide array of visualizations, emphasizing their flexibility and customization options.
Visualization plays a pivotal role in the tutorial, showcasing how plots and charts can reveal patterns, trends, and outliers in data. From scatter plots and histograms to boxplots and heatmaps, the expressive power of R is leveraged to create visuals that enhance the comprehension of complex datasets.
Furthermore, the tutorial stresses the importance of effective communication through visual storytelling. It discusses principles of good visualization design, ensuring that graphics convey insights and do so clearly and compellingly.
It is essential to note that data exploration and visualization are iterative processes that can lead to new hypotheses and guide subsequent analyses. To this end, practitioners should practice with diverse datasets, experiment with different plot types, and adapt visualization techniques to the unique characteristics of their data. Armed with the skills acquired in this tutorial, practitioners are optimally prepared to embark on exploratory data analysis journeys and visually communicate the narratives within their datasets.