This tutorial will provide a comprehensive understanding of data exploration and visualization in R. It will cover the fundamental concepts, techniques, and tools used in this field. You will learn how to import data into R, clean and preprocess it, and create visualizations that allow you to explore and understand your data meaningfully. Additionally, you will discover how to manipulate and transform data, conduct statistical analyses, and interpret your results. Whether you are a beginner or an experienced data analyst, this tutorial will equip you with the skills and knowledge necessary to explore and visualize your data using R effectively.
Check and Install Required Packages
In this exercise we will use the following R packages (a short script to check, install, and load them follows the list):
tidyverse: The tidyverse is a collection of R packages designed for data science.
data.table: data.table provides a high-performance version of base R's data.frame.
plyr: plyr is a set of tools for a common set of problems.
flextable: The flextable package provides a framework for easily creating tables for reporting and publications.
gtsummary: Provides an elegant and flexible way to create publication-ready analytical and summary tables.
report: report's primary goal is to bridge the gap between R's output and the formatted results contained in your manuscript.
nortest: Five omnibus tests for testing the composite hypothesis of normality.
outliers: A collection of tests commonly used for identifying outliers.
moments: Functions to calculate moments, Pearson's kurtosis, Geary's kurtosis and skewness, plus tests related to them (Anscombe-Glynn, D'Agostino, Bonett-Seier).
ggside: The package ggside was designed to enable users to add metadata to their ggplots with ease.
RColorBrewer: Provides color schemes for maps (and other graphics).
patchwork: The goal of patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic.
gridExtra: Provides user-level functions to work with grid graphics, notably to arrange multiple grid-based plots (such as ggplots) on a page.
rstatix: Provides a simple and intuitive pipe-friendly framework, coherent with the tidyverse design philosophy, for performing basic statistical tests, including t-test, Wilcoxon test, ANOVA, Kruskal-Wallis and correlation analyses.
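The snippet below is a minimal sketch for checking, installing, and loading these packages in one step (the package-name vector simply mirrors the list above):
Code
# Required packages (as listed above)
packages <- c("tidyverse", "data.table", "plyr", "flextable", "gtsummary",
              "report", "nortest", "outliers", "moments", "ggside",
              "RColorBrewer", "patchwork", "gridExtra", "rstatix")
# Install any packages that are not already installed
new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
# Load all packages
invisible(lapply(packages, library, character.only = TRUE))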
In R, the str() function is a useful tool for examining the structure of an object. It provides a compact and informative description of the object’s internal structure, including its type, length, dimensions, and contents.
Code
str(mf.na)
spc_tbl_ [471 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:471] 1 2 3 4 5 6 7 8 9 10 ...
$ FIPS : num [1:471] 56041 56023 56039 56039 56029 ...
$ STATE_ID : num [1:471] 56 56 56 56 56 56 56 56 56 56 ...
$ STATE : chr [1:471] "Wyoming" "Wyoming" "Wyoming" "Wyoming" ...
$ COUNTY : chr [1:471] "Uinta County" "Lincoln County" "Teton County" "Teton County" ...
$ Longitude: num [1:471] -111 -111 -111 -111 -111 ...
$ Latitude : num [1:471] 41.1 42.9 44.5 44.4 44.8 ...
$ SOC : num [1:471] 15.8 15.9 18.1 10.7 10.5 ...
$ DEM : num [1:471] 2229 1889 2423 2484 2396 ...
$ Aspect : num [1:471] 159 157 169 198 201 ...
$ Slope : num [1:471] 5.67 8.91 4.77 7.12 7.95 ...
$ TPI : num [1:471] -0.0857 4.5591 2.6059 5.1469 3.7557 ...
$ KFactor : num [1:471] 0.32 0.261 0.216 0.182 0.126 ...
$ MAP : num [1:471] 468 536 860 869 803 ...
$ MAT : num [1:471] 4.595 3.86 0.886 0.471 0.759 ...
$ NDVI : num [1:471] 0.414 0.694 0.547 0.619 0.584 ...
$ SiltClay : num [1:471] 64.8 72 57.2 55 51.2 ...
$ NLCD : chr [1:471] "Shrubland" "Shrubland" "Forest" "Forest" ...
$ FRG : chr [1:471] "Fire Regime Group IV" "Fire Regime Group IV" "Fire Regime Group V" "Fire Regime Group V" ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. FIPS = col_double(),
.. STATE_ID = col_double(),
.. STATE = col_character(),
.. COUNTY = col_character(),
.. Longitude = col_double(),
.. Latitude = col_double(),
.. SOC = col_double(),
.. DEM = col_double(),
.. Aspect = col_double(),
.. Slope = col_double(),
.. TPI = col_double(),
.. KFactor = col_double(),
.. MAP = col_double(),
.. MAT = col_double(),
.. NDVI = col_double(),
.. SiltClay = col_double(),
.. NLCD = col_character(),
.. FRG = col_character()
.. )
- attr(*, "problems")=<externalptr>
Missing values
Missing values in a dataset can significantly impact the accuracy and reliability of data analysis. Therefore, dealing with missing values is a crucial and challenging task in data analysis.
In R, the is.na() function is used to test for missing values (NA) in an object. It returns a logical vector indicating which elements of the object are missing.
Code
sum(is.na(mf.na))
[1] 4
We can also return the column names containing missing observations.
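For example, base R's colSums() combined with is.na() lists the columns that contain missing values (a minimal sketch using the mf.na data frame loaded above):
Code
# Number of missing values per column
colSums(is.na(mf.na))
# Names of the columns that contain at least one missing value
names(mf.na)[colSums(is.na(mf.na)) > 0]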
Several methods are available to handle missing values in data, including imputation methods such as mean imputation, median imputation, mode imputation, and regression imputation. Another approach is removing the rows or columns containing missing values, known as deletion methods. The choice of method depends on the specific dataset and the analysis goals. It’s essential to consider each approach carefully, as each method has advantages and disadvantages. Furthermore, the method chosen should be able to handle missing values without introducing bias or losing essential information. Therefore, assessing the quality of imputed values and the impact of missing values on the analysis outcome is essential.
Here are some common strategies for handling missing data:
Delete the rows or columns with missing data: This is the simplest approach, but it can result in the loss of valuable information if there are many missing values.
Imputation: Imputation involves filling in the missing values with estimated values based on other data points. Standard imputation techniques include mean imputation, median imputation, and regression imputation.
Multiple imputation: This involves creating several imputed datasets and analyzing each separately, then combining the results to obtain a final estimate.
Modeling: This involves building a predictive model that can estimate the missing values based on other variables in the dataset.
Domain-specific knowledge: Sometimes, domain-specific knowledge can guide how missing values should be handled. In this exercise, we only show how to remove missing values from a data frame. Later we will show you how to impute these missing values with machine learning.
Remove missing values
We can remove missing values from a data frame using the drop_na() function from the tidyr package. This function removes any rows that contain missing values (NA values) in any column. A shorthand alternative is to simply use na.omit() to omit all rows containing missing values.
Here’s an example to remove missing values:
Code
mf <- mf.na |> tidyr::drop_na()
sum(is.na(mf))
[1] 0
Impute missing values
An easy way to impute missing values is to replace them with the mean or median of the observed values, as sketched below.
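A minimal sketch of mean (or median) imputation with dplyr, using the SOC column purely for illustration (any column with missing values could be substituted):
Code
library(dplyr)
# Replace missing SOC values with the mean of the observed values;
# use median(SOC, na.rm = TRUE) instead for median imputation
mf.imputed <- mf.na |>
  mutate(SOC = ifelse(is.na(SOC), mean(SOC, na.rm = TRUE), SOC))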
Visual inspection of data distribution is an essential step in exploratory data analysis (EDA). It involves creating visualizations to understand the underlying patterns, central tendencies, and spread of the data. Several common techniques for visually inspecting data distribution include:
Histogram
Kernel density Plots
Quantile-Quantile Plots (qq-plot)
Histograms
A histogram is a graphical representation of the distribution of numerical data. They are commonly used in statistics to show the frequency distribution of a data set for identifying patterns, outliers, and skewness in the data. Histograms are also helpful in visualizing the shape of a distribution, such as whether it is symmetric or skewed, and whether it has one or more peaks.
In R, the hist() function is used to create histograms of numerical data. It takes one or more numeric vectors as input and displays a graphical representation of the distribution of the data in the form of a histogram.
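For example, a minimal sketch of a base R histogram of soil SOC:
Code
# Histogram of soil SOC
hist(mf$SOC,
     # plot title
     main = "Histogram of Soil SOC",
     # x-axis title
     xlab = "Soil SOC (mg/g)",
     # y-axis title
     ylab = "Frequency")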
We can also create histograms using the ggplot2 package, which is a powerful data visualization tool based on The Grammar of Graphics.
ggplot2 allows you to create complex and customizable graphics using a layered approach, which involves building a plot by adding different layers of visual elements. It is part of the tidyverse collection of packages.
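A corresponding ggplot2 sketch with geom_histogram() (the bin width of 1 is an arbitrary choice for illustration):
Code
ggplot(mf, aes(SOC)) +
  geom_histogram(binwidth = 1, fill = "gray", colour = "black") +
  # axis titles
  xlab("SOC (mg/g)") +
  ylab("Frequency") +
  # plot title
  ggtitle("Histogram of Soil SOC") +
  theme(plot.title = element_text(hjust = 0.5))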
Kernel Density Plots
A kernel density plot (or kernel density estimation plot) is a non-parametric way of visualizing the probability density function of a variable. It represents a smoothed version of a histogram, where the density estimate is obtained by convolving the data with a kernel function. The kernel density plot is a helpful tool for exploring data distribution, identifying outliers, and comparing distributions.
In practice, kernel density estimation involves selecting a smoothing parameter (known as the bandwidth) that determines the width of the kernel function. The choice of bandwidth can significantly impact the resulting density estimate, and various methods have been developed to select the optimal bandwidth for a given dataset.
To create a density plot in R, you can use the density() function to estimate the density of a variable, and then plot it using the plot() function. If you want to use ggplot2, you can create a density plot with the geom_density() function, like this:
# estimate the density
p <- density(mf$SOC, na.rm = TRUE)
# plot density
plot(p,
     # plot title
     main = "Kernel Density of Soil SOC",
     # x-axis title
     xlab = "Soil SOC (mg/g)",
     # y-axis title
     ylab = "Density")
Code
ggplot(mf, aes(SOC)) +
  geom_density() +
  # x-axis title
  xlab("SOC (mg/g)") +
  # y-axis title
  ylab("Density") +
  # plot title
  ggtitle("Kernel Density of Soil SOC") +
  theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5))
QQ-Plot
A Q-Q plot (quantile-quantile plot) is a graphical method used to compare two probability distributions by plotting their quantiles against each other. The two distributions being compared are first sorted in ascending order to create a Q-Q plot. Then, the corresponding quantile in the other distribution is calculated for each data point in one distribution. The resulting pairs of quantiles are plotted on a scatter plot, with one distribution’s quantiles on the x-axis and the other on the y-axis. If the two distributions being compared are identical, the points will fall along a straight line.
Note
Probability distribution is a function that describes the likelihood of obtaining different outcomes in a random event. It gives a list of all possible outcomes and their corresponding probabilities. The probabilities assigned to each outcome must add up to 1, as one of the possible outcomes will always occur.
In R, you can create a Q-Q plot using the two functions:
qqnorm() is a generic function the default method of which produces a normal QQ plot of the values in y.
qqline() adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
We can create a QQ plot using the ggplot() and stat_qq() functions and then use stat_qq_line() function to add a line indicating the expected values under the assumption of a normal distribution.
# draw normal QQ plot
qqnorm(mf$SOC,
       # plot title
       main = "Q-Q plot of Soil SOC from a Normal Distribution",
       # x-axis title
       xlab = "Theoretical Quantiles",
       # y-axis title
       ylab = "Sample Quantiles")
# Add reference or theoretical line
qqline(mf$SOC)
Code
ggplot(mf, aes(sample = SOC)) +
  stat_qq() +
  stat_qq_line() +
  # x-axis title
  xlab("Theoretical Quantiles") +
  # y-axis title
  ylab("Sample Quantiles") +
  # plot title
  ggtitle("Q-Q plot of Soil SOC from a Normal Distribution") +
  theme(
    # Center the plot title
    plot.title = element_text(hjust = 0.5),
    # x-axis title font size
    axis.title.x = element_text(size = 14),
    # y-axis title font size
    axis.title.y = element_text(size = 14))
Normality Test
In statistics and data analysis, normality tests are commonly used to determine whether a given data sample is derived from a normally distributed population. A normal distribution is a bell-shaped curve that represents a set of data in which the majority of the observations fall near the mean, with fewer observations at the tails of the distribution. Normality tests provide us with evidence for or against the assumption that the data is normally distributed, but they do not prove normality. Therefore, it is crucial to visually inspect the data and consider the context in which it was collected before assuming normality, even if the data passes a normality test. By doing so, we can ensure that we have a comprehensive understanding of the data and avoid any misleading conclusions.
Statistical tests for normality include:
Shapiro-Wilk test: The Shapiro-Wilk test is a commonly used test to check for normality. The test calculates a W statistic. If the p-value associated with the test is greater than a chosen significance level (usually 0.05), it suggests that the data is normally distributed.
Anderson-Darling test: The Anderson-Darling test is another test that can be used to check for normality. The test calculates an A2 statistic, and if the p-value associated with the test is greater than a chosen significance level, it suggests that the data is normally distributed.
Kolmogorov-Smirnov test: The Kolmogorov-Smirnov (KS) test is a statistical test used to determine if two datasets have the same distribution. The test compares the empirical cumulative distribution functions (ECDFs) of the two datasets and calculates the maximum vertical distance between them. The KS test statistic is this maximum distance, and the p-value is calculated based on the distribution of the test statistic.
The shapiro.test() function can be used to perform the Shapiro-Wilk test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value.
Code
shapiro.test(mf$SOC)
Shapiro-Wilk normality test
data: mf$SOC
W = 0.87237, p-value < 2.2e-16
Since the p-value is lower than 0.05, we can conclude that the data is not normally distributed.
The ad.test() function from the nortest package can be used to perform the Anderson-Darling test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value. It is important to note that the Anderson-Darling test is also sensitive to sample size, and may not be accurate for small sample sizes. Additionally, the test may not be appropriate for non-normal distributions with heavy tails or skewness. Therefore, it is recommended to use the test in conjunction with visual inspection of the data to determine if it follows a normal distribution.
install.packages("nortest")
Code
#library(nortest)
nortest::ad.test(mf$SOC)
Anderson-Darling normality test
data: mf$SOC
A = 16.22, p-value < 2.2e-16
Since the p-value is lower than 0.05, we can conclude that the data is not normally distributed.
The ks.test() function can be used to perform the Kolmogorov-Smirnov test for normality. The function takes a numeric vector as input and returns a list containing the test statistic and p-value. pnorm specifies the cumulative distribution function to use as the reference distribution for the test.
Code
ks.test(mf$SOC, "pnorm")
Asymptotic one-sample Kolmogorov-Smirnov test
data: mf$SOC
D = 0.81199, p-value < 2.2e-16
alternative hypothesis: two-sided
The test result shows that the maximum distance between the empirical CDF of the data and the standard normal CDF is about 0.81, and the p-value is very small, so we can reject the null hypothesis that the sample is drawn from a standard normal distribution.
Skewness and Kurtosis
Skewness is a measure of the asymmetry of a probability distribution. In other words, it measures how much a distribution deviates from symmetry or normal distribution. A positive skewness indicates that the distribution has a longer right tail, while a negative skewness indicates that the distribution has a longer left tail. A skewness of zero indicates a perfectly symmetric distribution.
Kurtosis measures how much a distribution deviates from a normal distribution in terms of the concentration of scores around the mean and in the tails. A distribution with higher kurtosis than the normal distribution is more peaked with heavier tails, while lower kurtosis indicates a flatter distribution with lighter tails. A normal distribution has a kurtosis of 3 (an excess kurtosis of zero), which is the scale reported by the kurtosis() function in the moments package.
We will use the skewness() and kurtosis() functions from the moments package in R:
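A minimal sketch (the reported values depend on the data):
Code
library(moments)
# Skewness of soil SOC
moments::skewness(mf$SOC, na.rm = TRUE)
# Kurtosis of soil SOC (a normal distribution has a kurtosis of about 3)
moments::kurtosis(mf$SOC, na.rm = TRUE)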
The high positive skewness value indicates that the distribution has a long right tail, which means that some sites have unusually high soil SOC values.
Likewise, the high kurtosis value indicates that soil SOC is not normally distributed.
Data Transformation for Normality
Data transformation is a statistical technique that involves modifying the structure of a dataset to improve its usefulness for analysis. This method involves adjusting the scale or shape of the data distribution to make it more appropriate for statistical analysis. The goal of data transformation is often to create a more normal distribution of the data, which can make it easier to apply many common statistical models and methods. By transforming data, researchers can gain a better understanding of their data and make more accurate inferences about the underlying population.
Here are some common data transformations used to achieve normality:
Logarithmic transformation: If the data are skewed to the right, taking the data’s logarithm can help reduce the skewness and make the distribution more normal.
\(log10(x)\) for positively skewed data,
\(log10(max(x+1) - x)\) for negatively skewed data
Square root transformation: Similar to logarithmic transformation, taking the square root of the data can help reduce the distribution’s skewness.
\(sqrt(x)\) for positively skewed data,
\(sqrt(max(x+1) - x)\) for negatively skewed data
Inverse transformation: If the data are skewed to the left, taking the inverse of the data (1/x) can help to reduce the skewness and make the distribution more normal.
Box-Cox transformation: The Box-Cox transformation is a method of data transformation that is used to stabilize the variance of a dataset and make it more normally distributed. The Box-Cox transformation involves applying a power transformation to the data, which is determined by a parameter λ.
The Box-Cox transformation is defined by the power transformation:
\[ y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases} \]
Here, \(y\) represents the original data, and \(\lambda\) is a parameter that controls the transformation. The Box-Cox transformation aims to find the value of \(\lambda\) that maximizes the normality and homoscedasticity of the transformed data.
When choosing a data transformation method, it is important to consider the type of data, the purpose of the analysis, and the assumptions of the statistical method being used. It is also important to check the normality of the transformed data using visual inspection (histogram) and statistical tests, such as the Shapiro-Wilk test or the Anderson-Darling test.
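The sketch below applies the logarithmic and square-root transformations to soil SOC and then runs the Shapiro-Wilk test on the square-root-transformed values; the log_SOC and sqrt_SOC columns created here also appear in the summary output later in this tutorial:
Code
# Logarithmic transformation
mf$log_SOC <- log10(mf$SOC)
# Square root transformation
mf$sqrt_SOC <- sqrt(mf$SOC)
# histogram of the square-root-transformed values
hist(mf$sqrt_SOC,
     main = "Histogram of Square Root of Soil SOC",
     xlab = "Square Root of Soil SOC",
     ylab = "Frequency")
# Shapiro-Wilk test
shapiro.test(mf$sqrt_SOC)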
Shapiro-Wilk normality test
data: mf$sqrt_SOC
W = 0.97494, p-value = 3.505e-07
Code
# Inverse transformation
mf$inv_SOC <- (1 / mf$SOC)
# histogram
hist(mf$inv_SOC,
     # plot title
     main = "Histogram of Inverse of Soil SOC",
     # x-axis title
     xlab = "Inverse of Soil SOC",
     # y-axis title
     ylab = "Frequency")
Code
# Shapiro-Wilk test
shapiro.test(mf$inv_SOC)
Shapiro-Wilk normality test
data: mf$inv_SOC
W = 0.66851, p-value < 2.2e-16
First, we have to calculate the appropriate transformation parameter using the powerTransform() function of the car package and then use this parameter to transform the data with the bcPower() function.
install.packages("car")
Code
#library(car)
# Get appropriate power - lambda
power <- car::powerTransform(mf$SOC)
power$lambda
mf$SOC
0.2072527
Code
# power transformation
mf$bc_SOC <- car::bcPower(mf$SOC, power$lambda)
# histogram
hist(mf$bc_SOC,
     # plot title
     main = "Histogram of BoxCox Soil SOC",
     # x-axis title
     xlab = "BoxCox of Soil SOC",
     # y-axis title
     ylab = "Frequency")
Code
# Shapiro-Wilk test
shapiro.test(mf$bc_SOC)
Shapiro-Wilk normality test
data: mf$bc_SOC
W = 0.99393, p-value = 0.05912
Outliers
Outliers are data points that have values significantly different from the other data points in a dataset. They can be caused by various factors, such as measurement errors, experimental errors, or rare extreme values. Outliers can have a significant impact on the analysis of a dataset, as they can skew the results and lead to inaccurate conclusions.
There are various methods for detecting outliers, such as statistical methods, visual analysis, or machine learning algorithms. However, outlier detection is not always straightforward, and different methods may identify different outliers. Therefore, it’s crucial to understand the nature of the data and the context in which it was collected to choose the most appropriate method for outlier detection. Moreover, it’s essential to determine whether the outliers are genuine data points or errors. In some cases, outliers may be valid and provide valuable insights into the underlying patterns or phenomena. In other cases, outliers may be artifacts of the data collection process or measurement errors and need to be removed or corrected.
In summary, outlier detection is an important step in data analysis, but it requires careful consideration and judgement to ensure that the results are reliable and accurate.
In this exercise we will discuss some common methods of detecting outliers, but we will not remove them. Here are some common methods for identifying outliers:
Visual inspection with Boxplot
Plotting the data and visually inspecting it can often reveal outliers. A box plot, also known as a box-and-whisker plot, is a useful visualization for this purpose. It is a graphical representation of a dataset that displays its median, quartiles, and any potential outliers. Boxplots help identify a dataset's distribution, spread, and potential outliers, and they are often used in exploratory data analysis to gain insight into the characteristics of the data. Here's how to read a boxplot:
• The box represents the interquartile range (IQR), which is the range between the 25th and 75th percentiles of the data. The length of the box represents the range of the middle 50% of the data.
• The line inside the box represents the median of the data, which is the middle value when the data is sorted.
• The whiskers extend from the box and represent the range of the data outside the IQR. They can be calculated in different ways depending on the method used to draw the boxplot. Some common methods include:
Tukey’s method: The whiskers extend to the most extreme data point within 1.5 times the IQR of the box.
Min/max whiskers: The whiskers extend to the minimum and maximum values in the data that are not considered outliers.
5-number summary: The whiskers extend to the minimum and maximum values in the data, excluding outliers.
• Outliers are data points that fall outside the whiskers of the boxplot. They are represented as individual points on the plot.
ggplot2 can also create boxplots with the ggplot() and geom_boxplot() functions. You can customize the plot further by modifying labels, colors, and themes, and by adding more layers to the plot using ggplot2 functions and additional aesthetic mappings (aes()).
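A minimal boxplot sketch of soil SOC by land-cover class:
Code
ggplot(mf, aes(x = NLCD, y = SOC)) +
  geom_boxplot(fill = "gray") +
  # y-axis title, x-axis title left blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # plot title
  ggtitle("Boxplot of Soil SOC by NLCD Class") +
  coord_flip() +
  theme(plot.title = element_text(hjust = 0.5))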
Now check how many observations are considered outliers:
Code
length(boxplot.stats(mf$SOC)$out)
[1] 20
Box-Jitter plot
A box-jitter plot is a type of data visualization that combines elements of both a boxplot and a jitter plot. It is useful for displaying both the distribution and the individual data points of a dataset.
The jitter component of the plot is used to display the individual data points. Jittering refers to adding a small amount of random noise to the position of each data point along the x-axis, to prevent overlapping and to give a better sense of the density of the data.
Overall, the box-jitter plot provides a useful summary of the distribution of a dataset, while also allowing viewers to see the individual data points and any potential outliers. It is particularly useful for displaying data with a large number of observations or for comparing multiple distributions side by side.
We will use the ggplot2 package to create box-jitter plots to explore the variability of soil SOC among the land-cover classes. We will create a custom color palette using the colorRampPalette() function from the RColorBrewer package.
install.packages("RColorBrewer")
Code
# Create a custom color palette
library(RColorBrewer)
rgb.palette <- colorRampPalette(c("red", "yellow", "green", "blue"), space = "rgb")
# Create plot
ggplot(mf, aes(y = SOC, x = NLCD)) +
  geom_point(aes(colour = MAT), size = I(1.7),
             position = position_jitter(width = 0.05, height = 0.05)) +
  geom_boxplot(fill = NA, outlier.colour = NA) +
  labs(title = "") +
  # Change figure orientation
  coord_flip() +
  # add custom color palette and legend title
  scale_colour_gradientn(name = "MAT", colours = rgb.palette(10)) +
  theme(legend.text = element_text(size = 10),
        legend.title = element_text(size = 12)) +
  # add y-axis title and leave x-axis title blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # add plot title
  ggtitle("Variability of Soil SOC in relation to MAT in four NLCD") +
  # customize plot themes
  theme(axis.line = element_line(colour = "black"),
        # plot title position at center
        plot.title = element_text(hjust = 0.5),
        # axis title font size
        axis.title.x = element_text(size = 14),
        # X and Y axis text font size
        axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5, colour = 'black'),
        axis.text.x = element_text(size = 12))
Tukey’s method
Another common method for identifying outliers is Tukey’s method, which is based on the concept of the IQR. Divide the data set into Quartiles:
Q1 (the 1st quartile): 25% of the data are less than or equal to this value
Q3 (the 3rd quartile): 25% of the data are greater than or equal to this value
IQR (the interquartile range): the distance between Q3 – Q1, it contains the middle 50% of the data.
Outliers are then defined as any values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR:
detect_outliers <- function(x){
  ## Check if data is numeric in nature
  if(!(class(x) %in% c("numeric", "integer"))) {
    stop("Data provided must be of integer/numeric type")
  }
  ## Calculate lower limit
  lower.limit <- as.numeric(quantile(x)[2] - IQR(x) * 1.5)
  ## Calculate upper limit
  upper.limit <- as.numeric(quantile(x)[4] + IQR(x) * 1.5)
  ## Retrieve index of elements which are outliers
  lower.index <- which(x < lower.limit)
  upper.index <- which(x > upper.limit)
  ## print results
  cat(" Lower Limit ", lower.limit, "\n",
      "Upper Limit ", upper.limit, "\n",
      "Lower range outliers ", x[lower.index], "\n",
      "Upper range outliers ", x[upper.index])
}
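For example, applying the function to soil SOC prints the Tukey limits and any values outside them (a usage sketch; the exact output depends on the data):
Code
# Apply Tukey's method to soil SOC
detect_outliers(mf$SOC)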
The Z-score method is another common statistical approach for identifying outliers. In R, the outlier() function from the outliers package returns the value with the largest difference from the sample mean, which can be considered a potential outlier.
install.packages("outliers")
Code
#library(outliers)
# Identify outliers using the Z-score method
outlier(mf$SOC, opposite = F)
[1] 30.473
Descriptive Statistics
Descriptive statistics are a set of methods used to summarize and describe the characteristics of a dataset. Typically, they include measures of central tendency, such as mean, median, and mode; measures of dispersion, such as variance, standard deviation, and range; and measures of shape, such as skewness and kurtosis. These statistics help to understand data distribution and make meaningful inferences about it.
The summary() function in base R produces summary statistics for a data frame. For numeric columns it reports the minimum, first quartile, median, mean, third quartile, and maximum; for character columns it reports the length, class, and mode.
The basic syntax for using the summary() function is as follows:
Code
summary(mf)
ID FIPS STATE_ID STATE
Min. : 1.0 Min. : 8001 Min. : 8.0 Length:467
1st Qu.:120.5 1st Qu.: 8109 1st Qu.: 8.0 Class :character
Median :238.0 Median :20193 Median :20.0 Mode :character
Mean :237.9 Mean :29151 Mean :29.1
3rd Qu.:356.5 3rd Qu.:56001 3rd Qu.:56.0
Max. :473.0 Max. :56045 Max. :56.0
COUNTY Longitude Latitude SOC
Length:467 Min. :-111.01 Min. :31.51 Min. : 0.408
Class :character 1st Qu.:-107.54 1st Qu.:37.19 1st Qu.: 2.769
Mode :character Median :-105.31 Median :38.75 Median : 4.971
Mean :-104.48 Mean :38.86 Mean : 6.351
3rd Qu.:-102.59 3rd Qu.:41.04 3rd Qu.: 8.713
Max. : -94.92 Max. :44.99 Max. :30.473
DEM Aspect Slope TPI
Min. : 258.6 Min. : 86.89 Min. : 0.6493 Min. :-26.708651
1st Qu.:1175.3 1st Qu.:148.79 1st Qu.: 1.4507 1st Qu.: -0.804917
Median :1592.9 Median :164.04 Median : 2.7267 Median : -0.042385
Mean :1632.0 Mean :165.34 Mean : 4.8399 Mean : 0.009365
3rd Qu.:2238.3 3rd Qu.:178.81 3rd Qu.: 7.1353 3rd Qu.: 0.862883
Max. :3618.0 Max. :255.83 Max. :26.1042 Max. : 16.706257
KFactor MAP MAT NDVI
Min. :0.0500 Min. : 193.9 Min. :-0.5911 Min. :0.1424
1st Qu.:0.1933 1st Qu.: 354.2 1st Qu.: 5.8697 1st Qu.:0.3083
Median :0.2800 Median : 433.8 Median : 9.1728 Median :0.4172
Mean :0.2558 Mean : 500.9 Mean : 8.8795 Mean :0.4368
3rd Qu.:0.3200 3rd Qu.: 590.7 3rd Qu.:12.4443 3rd Qu.:0.5566
Max. :0.4300 Max. :1128.1 Max. :16.8743 Max. :0.7970
SiltClay NLCD FRG log_SOC
Min. : 9.162 Length:467 Length:467 Min. :-0.3893
1st Qu.:42.933 Class :character Class :character 1st Qu.: 0.4424
Median :52.194 Mode :character Mode :character Median : 0.6964
Mean :53.813 Mean : 0.6599
3rd Qu.:62.878 3rd Qu.: 0.9402
Max. :89.834 Max. : 1.4839
sqrt_SOC inv_SOC bc_SOC
Min. :0.6387 Min. :0.03282 Min. :-0.8181
1st Qu.:1.6642 1st Qu.:0.11477 1st Qu.: 1.1342
Median :2.2296 Median :0.20117 Median : 1.9023
Mean :2.3353 Mean :0.33227 Mean : 1.8915
3rd Qu.:2.9518 3rd Qu.:0.36108 3rd Qu.: 2.7321
Max. :5.5202 Max. :2.45098 Max. : 4.9709
Summary table with {dplyr}
In order to present a comprehensive overview of important statistics, we will be utilizing a variety of functions from different packages. Specifically, we will be using the select() and summarise_all() functions from the {dplyr} package, as well as the pivot_longer() function from the {tidyr} package. This will allow us to efficiently organize and summarize the data in a way that is both informative and easily digestible.
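A minimal sketch of this pipeline; the choice of variables and statistics here is purely illustrative:
Code
library(dplyr)
library(tidyr)
mf |>
  # select a few numeric variables
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |>
  # compute several statistics for every selected column
  dplyr::summarise_all(list(mean = mean, median = median, sd = sd, min = min, max = max)) |>
  # reshape into a long, readable table
  tidyr::pivot_longer(everything(),
                      names_to = c("variable", "statistic"),
                      names_sep = "_",
                      values_to = "value")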
The {gtsummary} package is a powerful tool that enables effortless creation of analytical and summary tables using the R programming language. It boasts a high degree of flexibility and elegance, making it a popular choice among data analysts and researchers alike. The package provides an array of sensible defaults for summarizing a variety of data sets, regression models, and statistical analyses, which can be easily customized to suit specific needs. With its intuitive syntax and comprehensive documentation, {gtsummary} is a user-friendly option for generating publication-ready tables in a wide range of research fields.
To create a summary table, the tbl_summary() function of {gtsummary} takes a data frame as input and returns a publication-ready summary table, as sketched below.
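A minimal sketch; the variables selected and the statistics requested are illustrative choices:
Code
library(gtsummary)
mf |>
  dplyr::select(SOC, DEM, MAP, MAT, NDVI, NLCD) |>
  gtsummary::tbl_summary(
    # summary statistics by land-cover class
    by = NLCD,
    # report mean (SD) for continuous variables
    statistic = all_continuous() ~ "{mean} ({sd})")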
You can use the get_summary_stats() function from {rstatix} to return summary statistics in a data frame format. This can be helpful for performing subsequent operations or plotting on the numbers. See the Simple statistical tests page for more details on the {rstatix} package and its functions.
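A sketch that produces the data frame shown below, assuming the same five numeric variables are selected and the "common" set of statistics is requested:
Code
library(rstatix)
mf |>
  # select a few numeric variables
  dplyr::select(SOC, DEM, MAP, MAT, NDVI) |>
  # common summary statistics (n, min, max, median, iqr, mean, sd, se, ci)
  rstatix::get_summary_stats(type = "common")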
# A tibble: 5 × 10
variable n min max median iqr mean sd se ci
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SOC 467 0.408 30.5 4.97 5.94e+0 6.35e+0 5.04 0.233 0.459
2 DEM 467 259. 3618. 1593. 1.06e+3 1.63e+3 770. 35.6 70.0
3 MAP 467 194. 1128. 434. 2.37e+2 5.01e+2 207. 9.58 18.8
4 MAT 467 -0.591 16.9 9.17 6.58e+0 8.88e+0 4.10 0.19 0.373
5 NDVI 467 0.142 0.797 0.417 2.48e-1 4.37e-1 0.162 0.007 0.015
Summary statistics by group with {plyr}
Before creating bar plots, we are going to calculate summary statistics of soil SOC grouped by land cover. We will use the ddply() function from the {plyr} package.
Code
# Standard error
SE <- function(x){
  sd(x) / sqrt(length(x))
}
# Get summary statistics
summarise_soc <- plyr::ddply(mf, ~NLCD, summarise,
                             Mean   = round(mean(SOC), 3),
                             Median = round(median(SOC), 3),
                             Min    = round(min(SOC), 3),
                             Max    = round(max(SOC), 3),
                             SD     = round(sd(SOC), 3),
                             SE     = round(SE(SOC), 3))
# Load library
library(flextable)
# Create a table
flextable::flextable(summarise_soc, theme_fun = theme_booktabs)
NLCD                   Mean   Median    Min      Max      SD     SE
Forest               10.431    8.974   1.333   30.473   6.802  0.705
Herbaceous            5.477    4.609   0.408   18.814   3.925  0.320
Planted/Cultivated    6.697    6.230   0.462   16.336   3.598  0.365
Shrubland             4.131    2.996   0.446   19.099   3.745  0.332
Barplot
Code
#| warning: false
#| error: false
#| fig.width: 5
#| fig.height: 4.5
ggplot(summarise_soc, aes(x = NLCD, y = Mean)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.5, fill = "gray") +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = .2,
                position = position_dodge(.9)) +
  # add y-axis title and leave x-axis title blank
  labs(y = "Soil SOC (mg/g)", x = "") +
  # add plot title
  ggtitle("Mean ± SE of Soil SOC") +
  coord_flip() +
  # customize plot themes
  theme(axis.line = element_line(colour = "gray"),
        # plot title position at center
        plot.title = element_text(hjust = 0.5),
        # axis title font size
        axis.title.x = element_text(size = 14),
        # X and Y axis text font size
        axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5, colour = 'black'),
        axis.text.x = element_text(size = 12))
T-test
A t-test is a statistical hypothesis test that is used to determine if there is a significant difference between the means of two groups or samples. It is a commonly used parametric test assuming the data follows a normal distribution.
One-sample t-test
The one-sample t-test is used to determine if a sample mean is significantly different from a known population mean.
\[ t = \frac{\bar{X} - \mu_0}{\frac{s}{\sqrt{n}}} \]
\(t\) is the t-statistic.
\(\bar{X}\) is the sample mean.
\(\mu_0\) is the hypothesized population mean.
\(s\) is the sample standard deviation.
\(n\) is the number of observations in the sample.
The t.test() function below is used to determine if a sample mean is significantly different from a known population mean.
Code
set.seed(42) # Setting seed for reproducibility
# sample of 50 observations
sample_data <- sample(as_df$GAs, 50)
sample_mean <- mean(sample_data)
# population mean (here assumed to be the mean of the full GAs column)
population_mean <- mean(as_df$GAs)

# One-sample t-test (testing if the sample mean is significantly different from population_mean)
t_test_result <- t.test(sample_data, mu = population_mean)
print(t_test_result)
One Sample t-test
data: sample_data
t = -0.033051, df = 49, p-value = 0.9738
alternative hypothesis: true mean is not equal to 1.36003
95 percent confidence interval:
1.214169 1.501171
sample estimates:
mean of x
1.35767
In this example, a one-sample t-test was performed to determine if the mean of the sample is significantly different from the true (population) mean. The t.test() function computes the test statistic and the p-value and provides the information needed to decide whether to reject the null hypothesis. Since the p-value is greater than the level of significance (α = 0.05), we fail to reject the null hypothesis, meaning the sample mean and the population mean are not significantly different.
Two-sample t-test
A two-sample t-test is a statistical test used to determine if two sets of data are significantly different from each other. It is typically used when comparing the means of two groups of data to determine if they come from populations with different means. The two-sample t-test is based on the assumption that the two populations are normally distributed and have equal variances. The test compares the means of the two samples and calculates a t-value based on the difference between the means and the standard error of the means.
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
\(t\) is the t-statistic.
\(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two independent samples.
\(s_1\) and \(s_2\) are the standard deviations of the two independent samples.
\(n_1\) and \(n_2\) are the sizes of the two independent samples.
The null hypothesis of the test is that the means of the two populations are equal, while the alternative hypothesis is that the means are not equal. If the calculated t-value is large enough and the associated p-value is below a chosen significance level (e.g. 0.05), then the null hypothesis is rejected in favor of the alternative hypothesis, indicating that there is a significant difference between the means of the two populations.
There are several variations of the two-sample t-test, including the independent samples t-test and the paired samples t-test, which are used in different situations depending on the nature of the data being compared.
In this example, a t-test will be performed to determine if the mean rice yield in low- and high-arsenic soil is significantly different.
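The call below is a sketch of the test that produces the output that follows. It assumes Low and High are subsets of the rice data for low- and high-arsenic soil and that GY is grain yield; note that the output reflects a paired, one-sided test rather than an independent two-sample test:
Code
# Paired, one-sided t-test: is mean yield in low-arsenic soil greater than in high-arsenic soil?
t_test_result <- t.test(Low$GY, High$GY,
                        paired = TRUE,
                        alternative = "greater")
print(t_test_result)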
Paired t-test
data: Low$GY and High$GY
t = 12.472, df = 69, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
16.96835 Inf
sample estimates:
mean difference
19.5866
Since the p-value is lower than the level of significance (α = 0.05), we reject the null hypothesis, meaning the yields in low- and high-arsenic soil are not equal.
Summary and Conclusion
The present tutorial offers a comprehensive guide to data exploration and visualization using R, providing invaluable insights into the techniques and tools that facilitate a deeper understanding of datasets. It highlights the significance of exploratory data analysis (EDA) and subsequently explores various R packages for creating impactful visualizations.
The tutorial covers fundamental exploratory data analysis techniques, including summary statistics, distribution visualizations, and correlation analysis. It delves into R’s built-in functions and packages like ggplot2 for generating a wide array of visualizations, emphasizing their flexibility and customization options.
Visualization plays a pivotal role in the tutorial, showcasing how plots and charts can reveal patterns, trends, and outliers in data. From scatter plots and histograms to boxplots and heatmaps, the expressive power of R is leveraged to create visuals that enhance the comprehension of complex datasets.
Furthermore, the tutorial stresses the importance of effective communication through visual storytelling. It discusses principles of good visualization design, ensuring that graphics convey insights and do so clearly and compellingly.
It is essential to note that data exploration and visualization are iterative processes that can lead to new hypotheses and guide subsequent analyses. To this end, practitioners should practice with diverse datasets, experiment with different plot types, and adapt visualization techniques to the unique characteristics of their data. Armed with the skills acquired in this tutorial, practitioners are optimally prepared to embark on exploratory data analysis journeys and visually communicate the narratives within their datasets.