Descriptive Statistics With R

This tutorial is designed to provide a comprehensive introduction to descriptive statistics using the R programming language. It covers essential concepts, techniques, and practical applications, making it suitable for beginners and those looking to refresh their skills. By the end of this tutorial, you will have a solid understanding of descriptive statistics and the ability to apply these concepts using R.

Introduction

Descriptive statistics is the branch of statistics concerned with describing and summarizing data. Its core measures are the mean, median, and mode, which describe central tendency, and the range, variance, and standard deviation, which describe dispersion. Together they give a concise summary of a dataset and help reveal patterns, trends, and outliers that are hard to see in the raw values, so that researchers can draw better-grounded conclusions.

This tutorial covers the basics of descriptive statistics and shows how to compute them in R.

Data

All data sets used in this exercise can be downloaded from my Dropbox or GitHub accounts.

We will use the read.csv() function to import the data.

Code

df<-read.csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/rice_arsenic_data.csv", header=TRUE)
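After importing, it helps to confirm that the columns used later in this tutorial (GY, TREAT, and GAs) arrived with the expected types. The sketch below does not touch the real file: it writes a tiny, made-up CSV to a temporary file and reads it back the same way, so the data frame names (`toy`, `toy_df`) and all values are hypothetical.

```r
# Hypothetical stand-in for the real file, with the three columns used later
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  TREAT = c("Low As", "Low As", "High As", "High As"),
  GY    = c(40.1, 36.8, 19.2, 18.5),   # made-up yield values
  GAs   = c(0.86, 0.92, 1.62, 1.85)    # made-up GAs values
)
write.csv(toy, tmp, row.names = FALSE)

# Import in the same way as above, then inspect structure and first rows
toy_df <- read.csv(tmp, header = TRUE)
str(toy_df)
head(toy_df)

# Count missing values per column before computing any statistics
colSums(is.na(toy_df))
```

The same three checks (str(), head(), and the count of missing values) can be run on df once it has been loaded.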

Central tendency

Central tendency is a statistical measure that assists in describing the center point of a set of data values. This concept is used to identify a single value that is considered most representative of the entire distribution. By determining the central tendency, we can gain insights into the typical or common values in a dataset. Mean, median, and mode are the three most commonly used measures of central tendency.

Mean

The average or mean is a statistical measure that is used to determine the central tendency of a set of data. It is obtained by adding up all the quantities or values (\(X\)) in the set and dividing the total by the number of values in the set (\(n\)).

\[ \bar{X} = \frac{\sum_{i=1}^{n}X_i}{n} \]

For example, let’s say you have a set of numbers like 2, 4, 6, 8, and 10. To find the average, you would add all the numbers together (2 + 4 + 6 + 8 + 10 = 30) and then divide the sum by the number of values in the set, which in this case is 5. The resulting average is 6.
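The worked example can be reproduced in R. This is a minimal sketch on the five numbers from the text; the variable names (`x`, `y`, `manual_mean`) are chosen here for illustration only.

```r
x <- c(2, 4, 6, 8, 10)

# Mean from the definition: sum of the values divided by their count
manual_mean <- sum(x) / length(x)
manual_mean            # 6

# Built-in equivalent
mean(x)                # 6

# A missing value makes mean() return NA unless na.rm = TRUE is set
y <- c(x, NA)
mean(y)                # NA
mean(y, na.rm = TRUE)  # 6
```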

The mean is a useful measure because it provides a single value that can represent the entire data set. It is commonly used in various fields such as mathematics, science, economics, and social sciences to analyze and interpret data.

The overall mean yield can be calculated with the mean() function:

Code

mean(df$GY)

[1] 28.66472

We can use the aggregate() function to calculate the mean yield for each soil group (TREAT):

Code

aggregate(df$GY, list(df$TREAT), FUN=mean)

   Group.1        x
1 High As  18.87141
2   Low As 38.45802
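The generic column names Group.1 and x in this output come from passing bare vectors to aggregate(). As a sketch of one alternative, the formula interface keeps the original column names, and tapply() returns the same group means as a named vector. The data frame `toy` below is made up for illustration and only mimics the column names of the real data.

```r
# Made-up data with the same column names as the tutorial's data frame
toy <- data.frame(
  TREAT = c("Low As", "Low As", "Low As", "High As", "High As", "High As"),
  GY    = c(40, 38, 36, 20, 18, 16)
)

# Formula interface: the result keeps the names TREAT and GY
by_group <- aggregate(GY ~ TREAT, data = toy, FUN = mean)
by_group

# tapply() returns the same group means as a named vector
group_means <- tapply(toy$GY, toy$TREAT, mean)
group_means
```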

Median

In statistics, the median is a measure of central tendency that identifies the middle value in a set of data when the values are arranged in ascending or descending order. It is a valuable statistical tool that helps to describe the center of a data set, particularly when extreme values could skew the mean. When there is an odd number of values in a data set, the median is simply the middle number. In contrast, when there is an even number of values, the median is calculated by taking the average of the two middle numbers. This method ensures that the median is representative of the central tendency of the data set, regardless of its size or distribution.

For an odd number of observations:

\[ \text{Median} = X_{\left(\frac{n+1}{2}\right)} \]

For an even number of observations:

\[ \text{Median} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} \]

Overall, the median provides a clear and concise way to summarize data sets, especially those with outliers or extreme values that would distort the mean.
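The two formulas can be checked by hand on small vectors. The sketch below uses made-up numbers and illustrative variable names; it sorts the values, applies the odd and even cases directly, and compares the results with the built-in median().

```r
# Odd number of observations: the median is the middle sorted value
odd <- c(7, 1, 9, 3, 5)
s_odd <- sort(odd)                      # 1 3 5 7 9
n <- length(s_odd)
median_odd <- s_odd[(n + 1) / 2]        # 5

# Even number of observations: average the two middle sorted values
even <- c(7, 1, 9, 3, 5, 11)
s_even <- sort(even)                    # 1 3 5 7 9 11
m <- length(s_even)
median_even <- (s_even[m / 2] + s_even[m / 2 + 1]) / 2   # 6

# Both agree with the built-in function
median(odd)      # 5
median(even)     # 6

# An extreme value shifts the mean far more than the median
skewed <- c(1, 3, 5, 7, 900)
mean(skewed)     # 183.2
median(skewed)   # 5
```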

The median yield can be calculated with the median() function:

Code

median(df$GY)

[1] 25.31435

Mode

In statistics, the mode is the most frequently occurring value in a given data set. It is a measure of central tendency distinct from the mean and the median: while the mean represents the average value and the median represents the middle value, the mode represents the value that occurs most often. It is particularly useful when analyzing categorical data, or when identifying the most common occurrence within a set of values. A data set can have one mode, or more than one mode if several values tie for the highest frequency (such data sets are called bimodal, trimodal, and so on). On the other hand, if no value repeats, then the data set has no mode at all. R has no built-in function for the statistical mode, so we define a small helper function below.

Code

# Function to calculate mode
getMode <- function(x) {
  unique_x <- unique(x)
  unique_x_counts <- tabulate(match(x, unique_x))
  modes <- unique_x[unique_x_counts == max(unique_x_counts)]
  return(modes)
}

Code

getMode(df$GAs)

  [1] 0.8626439 0.8442584 1.1382471 1.0445282 0.6864139 0.9225152 1.3024528
  [8] 0.9856521 1.1396095 1.0329973 0.7110621 1.0986138 1.1189786 1.2905821
 [15] 0.7069316 0.7160826 0.9956423 0.5887321 1.0425092 0.7932789 1.0306465
 [22] 1.1162025 0.9969186 0.7132064 0.7953316 1.1065553 0.7504826 0.9446115
 [29] 0.8857691 0.6111798 1.0299259 1.3400145 0.6858245 1.2091302 1.1016161
 [36] 0.9714230 1.3367415 0.9938958 0.8661776 0.8207372 1.0642598 0.9213504
 [43] 0.9263001 0.6607738 0.8722756 1.0963266 1.1990640 0.8338511 1.1611438
 [50] 0.7425673 0.8984300 0.6114510 0.9712233 0.7329860 1.0132400 1.0140004
 [57] 0.9764986 1.0602700 1.2119881 1.1988252 1.1181223 0.5185456 0.8842597
 [64] 1.1619926 1.0757041 0.9912792 0.9437969 0.7107184 0.8294098 1.1508683
 [71] 1.6231217 1.5660547 1.8428277 1.4896635 1.6508503 1.6179264 2.1908554
 [78] 2.0650441 2.3612137 1.8171474 1.9164473 1.5735034 1.3295403 1.4838319
 [85] 1.2041797 2.4180126 1.9079189 1.7974379 1.2566674 1.5391449 1.5162307
 [92] 1.5944369 1.5957580 1.4321779 2.1396505 2.1154823 1.9615839 2.1904747
 [99] 1.6266759 1.7863219 2.1467053 1.5166320 2.0127050 1.9478259 0.8160187
[106] 1.7647205 1.7466764 1.5973735 1.5057603 1.9386023 1.7742176 2.4119100
[113] 1.8184004 1.0208152 2.0205704 2.0859089 2.1196892 1.6072958 1.6001129
[120] 1.4951643 2.0637763 1.9480449 1.3037925 2.1083372 1.7728649 2.2382360
[127] 1.0389582 1.2658095 1.4859489 1.5745825 1.5635693 2.1146381 2.0684290
[134] 1.8610596 1.8416452 2.2144129 1.9130549 1.6303901 1.7462559 2.2134436
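The long output above is what getMode() returns when many values tie for the highest count, which is likely for a continuous variable such as GAs where values seldom repeat. The sketch below repeats the helper so that it runs on its own and applies it to small made-up vectors where the result is easy to verify by eye.

```r
# Same helper as above, repeated so this block is self-contained
getMode <- function(x) {
  unique_x <- unique(x)
  unique_x_counts <- tabulate(match(x, unique_x))
  unique_x[unique_x_counts == max(unique_x_counts)]
}

getMode(c(2, 4, 4, 6, 8))        # one mode: 4
getMode(c(2, 4, 4, 6, 6, 8))     # two modes: 4 and 6
getMode(c(1.2, 3.4, 5.6))        # no repeats: every value is returned

# For categorical data, table() shows the counts behind the mode
soil <- c("Low As", "High As", "Low As", "Low As")   # made-up labels
table(soil)
getMode(soil)                    # "Low As"
```

For continuous measurements, the mode is usually more informative after grouping the values, for example by rounding or binning them first.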

Range

Measuring the range in statistics is a simple yet useful approach to understanding the dispersion of a dataset. The range refers to the difference between the highest and lowest values in a given dataset, providing insights into the variability or spread of the data. To calculate the range, you need to subtract the smallest value from the largest value in the dataset. This measure can give you a quick idea of how widely the numbers in the dataset are spread apart.

For instance, consider the numbers 3, 7, 12, 15, and 20. The highest value is 20 and the lowest is 3, so the range is 20 - 3 = 17. In other words, all five values lie within an interval 17 units wide.
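The same numbers can be checked in R. This is a small sketch with illustrative variable names; the extra value 200 is made up to show how strongly a single extreme observation affects the range.

```r
x <- c(3, 7, 12, 15, 20)

range(x)                  # smallest and largest value: 3 20
diff(range(x))            # 17
max(x) - min(x)           # 17

# A single extreme value changes the range completely
x_out <- c(x, 200)
diff(range(x_out))        # 197
```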

However, it’s worth noting that the range doesn’t provide information about how the values are distributed within that range. For this reason, other measures, such as standard deviation or interquartile range, might be used to determine the distribution of values within the range. These measures can provide a more comprehensive understanding of the dataset’s characteristics, allowing you to draw more accurate conclusions about the data.

Code

# Calculate range: range() returns c(min, max), so diff() gives max - min
result_range <- diff(range(df$GY))
result_range

[1] 60.58729

Code

# Calculate range by subtracting the minimum from the maximum
result_range <- max(df$GY) - min(df$GY)
result_range

[1] 60.58729

Variance

Variance is a statistical concept used to measure the degree of variability among the values in a given dataset. In simpler terms, it measures how far the actual values in a dataset are from the average value of the dataset. A high variance indicates that the values are widely spread out from the mean, while a low variance indicates that the values are closely clustered around the mean. Variance is a crucial metric in data analysis as it enables us to understand the data distribution and make informed decisions based on the insights obtained.

To calculate the variance:

Find the mean of the data set.
Subtract the mean from each data point to find the difference.
Square each difference.
Find the mean of those squared differences: divide their sum by \(n\) for a population, or by \(n-1\) for a sample.

Variance (\(\sigma^2\)) or (\(s^2\)):

Population Variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{n} \]

Sample Variance:

\[ s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} \]

For population variance (\(\sigma^2\)), (\(X_i\)) represents individual data points, (\(\mu\)) is the population mean, and (\(n\)) is the number of data points.
For sample variance (\(s^2\)), (\(X_i\)) represents individual data points, (\(\bar{X}\)) is the sample mean, and (\(n\)) is the number of data points in the sample.

In R, you can calculate the variance of a dataset using the var() function, which returns the sample variance (denominator \(n-1\)).

Code

var(df$GY)

[1] 180.5147
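To see how var() relates to the population and sample formulas, the sketch below computes both by hand on a small made-up vector. The variable names are illustrative; the point is that var() divides by n - 1, so the population variance has to be computed separately when it is needed.

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Squared deviations from the mean
sq_dev <- (x - mean(x))^2             # 16 4 0 4 16

# Sample variance: divide by n - 1
sample_var <- sum(sq_dev) / (n - 1)   # 10

# Population variance: divide by n
pop_var <- sum(sq_dev) / n            # 8

# var() matches the sample version
var(x)                                # 10
```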

Standard Deviation

The standard deviation is a statistical measure that provides information about the amount of variation or dispersion present in a set of values. It is a metric that quantifies the degree to which the values in a dataset are spread out around the mean. A higher standard deviation indicates that the values are more widely distributed, while a lower standard deviation indicates that the values are clustered more closely around the mean. Standard deviation is an important tool in data analysis because it helps to identify outliers and understand the shape of the distribution of values in a dataset.

To calculate the standard deviation:

Find the mean of the data set.
For each data point, find the difference between it and the mean.
Square each of these differences.
Find the mean of the squared differences: divide their sum by \(n\) for a population, or by \(n-1\) for a sample.
Take the square root of the result.

Standard Deviation (\(\sigma\)) or (\(s\)):

Population Standard Deviation:

\[ \sigma = \sqrt{\sigma^2} \]

Sample Standard Deviation:

\[ s = \sqrt{s^2} \]

Explanation:
- For population standard deviation (\(\sigma\)), it is the square root of the population variance.
- For sample standard deviation (\(s\)), it is the square root of the sample variance.

In R, you can calculate the standard deviation of a dataset using the sd() function, which returns the sample standard deviation (the square root of var()).

Code

sd(df$GY)

[1] 13.43558
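The relationship between sd() and var() can be verified on a small made-up vector. In this sketch the population standard deviation is obtained by rescaling the sample variance; the variable names are illustrative only.

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Sample standard deviation: square root of the sample variance
sample_sd <- sqrt(var(x))
all.equal(sample_sd, sd(x))          # TRUE

# Population standard deviation: rescale the sample variance by (n - 1) / n
pop_sd <- sqrt(var(x) * (n - 1) / n)
pop_sd                               # square root of 8, about 2.83
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.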

Quantile

Quantiles are statistical measures used to divide a given dataset into equal portions. They are mainly used to analyze data by dividing it into smaller segments based on their relative position in the entire dataset. For instance, the median is a type of quantile that represents the 50th percentile value, which splits the data into two equal halves. Similarly, other common quantiles include quartiles, which are used to divide data into four equal parts, and deciles, which divide data into ten equal parts. These measures enable analysts to better understand the distribution of data and identify patterns and trends that may not be apparent when looking at the data as a whole.

Code

# Finding specific quantiles
q_25 <- quantile(df$GY, 0.25)  # 25th percentile (first quartile)
q_50 <- quantile(df$GY, 0.50)   # 50th percentile (median)
q_75 <- quantile(df$GY, 0.75)  # 75th percentile (third quartile)


q_25

     25% 
18.72015

Code

q_50

     50% 
25.31435

Code

q_75

     75% 
39.95412

Code

# quartiles
quantile(df$GY)

       0%       25%       50%       75%      100% 
 4.749097 18.720147 25.314346 39.954118 65.336384

Code

# deciles
quantile(df$GY, probs=seq(0, 1, by=0.1))

       0%       10%       20%       30%       40%       50%       60%       70% 
 4.749097 13.766805 16.922771 19.706930 21.885601 25.314346 29.763025 35.715022 
      80%       90%      100% 
43.021497 47.105559 65.336384
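The behavior of quantile() is easier to follow on a vector small enough to check by eye. The sketch below uses the integers 1 to 9 as made-up data. Note that quantile() interpolates between observations by default, so its results can differ slightly from hand calculations or from other software that uses a different rule.

```r
x <- 1:9

# Quartiles (the default probs are 0, 0.25, 0.5, 0.75, 1)
quantile(x)                                  # 1 3 5 7 9

# The 50% quantile is the median
q50 <- quantile(x, probs = 0.5)
unname(q50) == median(x)                     # TRUE

# Several quantiles in one call; values are interpolated
quantile(x, probs = c(0.1, 0.9))             # 1.8 8.2

# names = FALSE drops the percentage labels when only the numbers are needed
quantile(x, probs = c(0.25, 0.75), names = FALSE)   # 3 7
```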

Interquartile range (IQR)

The interquartile range (IQR) measures the spread of the middle 50% of a dataset. It is calculated as the difference between the upper quartile (Q3) and the lower quartile (Q1). Quartiles are values that divide a dataset into quarters, with Q1 representing the 25th percentile and Q3 representing the 75th percentile. The IQR is often reported alongside the median to describe the spread of the data. It is a robust measure of spread because it depends only on the middle half of the data and is therefore largely unaffected by extreme values, or outliers.

Code

# Calculating quartiles
Q1 <- quantile(df$GY, 0.25)  # First quartile (25th percentile)
Q3 <- quantile(df$GY, 0.75)  # Third quartile (75th percentile)

# Calculating IQR
IQR_value <- Q3 - Q1

IQR_value

     75% 
21.23397
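The 75% label in this output is only a name carried over from Q3; the value itself is the IQR. R also provides a built-in IQR() function in the stats package. The sketch below, on made-up data with one extreme value, checks that both approaches agree and illustrates why the IQR is considered robust.

```r
x <- c(1:9, 100)     # made-up data with one extreme value

# IQR from the two quartiles, without the percentage labels
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr_manual <- q[2] - q[1]

# Built-in equivalent
iqr_builtin <- IQR(x)
all.equal(iqr_manual, iqr_builtin)    # TRUE

# The extreme value inflates the range but leaves the IQR unchanged here
diff(range(x))       # 99
IQR(1:10)            # 4.5
IQR(x)               # 4.5
```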

Summary and Conclusion

This tutorial introduced descriptive statistics using R. We explained the basic concepts of central tendency, variability, and distribution and showed how to calculate key descriptive statistics with R functions. Descriptive statistics are essential for making data-driven decisions, identifying trends, and communicating insights to a broader audience. With a solid understanding of descriptive statistics in R, you can base decisions on a deeper understanding of your data.
