ANOVA is widely used in various fields, including agriculture, biology, psychology, and social sciences, to analyze experimental data and draw conclusions about the effects of different treatments or factors on a response variable. Here you will learn how to perform ANOVA in R.
Introduction
Analysis of Variance (ANOVA) is a statistical technique used to compare the means of three or more groups, enabling us to determine whether there are significant differences between them. This method is widely used in research to help us understand whether the differences we observe among group means are due to genuine differences in the populations or simply random sampling variation.
ANOVA works by breaking down the total variance observed in the data into different components: the variance between group means and the variance within each group. It then examines whether the between-group variance is significantly larger than the within-group variance. If it is, this indicates that there are real differences among the groups.
There are various types of ANOVA, including one-way ANOVA, which compares the means of three or more groups on a single independent variable, and two-way ANOVA, which examines the influence of two independent variables. More complex designs, such as factorial ANOVA, can be used for multiple factors.
During ANOVA, an F-statistic is generated, which compares the ratio of the between-group variance to the within-group variance. If the calculated F-value exceeds a critical value based on the chosen significance level, usually 0.05, it suggests that there is a statistically significant difference between at least two group means.
Post-hoc tests, such as Tukey’s HSD or Bonferroni correction, are often used after ANOVA to pinpoint which specific groups differ from each other if the overall ANOVA result is significant. It’s important to note that ANOVA makes certain assumptions, such as the data following a normal distribution and the groups having equal variances. Violations of these assumptions may affect the reliability of the ANOVA results.
Overall, ANOVA is a powerful tool for comparing means across multiple groups simultaneously. It’s commonly used in various fields, such as psychology, biology, economics, and more to analyze experimental or observational data. By comparing group means, ANOVA helps us to understand the differences between groups and identify significant patterns in the data.
ANOVA tests whether the means of three or more groups are statistically different. It decomposes the total variability into:
Between-Group Variability (SSB): Differences due to group means.
Within-Group Variability (SSW): Differences due to individual observations within groups.
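These two components can be computed directly. The sketch below (with made-up numbers) decomposes a small dataset by hand, checks that SST = SSB + SSW, and forms the F-statistic from the two mean squares:

```r
# A from-scratch sketch of the variance decomposition, using made-up data
y <- c(4.1, 5.0, 4.6, 6.2, 5.8, 6.5, 7.1, 6.9, 7.4)
g <- factor(rep(c("A", "B", "C"), each = 3))

grand.mean  <- mean(y)
group.means <- tapply(y, g, mean)
n.i         <- tapply(y, g, length)

ssb <- sum(n.i * (group.means - grand.mean)^2)    # between-group (SSB)
ssw <- sum((y - group.means[as.character(g)])^2)  # within-group (SSW)
sst <- sum((y - grand.mean)^2)                    # total (SST = SSB + SSW)

# F = MSB / MSW
k <- nlevels(g); N <- length(y)
f.stat <- (ssb / (k - 1)) / (ssw / (N - k))
f.stat
```

The same F-value is what `aov()` reports for these data; the hand computation and the built-in function agree.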
When conducting a research study with multiple groups, it is important to consider several factors to ensure accurate and reliable results. The first consideration is the independence of the observations. Each subject should belong to only one group, and there should be no relationship between the observations in each group. This means that having repeated measures for the same participants is not allowed, as this could introduce bias into the results.
The second consideration is the presence of any significant outliers in any design cell. Outliers are observations that fall far outside the expected range of values and can significantly impact the results of the study. Identifying and removing any outliers before analyzing the data is important to ensure accurate results.
The third consideration is normality. The data for each design cell should be approximately normally distributed, meaning that the data should follow a bell-shaped curve when plotted on a graph. Non-normal data can be transformed to meet this assumption, but this should be done carefully and with consideration of the impact on the results.
Finally, it is important to ensure homogeneity of variances. This means that the variance of the outcome variable should be equal in every design cell. If the variances are not equal, the study results could be affected, and it may be necessary to use alternative statistical methods to analyze the data.
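These last assumptions can be checked quickly in R. The sketch below uses simulated data (not the tutorial's dataset): `shapiro.test()` on the model residuals checks normality, and `bartlett.test()` checks homogeneity of variances:

```r
# Illustrative sketch on simulated data: checking ANOVA assumptions
set.seed(42)
y <- c(rnorm(15, mean = 5), rnorm(15, mean = 6), rnorm(15, mean = 7))
g <- factor(rep(c("A", "B", "C"), each = 15))
fit <- aov(y ~ g)

# Normality of residuals: Shapiro-Wilk test
shapiro.test(residuals(fit))

# Homogeneity of variances across groups: Bartlett's test
bartlett.test(y ~ g)
```

Large p-values from both tests give no evidence against the corresponding assumptions; small p-values suggest a transformation or an alternative test (e.g. Welch ANOVA or Kruskal-Wallis) may be needed.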
ANOVA from scratch
Generate Synthetic Data
Code
# Set seed for reproducibility
set.seed(123)

# Create synthetic data with 3 groups
group1 <- rnorm(10, mean = 5, sd = 1.5)  # Group 1: mean = 5
group2 <- rnorm(10, mean = 6, sd = 1.5)  # Group 2: mean = 6
group3 <- rnorm(10, mean = 7, sd = 1.5)  # Group 3: mean = 7

# Combine into a data frame
data <- data.frame(
  weight = c(group1, group2, group3),
  group  = factor(rep(c("Group1", "Group2", "Group3"), each = 10))
)

# View the first few rows
head(data)
ggside: The package ggside was designed to enable users to add metadata to their ggplots with ease.
patchwork: The goal of patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic.
gridExtra: Provides user-level functions built on the grid graphics system, notably to arrange multiple grid-based plots on a page.
rstatix: Provides a simple and intuitive pipe-friendly framework, coherent with the ‘tidyverse’ design philosophy, for performing basic statistical tests, including t-test, Wilcoxon test, ANOVA, Kruskal-Wallis and correlation analyses.
In this exercise, we will create a data frame with randomly generated yields for four treatments, six varieties, and four replications (96 observations in total).
Code
# Load dplyr for glimpse()
library(dplyr)

# Set the number of observations, treatments, replications, and varieties
n <- 1
treatments <- 4
replications <- 4
varieties <- 6

# Set seed for reproducibility
set.seed(1256)

# Create an empty data frame with columns for treatment, replication, variety, and yield
exp.df <- data.frame(
  Treatment   = character(),
  Replication = integer(),
  Variety     = character(),
  Yield       = numeric(),
  stringsAsFactors = FALSE
)

# Generate random data for each treatment, replication, and variety combination
for (t in 1:treatments) {
  for (r in 1:replications) {
    for (v in 1:varieties) {
      # Generate n random observations for the current combination
      obs <- rnorm(n, mean = 3 * t + 0.95 * r + 0.90 * v, sd = 0.25)
      # Add the observations to the data frame
      exp.df <- rbind(exp.df, data.frame(
        Treatment   = paste0("T", t),
        Replication = r,
        Variety     = paste0("V", v),
        Yield       = obs
      ))
    }
  }
}

# Glimpse the structure of the data frame
glimpse(exp.df)
One-way ANOVA
The one-way analysis of variance (ANOVA) is a statistical test used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.
The formula for the one-way ANOVA F-statistic is as follows:
\[ F = \frac{MSB}{MSW} \]

with

\[ MSB = \frac{SSB}{k - 1} = \frac{\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2}{k - 1} \qquad MSW = \frac{SSW}{N - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{N - k} \]

Where:
\(F\) is the F-statistic.
\(MSB\) is the Mean Square Between (variability between group means).
\(MSW\) is the Mean Square Within (variability within each group).
\(N\) is the total number of observations across all groups.
\(k\) is the number of groups, and \(n_i\) is the number of observations in group \(i\).
\(X_{ij}\) is the \(j^{th}\) observation in group \(i\).
\(\bar{X}_i\) is the mean of group \(i\), and \(\bar{X}\) is the grand mean.
The F-statistic is used to assess whether the variability between group means (\(MSB\)) is significantly larger than the variability within each group (\(MSW\)), indicating whether there are significant differences between at least two group means.
The output of an ANOVA test includes an F-statistic, which is a measure of the difference between the groups, and a p-value, which indicates the probability of obtaining the observed difference by chance. If the p-value is less than a chosen significance level (typically 0.05), then the null hypothesis, which states that there is no significant difference between the means of the groups, can be rejected.
ANOVA in R can be performed using the built-in aov() function. This function takes a formula as an argument, where the dependent variable is on the left side of the tilde (~) and the independent variables are on the right side, separated by +. We will use the dataset we created earlier to examine the main effect of treatment on yield.
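The table below can be produced with a call along these lines; this is a sketch rather than the original code, and the guarded block rebuilds a compact stand-in for exp.df only if the earlier data-generation step has not been run (so exact values may differ from the printed table):

```r
# Rebuild a compact stand-in for exp.df if it is not already in the workspace
# (same 4 treatments x 4 replications x 6 varieties = 96-row structure)
if (!exists("exp.df")) {
  set.seed(1256)
  exp.df <- expand.grid(Treatment = paste0("T", 1:4), Replication = 1:4,
                        Variety = paste0("V", 1:6))
  exp.df$Yield <- rnorm(96, mean = 3 * as.integer(exp.df$Treatment) +
                              0.95 * exp.df$Replication +
                              0.90 * as.integer(exp.df$Variety), sd = 0.25)
}

# One-way ANOVA: main effect of Treatment on Yield
anova.one <- aov(Yield ~ Treatment, data = exp.df)
anova(anova.one)
```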
Analysis of Variance Table
Response: Yield
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 3 1060.42 353.47 99.535 < 2.2e-16 ***
Residuals 92 326.71 3.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A nice and easy way to report the results of an ANOVA in R is with the report() function from the report package:
Code
report(anova.one)
The ANOVA (formula: Yield ~ Treatment) suggests that:
- The main effect of Treatment is statistically significant and large (F(3, 92)
= 99.53, p < .001; Eta2 = 0.76, 95% CI [0.70, 1.00])
Effect sizes were labelled following Field's (2013) recommendations.
Two-way ANOVA
Two-way ANOVA (analysis of variance), or factorial ANOVA, is a statistical method used to analyze the effects of two categorical independent variables (factors) on a continuous dependent variable. It tests for the main effects of each factor and their interaction effect.
In a two-way ANOVA, we consider the effects of two categorical independent variables on a continuous dependent variable. The general formula for the observed value \(Y_{ijk}\) in a two-way ANOVA is as follows:

\[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk} \]
Where:
- \(Y_{ijk}\) is the observed value for the \(i^{th}\) level of the first independent variable, the \(j^{th}\) level of the second independent variable, and the \(k^{th}\) observation.
- \(\mu\) is the overall mean.
- \(\alpha_i\) represents the effect of the \(i^{th}\) level of the first independent variable.
- \(\beta_j\) represents the effect of the \(j^{th}\) level of the second independent variable.
- \((\alpha \beta)_{ij}\) is the interaction effect between the \(i^{th}\) level of the first independent variable and the \(j^{th}\) level of the second independent variable.
- \(\epsilon_{ijk}\) is the random error term.
Now, let’s expand the terms for each effect:
Overall Mean (\(\mu\)):
\[ \mu = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} Y_{ijk}}{a \cdot b \cdot n} \]

Where:
- \(a\) is the number of levels in the first independent variable.
- \(b\) is the number of levels in the second independent variable.
- \(n\) is the number of observations in each cell.
Main Effects:
Main Effect of the First Independent Variable (\(\alpha_i\)): \[ \alpha_i = \frac{\sum_{j=1}^{b} \sum_{k=1}^{n} Y_{ijk}}{b \cdot n} - \mu \]
Main Effect of the Second Independent Variable (\(\beta_j\)): \[ \beta_j = \frac{\sum_{i=1}^{a} \sum_{k=1}^{n} Y_{ijk}}{a \cdot n} - \mu \]
In these formulas, \(Y_{ijk}\) represents the observed values, \(\mu\) is the overall mean, \(\alpha_i\) and \(\beta_j\) are the main effects, \((\alpha \beta)_{ij}\) is the interaction effect, and \(\epsilon_{ijk}\) is the random error term. These effects help explain the variability in the observed values due to the independent variables and their interaction. The goal of two-way ANOVA is to test the significance of these effects.
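As a small illustration of these definitions, the sketch below (made-up values in a balanced 2 × 2 layout with 2 replicates) computes the overall mean and both sets of main effects by hand; in a balanced design each set of main effects sums to zero:

```r
# Toy balanced two-way layout: a = 2, b = 2, n = 2 (values made up)
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), rep = 1:2)
d$Y <- c(4.0, 6.0, 5.0, 7.0, 4.4, 6.4, 5.4, 7.4)

mu    <- mean(d$Y)                    # overall mean
alpha <- tapply(d$Y, d$A, mean) - mu  # main effects of factor A
beta  <- tapply(d$Y, d$B, mean) - mu  # main effects of factor B

# In a balanced design the main effects in each set sum to zero
sum(alpha); sum(beta)
```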
Here is an example code for a two-way ANOVA with one dependent variable (Yield), and two independent variables (Treatment and Variety) with interaction effect:
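A sketch of that call is below; `Yield ~ Treatment * Variety` expands to the main effects plus the interaction (the formula `Yield ~ Treatment + Variety + Treatment:Variety` reported later is equivalent), and the guarded block rebuilds a compact stand-in for exp.df only if the earlier data-generation step has not been run:

```r
# Rebuild a compact stand-in for exp.df if needed (see the data-generation step)
if (!exists("exp.df")) {
  set.seed(1256)
  exp.df <- expand.grid(Treatment = paste0("T", 1:4), Replication = 1:4,
                        Variety = paste0("V", 1:6))
  exp.df$Yield <- rnorm(96, mean = 3 * as.integer(exp.df$Treatment) +
                              0.95 * exp.df$Replication +
                              0.90 * as.integer(exp.df$Variety), sd = 0.25)
}

# Two-way ANOVA with interaction: Treatment, Variety, and Treatment:Variety
anova.two <- aov(Yield ~ Treatment * Variety, data = exp.df)
anova(anova.two)
```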
From the ANOVA table we can conclude that the main effects of both Treatment and Variety are statistically significant.
Code
report(anova.two)
The ANOVA (formula: Yield ~ Treatment + Variety + Treatment:Variety) suggests
that:
- The main effect of Treatment is statistically significant and large (F(3, 72)
= 223.14, p < .001; Eta2 (partial) = 0.90, 95% CI [0.87, 1.00])
- The main effect of Variety is statistically significant and large (F(5, 72) =
26.79, p < .001; Eta2 (partial) = 0.65, 95% CI [0.53, 1.00])
- The interaction between Treatment and Variety is statistically not
significant and very small (F(15, 72) = 0.02, p > .999; Eta2 (partial) =
4.37e-03, 95% CI [0.00, 1.00])
Effect sizes were labelled following Field's (2013) recommendations.
Multiple Comparisons
After performing an ANOVA, if the overall F-test is significant, we may want to determine which groups differ significantly from each other in terms of the dependent variable. This can be done using post-hoc tests or multiple comparison tests.
When conducting a study with three or more groups and analyzing the data using ANOVA or a similar statistical test, you may encounter a significant difference among the group means. However, this result does not provide information about which specific pairs of group means differ significantly from each other. To identify these differences, you can use multiple comparison procedures.
Multiple comparisons refer to a set of statistical techniques that allow you to conduct pairwise comparisons between the means of different groups in a study. These comparisons are performed after a significant result in ANOVA or a similar test and help you identify which specific groups have significantly different means.
There are several ways to perform multiple comparisons, including pairwise t-tests, Tukey’s HSD test, Scheffe’s test and Bonferroni correction. Each procedure has its strengths and weaknesses, and the choice of which one to use depends on various factors, such as the number of groups being compared and the desired level of statistical significance.
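The effect of these corrections can be seen directly with R's p.adjust() function; the raw p-values below are made up for illustration:

```r
# Hypothetical raw p-values from four pairwise comparisons
raw.p <- c(0.001, 0.012, 0.030, 0.047)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p.adjust(raw.p, method = "bonferroni")

# Holm is a stepwise procedure that is uniformly less conservative
p.adjust(raw.p, method = "holm")
```

Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is why Holm is often preferred when a simple correction is wanted.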
Using multiple comparison procedures is crucial in statistical analysis as it helps you avoid making false conclusions by controlling the overall risk of making a Type I error. By identifying which specific pairs of group means are significantly different from each other, you can gain a more comprehensive understanding of your data, which can inform future research and decision-making.
Tukey’s Honestly Significant Difference (HSD) test
Tukey’s Honestly Significant Difference (HSD) test is a powerful statistical method that is commonly used in experimental research to compare multiple groups and determine if their means significantly differ from each other. One of the major advantages of the HSD test is that it controls the familywise error rate, which means it helps to reduce the likelihood of making a type I error (false positive) when comparing groups. This is achieved by providing a confidence interval that helps to identify which pairs of group means differ significantly from each other, while maintaining an appropriate level of statistical significance. Overall, Tukey’s HSD test is a reliable tool that can help researchers make accurate and informed decisions when analyzing experimental data.
In R, the TukeyHSD() function can be used to perform Tukey’s HSD test, which returns a table of pairwise comparisons between the groups, along with the adjusted p-values:
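A sketch of that call follows; the tukey.one object name matches the plot() call further down, and the guarded block rebuilds a stand-in for the data and one-way model only if the earlier steps have not been run:

```r
# Rebuild stand-ins for exp.df and anova.one if needed (see earlier steps)
if (!exists("exp.df")) {
  set.seed(1256)
  exp.df <- expand.grid(Treatment = paste0("T", 1:4), Replication = 1:4,
                        Variety = paste0("V", 1:6))
  exp.df$Yield <- rnorm(96, mean = 3 * as.integer(exp.df$Treatment) +
                              0.95 * exp.df$Replication +
                              0.90 * as.integer(exp.df$Variety), sd = 0.25)
}
if (!exists("anova.one")) anova.one <- aov(Yield ~ Treatment, data = exp.df)

# Tukey's HSD on the fitted one-way model: all pairwise Treatment comparisons
tukey.one <- TukeyHSD(anova.one)
tukey.one
```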
The table shows the differences between the groups and the 95% confidence intervals for each difference. If the confidence interval does not include zero, the difference between the groups is significant at the specified level.
Instead of printing the TukeyHSD results as a table, we can plot them:
Code
plot(tukey.one)
Pairwise t-test
The pairwise.t.test() function can be used to perform pairwise comparisons with Bonferroni or other multiple-comparison corrections. For example, to perform pairwise comparisons using the Bonferroni correction, we can use:
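The output below can be produced with a call along these lines (the guarded block rebuilds a compact stand-in for exp.df only if the earlier data-generation step has not been run, so exact p-values may differ):

```r
# Rebuild a compact stand-in for exp.df if needed (see the data-generation step)
if (!exists("exp.df")) {
  set.seed(1256)
  exp.df <- expand.grid(Treatment = paste0("T", 1:4), Replication = 1:4,
                        Variety = paste0("V", 1:6))
  exp.df$Yield <- rnorm(96, mean = 3 * as.integer(exp.df$Treatment) +
                              0.95 * exp.df$Replication +
                              0.90 * as.integer(exp.df$Variety), sd = 0.25)
}

# Pairwise t-tests between treatments with Bonferroni-adjusted p-values
pairwise.t.test(exp.df$Yield, exp.df$Treatment, p.adjust.method = "bonferroni")
```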
Pairwise comparisons using t tests with pooled SD
data: exp.df$Yield and exp.df$Treatment
T1 T2 T3
T2 3.2e-06 - -
T3 < 2e-16 9.2e-07 -
T4 < 2e-16 < 2e-16 6.0e-06
P value adjustment method: bonferroni
Box/Violin plots for between-subjects comparisons
We can create a nice-looking plot with the results of the ANOVA and post-hoc tests shown directly on the boxplots. We will use the ggbetweenstats() function of the ggstatsplot package:
install.packages("ggstatsplot")
Code
p1 <- ggstatsplot::ggbetweenstats(
  data = exp.df,
  x = Treatment,
  y = Yield,
  ylab = "Yield (t/ha)",
  type = "parametric",        # ANOVA or Kruskal-Wallis
  var.equal = TRUE,           # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
) +
  # add plot title
  ggtitle("Main Effect of Treatment") +
  theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14),
    # x- and y-axis text font size
    axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5),
    axis.text.x = element_text(size = 12)
  )
print(p1)
Code
# Load library
# library(ggstatsplot)
p2 <- ggbetweenstats(
  data = exp.df,
  x = Variety,
  y = Yield,
  ylab = "Yield (t/ha)",
  type = "parametric",        # ANOVA or Kruskal-Wallis
  var.equal = TRUE,           # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
) +
  # add plot title
  ggtitle("Main Effect of Variety") +
  theme(
    # center the plot title
    plot.title = element_text(hjust = 0.5),
    axis.line = element_line(colour = "gray"),
    # axis title font size
    axis.title.x = element_text(size = 14),
    # x- and y-axis text font size
    axis.text.y = element_text(size = 12, vjust = 0.5, hjust = 0.5),
    axis.text.x = element_text(size = 12)
  )
print(p2)
We plot these plots side by side using patchwork package:
install.packages("patchwork")
Code
# library(patchwork)
p1 + p2
ANOVA with rstatix
The anova_test() function in the rstatix package is an easy-to-use wrapper around Anova() and aov(). It makes ANOVA computation handy in R, and it is highly flexible: it can accept a model or a formula as input. Variables can also be specified as character vectors using the arguments dv, wid, between, within, and covariate.
One-way ANOVA test
Code
exp.df |> anova_test(Yield ~ Treatment)
ANOVA Table (type II tests)
Effect DFn DFd F p p<.05 ges
1 Treatment 3 92 99.535 8.81e-29 * 0.764
This tutorial explains how to conduct ANOVA analysis in R. ANOVA is a technique used to compare means across groups, popular in fields like business, psychology, medicine, and engineering. It helps identify differences between groups, which can inform decision-making and further research.
Remember the assumptions underlying ANOVA, including normality, homogeneity of variances, and independence. Conduct appropriate post hoc tests to determine which groups differ significantly. Consider the effect size when interpreting ANOVA results, and avoid making causal inferences. Mastering ANOVA analysis in R can help you better understand your data and make more informed decisions.