Adapted from here.
One-Way Analysis of Variance (ANOVA) is a method for comparing the means of \(a\) populations. This kind of problem arises in two different settings: comparing \(a\) treatments in a designed experiment, or comparing \(a\) existing populations from which independent random samples are drawn.
The assumptions required for One-Way ANOVA (independent observations, normally distributed errors, and a common variance across groups) are conveniently summarized in the following statistical model:
\[ X_{ij} = \mu_i + e_{ij} \]
where \(e_{ij}\) are i.i.d. \(N(0,\sigma^2)\), \(\;i=1,2,\ldots,a\), \(\;j=1,2,\ldots,n_i\).
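As a concrete illustration (an added sketch, not part of the original notes), the following R code simulates data from this model with \(a=3\) groups of \(n_i=5\) observations each; the group means and error standard deviation are hypothetical values chosen for the example.
set.seed(1)
a  = 3                       # number of groups
ni = 5                       # observations per group
mu = c(250, 262, 270)        # hypothetical group means
sigma = 5                    # common error standard deviation
# X_ij = mu_i + e_ij, with e_ij i.i.d. N(0, sigma^2)
sim = data.frame(group = factor(rep(1:a, each = ni)),
                 x     = rep(mu, each = ni) + rnorm(a*ni, mean = 0, sd = sigma))
head(sim)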
Example: Tests were conducted to compare three top brands of golf balls for mean distance traveled when struck by a driver. A robotic golfer was employed with a driver to hit a random sample of 5 golf balls of each brand in a random sequence. Distance traveled, in yards, for each hit is shown in the table below.
Brand A | Brand B | Brand C |
---|---|---|
251.2 | 263.2 | 269.7 |
245.1 | 262.9 | 263.2 |
248.0 | 265.0 | 277.5 |
251.1 | 254.5 | 267.4 |
260.5 | 264.3 | 270.5 |
Suppose we want to compare the mean distance traveled by the three brands of golf balls based on the three samples. One-Way ANOVA provides a method to accomplish this.
The hypotheses of interest in One-Way ANOVA are:
\[ \begin{aligned} H_0&:\mu_1=\mu_2=\ldots=\mu_a\\ H_A&:\mu_i\neq\mu_j\text{ for some $i$, $j$} \end{aligned} \]
The total variability in the response, \(X_{ij}\), is partitioned into between-treatment and within-treatment (error) components. When these component values are squared and summed over all the observations, terms called sums of squares are produced. There is an additive relation which states that the total sum of squares equals the sum of the treatment and error sums of squares.
\[ SS_{Total}=SST+SSE \]
The notations \(SST\), \(SSTr\), \(SSTrt\), \(SS_{Tr}\), \(SStreatment\), and \(SS(Between)\) are synonymous for “treatment sum of squares”. The abbreviations \(SSE\), \(SSerror\), and \(SS(Within)\) are synonymous for “error sum of squares”. Associated with each sum of squares is its degrees of freedom. The total degrees of freedom is \(n-1\), the treatment degrees of freedom is \(a-1\), and the error degrees of freedom is \(n-a\). The degrees of freedom satisfy an additive relationship, as did the sums of squares.
\[ n-1=(a-1)+(n-a) \]
Scaled versions of the treatment and error sums of squares (the sums of squares divided by their associated degrees of freedom) are known as mean squares:
\[ \begin{aligned} MST&=SST/(a-1)\\ MSE&=SSE/(n-a) \end{aligned} \]
\(MST\) and \(MSE\) are both estimates of the error variance, \(\sigma^2\). \(MSE\) is always unbiased (its mean equals \(\sigma^2\)), while \(MST\) is unbiased only when the null hypothesis is true. When the alternative \(H_A\) is true, \(MST\) will tend to be larger than \(MSE\). The ratio of the mean squares is \(F=MST/MSE\). This should be close to \(1\) when \(H_0\) is true, while large values of \(F\) provide evidence against \(H_0\). The null hypothesis \(H_0\) is rejected for large values of the observed test statistic \(F_{obs}\).
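A small simulation sketch (an addition for intuition, not from the original notes) illustrates this behavior: when all group means are equal, the simulated \(F\) ratios stay close to 1, and shifting one group mean makes them systematically larger. The helper function sim.F below is hypothetical.
set.seed(2)
# simulate one data set from the one-way model and return its F statistic
sim.F = function(mu, ni = 5, sigma = 5) {
  x = rep(mu, each = ni) + rnorm(length(mu) * ni, sd = sigma)
  g = factor(rep(seq_along(mu), each = ni))
  summary(aov(x ~ g))[[1]][1, "F value"]
}
# under H0 (equal means) the F ratios are typically near 1
mean(replicate(2000, sim.F(mu = c(260, 260, 260))))
# under HA (one mean shifted) the F ratios are much larger on average
mean(replicate(2000, sim.F(mu = c(260, 260, 275))))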
ANOVA calculations are conveniently displayed in the tabular form shown below, which is known as an ANOVA table.
Source | \(SS\) | \(df\) | \(MS\) | \(F_{obs}\) | \(p\)-value |
---|---|---|---|---|---|
Treatments | \(SST\) | \(a\!-\!1\) | \(MST\) | \(MST/MSE\) | \(P[F\geq F_{obs}]\) |
Error | \(SSE\) | \(n\!-\!a\) | \(MSE\) | ||
Total | \(SS_{Total}\) | \(n\!-\!1\) |
Here are the formulas for the sums of squares. We will see that there are simpler formulas when we know the sample means and sample variances for each of the \(a\) groups.
\[ \begin{aligned} SS_{Total}&=\sum_{i=1}^a\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{..})^2\\ SST&=\sum_{i=1}^a\sum_{j=1}^{n_i}(\bar{x}_{i.}-\bar{x}_{..})^2=\sum_{i=1}^an_i(\bar{x}_{i.}-\bar{x}_{..})^2\\ SSE&=\sum_{i=1}^a\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i.})^2=\sum_{i=1}^a(n_i-1)s_i^2 \end{aligned} \]
The test statistic is
\[ F_{obs}=\frac{SST/(a-1)}{SSE/(n-a)} \]
and the \(p\)-value is \(P(F\geq F_{obs})\), where \(F\) has an \(F\) distribution with \(a-1\) and \(n-a\) degrees of freedom when \(H_0\) is true.
Consider again the tests conducted to compare three brands of golf balls for mean distance traveled when struck by a driver. Again, the distances traveled, in yards, for each hit are shown below.
Brand A | Brand B | Brand C |
---|---|---|
251.2 | 263.2 | 269.7 |
245.1 | 262.9 | 263.2 |
248.0 | 265.0 | 277.5 |
251.1 | 254.5 | 267.4 |
260.5 | 264.3 | 270.5 |
Here we’ll compare the mean distance traveled by the different brands of golf balls using ANOVA. First, compute the sample mean, sample variance, and sample size for each brand:
\(i\) | \(\bar{x}_{i.}\) | \(s_i^2\) | \(n_i\) |
---|---|---|---|
1 | 251.18 | 33.487 | 5 |
2 | 261.98 | 18.197 | 5 |
3 | 269.66 | 27.253 | 5 |
Source | \(SS\) | \(df\) | \(MS\) | \(F_{obs}\) | \(p\)-value |
---|---|---|---|---|---|
Treatments | \(861.89\) | \(2\) | \(430.94\) | \(16.378\) | \(0.0003715\) |
Error | \(315.75\) | \(12\) | \(26.31\) | ||
Total | \(1177.64\) | \(14\) |
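Using the shortcut formulas above, these table entries can be reproduced in R directly from the group summaries (a quick check added here, not shown in the original notes):
xbar = c(251.18, 261.98, 269.66)   # sample means
s2   = c(33.487, 18.197, 27.253)   # sample variances
ni   = c(5, 5, 5)                  # sample sizes
a = length(ni); n = sum(ni)
grand = sum(ni * xbar) / n                   # grand mean
SST = sum(ni * (xbar - grand)^2)             # treatment sum of squares
SSE = sum((ni - 1) * s2)                     # error sum of squares
MST = SST / (a - 1); MSE = SSE / (n - a)
c(SST = SST, SSE = SSE, MST = MST, MSE = MSE, F = MST/MSE)
## approximately: SST = 861.89, SSE = 315.75, MST = 430.94, MSE = 26.31, F = 16.38
The \(p\)-value then comes from the \(F\) distribution with \(2\) and \(12\) degrees of freedom: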
1-pf(16.37,2,12)
\(\approx3.723\times 10^{-4}\)

The full analysis can also be carried out in R:
library(tidyverse)
# create data frame
golf = data.frame(A = c(251.2,245.1,248.0,251.1,260.5),
B = c(263.2,262.9,265.0,254.5,264.3),
C = c(269.7,263.2,277.5,267.4,270.5))
print(golf)
## A B C
## 1 251.2 263.2 269.7
## 2 245.1 262.9 263.2
## 3 248.0 265.0 277.5
## 4 251.1 254.5 267.4
## 5 260.5 264.3 270.5
# convert to long format
golf.long = golf %>% pivot_longer(1:3,"brand",values_to="distance") %>% arrange(brand)
print(golf.long,n=Inf)
## # A tibble: 15 x 2
## brand distance
## <chr> <dbl>
## 1 A 251.
## 2 A 245.
## 3 A 248
## 4 A 251.
## 5 A 260.
## 6 B 263.
## 7 B 263.
## 8 B 265
## 9 B 254.
## 10 B 264.
## 11 C 270.
## 12 C 263.
## 13 C 278.
## 14 C 267.
## 15 C 270.
# make ANOVA table
aov.golf = aov(distance ~ brand, data=golf.long)
summary(aov.golf)
## Df Sum Sq Mean Sq F value Pr(>F)
## brand 2 861.9 430.9 16.38 0.000372 ***
## Residuals 12 315.7 26.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Check if assumptions are violated:
par(mfrow=c(1,2))
plot(aov.golf,which=1:2,ask=F)
Tukey’s Honest Significant Difference (HSD) method is best for all pairwise comparisons. It is easy to apply by calling the TukeyHSD() function on the result of an aov() model.
TukeyHSD(aov.golf, conf.level=.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = distance ~ brand, data = golf.long)
##
## $brand
## diff lwr upr p adj
## B-A 10.80 2.1448757 19.45512 0.0153590
## C-A 18.48 9.8248757 27.13512 0.0002718
## C-B 7.68 -0.9751243 16.33512 0.0841836
plot(TukeyHSD(aov.golf, conf.level=.95))
The Bonferroni correction is best for making a smaller set of planned comparisons. For example, suppose Brand A is a control, Brand B and Brand C are new prototype ball designs, and we are only interested in comparing A with B and A with C (this isn’t the best example since there are only 3 groups, so there aren’t many possible comparisons; in general you only want to test a small fraction of the possible tests with a Bonferroni correction).

Probably the easiest way is to run pairwise.t.test() with p.adjust.method="none" and just manually compare the \(p\)-values to \(\alpha/m\), where \(m\) is the number of tests you planned to run.
pairwise.t.test(golf.long$distance, golf.long$brand, p.adjust.method="none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: golf.long$distance and golf.long$brand
##
## A B
## B 0.006 -
## C 1e-04 0.036
##
## P value adjustment method: none
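For example, with the two planned comparisons above (A vs. B and A vs. C) and \(\alpha=0.05\), each unadjusted \(p\)-value is compared to \(\alpha/m=0.05/2=0.025\); a short sketch using the printed \(p\)-values:
alpha = 0.05
m = 2                                  # number of planned comparisons: A-B and A-C
p.planned = c(AB = 0.006, AC = 1e-04)  # unadjusted p-values from the table above
p.planned < alpha / m                  # both planned comparisons are significant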
The assumptions of the usual one-way ANOVA are:
\[ X_{ij} = \mu_i + e_{ij} \]
where \(e_{ij}\) are i.i.d. \(N(0,\sigma^2)\), \(\;i=1,2,\ldots,a\), \(\;j=1,2,\ldots,n_i\).
If there is strong evidence of non-normality in the residuals, then one way ANOVA may not give a valid \(p\)-value. In such cases, the Kruskal-Wallis test is useful. This is the analogue of the Wilcoxon test for more than two populations.
The process is rather complicated, and you will NOT be expected to remember this, but in case you are curious, here is the equation:
\[ H=(n-1)\frac{\sum_{i=1}^a n_i(\bar{r}_{i\cdot}-\bar{r})^2}{\sum_{i=1}^a\sum_{j=1}^{n_i}(r_{ij}-\bar{r})^2} \]
where \(r_{ij}\) is the rank of \(x_{ij}\) in the entire sample (pooled across all groups), \(\bar{r}_{i.}\) is the mean rank of the observations in group \(i\), and \(\bar{r}\) is the mean of all the \(r_{ij}\). Under the null hypothesis, \(H\) has approximately a \(\chi^2\) distribution with \(a-1\) degrees of freedom.
In R, this is done with:
kruskal.test(distance~brand, data=golf.long)
##
## Kruskal-Wallis rank sum test
##
## data: distance by brand
## Kruskal-Wallis chi-squared = 10.864, df = 2, p-value = 0.004373
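As a sanity check (an added sketch, not in the original notes), the \(H\) statistic can be computed directly from the rank formula above; it should reproduce the chi-squared value reported by kruskal.test() up to rounding:
r = rank(golf.long$distance)               # ranks over the pooled sample (midranks for ties)
rbar = mean(r)                             # overall mean rank
ri = tapply(r, golf.long$brand, mean)      # mean rank within each brand
ni = table(golf.long$brand)                # group sizes
n = length(r)
H = (n - 1) * sum(ni * (ri - rbar)^2) / sum((r - rbar)^2)
H                                          # should be about 10.86
pchisq(H, df = 3 - 1, lower.tail = FALSE)  # p-value, about 0.004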
Similar to other nonparametric methods, the lack of a normality assumption in the Kruskal-Wallis test means it’s more widely applicable than ANOVA, but at the cost of a bit of power. This is a common theme in statistics. When you can make assumptions about the distribution of the data, you have more information, and can therefore detect smaller effect sizes, giving you more power. This is why parametric methods like ANOVA are so popular. However, when those assumptions are found to be violated, you settle for a nonparametric method; even though it gives slightly less power, it’s the best you have.
Example: Tests were conducted to assess the effects of two factors, engine type and propellant type, on propellant burn rate in fired missiles. Three engine types and four propellant types were tested.
Twenty-four missiles were selected from a large production batch. The missiles were randomly split into three groups of size eight. The first group of eight had engine type 1 installed, the second group had engine type 2, and the third group received engine type 3.
Each group of eight was randomly divided into four groups of two. The first such group was assigned propellant type 1, the second group was assigned propellant type 2, and so on. Data on burn rate were collected, as follows:
\[ \begin{array}{r|cccc} \text{Engine}&\multicolumn{4}{c}{\text{Propellant type}}\\ \text{type}&1&2&3&4\\ \hline 1&34.0&30.1&29.8&29.0\\ &32.7&32.8&26.7&28.9\\ \hline 2&32.0&30.2&28.7&27.6\\ &33.2&29.8&28.1&27.8\\ \hline 3&28.4&27.3&29.7&28.8\\ &29.3&28.9&27.3&29.1\\ \end{array} \]
We want to determine whether either factor, engine type (factor A) or propellant type (factor B), has a significant effect on burn rate.
Let \(Y_{ijk}\) denote the \(k\)-th observation at the \(i\)-th level of factor A and the \(j\)-th level of factor B. The two-factor model (with interaction) is
\[ Y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+\epsilon_{ijk} \]
where \(\mu\) is the overall mean, \(\alpha_i\) is the effect of the \(i\)-th level of factor A, \(\beta_j\) is the effect of the \(j\)-th level of factor B, \(\gamma_{ij}\) is the interaction effect, and the errors \(\epsilon_{ijk}\) are i.i.d. \(N(0,\sigma^2)\). The effects satisfy the sum constraints \(\sum_i\alpha_i=0\), \(\sum_j\beta_j=0\), and \(\sum_i\gamma_{ij}=\sum_j\gamma_{ij}=0\).
Here we have included an interaction term, which is fairly common in this method. The interaction term allows one variable to affect the influence of the other variable. Without the interaction, whether A is 1, 2, or 3 would not be allowed to affect in any way the ability of variable B to predict the outcome. In other words, their effects on the outcome would be purely additive.
The mean of \(Y_{ijk}\) is
\[ E[Y_{ijk}]=\mu+\alpha_i+\beta_j+\gamma_{ij} \]
Note that the sum constraints ensure there is a unique correspondence between the parameters (the \(\alpha\)’s, \(\beta\)’s, \(\gamma\)’s and \(\mu\)) and the means of the random variables (the cell means \(\mu_{ij}=E[Y_{ijk}]\)).
In the two-way ANOVA model, we compute a sum of squares for each factor, as well as for the interaction. The formulas are slightly more complicated than in the one-way case, but the intuition is basically the same: we want to compare the variation within cells to the variation between them.

Again, we are only considering the balanced case: factor A has \(I\) levels, factor B has \(J\) levels, and each of the \(IJ\) factor combinations has the same number \(K\) of replicates. Below are the sum-of-squares formulas for main effect A, main effect B, interaction effect AB, error, and total:
\[ \begin{aligned} SSA&=JK\sum_i(\bar{y}_{i..}-\bar{y}_{...})^2&&\text{df: }I-1\\ SSB&=IK\sum_j(\bar{y}_{.j.}-\bar{y}_{...})^2&&\text{df: }J-1\\ SSAB&=K\sum_{i,j}(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})^2&&\text{df: }(I-1)(J-1)\\ SSE&=\sum_{i,j,k}(y_{ijk}-\bar{y}_{ij.})^2&&\text{df: }IJ(K-1)\\ SS_{Total}&=\sum_{i,j,k}(y_{ijk}-\bar{y}_{...})^2&&\text{df: }IJK-1\\ \end{aligned} \]
where the dot has the usual meaning of taking the mean over that index. These formulas are significantly more complex, so we do not expect you to compute them by hand, but you should understand their definitions and be able to recognize them.
Each component SS can be normalized by dividing by its degrees of freedom, giving a mean square, and then each effect produces its own \(F\)-statistic by dividing its mean square by the \(MSE\). To save time, we are not going to show all of this manually; we’re just going to make the ANOVA table with R.
# read the burn-rate data (two rows per engine type; columns are propellant types 1-4)
missiles.long = read.csv(text="
34.0,30.1,29.8,29.0
32.7,32.8,26.7,28.9
32.0,30.2,28.7,27.6
33.2,29.8,28.1,27.8
28.4,27.3,29.7,28.8
29.3,28.9,27.3,29.1", header=F) %>%
  # reshape to long format and add engine / propellant factor labels
  pivot_longer(1:4,names_to=NULL,values_to="rate") %>%
  mutate(engine = as.character(rep(1:3,each=8)),
         propellant = as.character(rep(1:4,length.out=24)))
head(missiles.long,12)
## # A tibble: 12 x 3
## rate engine propellant
## <dbl> <chr> <chr>
## 1 34 1 1
## 2 30.1 1 2
## 3 29.8 1 3
## 4 29 1 4
## 5 32.7 1 1
## 6 32.8 1 2
## 7 26.7 1 3
## 8 28.9 1 4
## 9 32 2 1
## 10 30.2 2 2
## 11 28.7 2 3
## 12 27.6 2 4
aov.missiles = aov(rate~engine*propellant, data=missiles.long)
summary(aov.missiles)
## Df Sum Sq Mean Sq F value Pr(>F)
## engine 2 14.52 7.262 5.844 0.01690 *
## propellant 3 40.08 13.361 10.753 0.00102 **
## engine:propellant 6 22.16 3.694 2.973 0.05117 .
## Residuals 12 14.91 1.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
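As a check on the formulas above (an added sketch, not in the original notes), the main-effect, interaction, and error sums of squares can be reproduced from the cell and marginal means; the values should agree with the aov() table up to rounding:
y = missiles.long$rate
I = 3; J = 4; K = 2                                   # levels of A and B, replicates per cell
yi   = tapply(y, missiles.long$engine, mean)          # engine (A) means
yj   = tapply(y, missiles.long$propellant, mean)      # propellant (B) means
yij  = tapply(y, list(missiles.long$engine, missiles.long$propellant), mean)  # cell means
ybar = mean(y)                                        # grand mean
SSA  = J*K*sum((yi - ybar)^2)                                        # about 14.52
SSB  = I*K*sum((yj - ybar)^2)                                        # about 40.08
SSAB = K*sum((sweep(sweep(yij, 1, yi), 2, yj) + ybar)^2)             # about 22.16
SSE  = sum((y - yij[cbind(missiles.long$engine, missiles.long$propellant)])^2)  # about 14.91
c(SSA = SSA, SSB = SSB, SSAB = SSAB, SSE = SSE)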
Observations: both main effects are significant at the 0.05 level (engine: \(p\approx0.017\); propellant: \(p\approx0.001\)), while the engine:propellant interaction is borderline (\(p\approx0.051\)).
Check residuals just in case.
par(mfrow=c(1,2),cex=.6)
plot(aov.missiles,which=1:2,ask=F)
This can be easily extended to \(N\) factors. Here we show one final example (only using R) of a 3-way factorial ANOVA analysis.
The data represent crop yields from a number of plots on a research farm. Yields are measured for different irrigation treatments (2 levels: irrigated or control), sowing density (3 levels: low, medium, high), and fertilizer application (3 levels: N, P, NP).
Below is the data:
crops = read_table("https://raw.githubusercontent.com/shifteight/R-lang/master/TRB/data/splityield.txt")
str(crops)
## spec_tbl_df [72 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ yield : num [1:72] 90 95 107 92 89 92 81 92 93 80 ...
## $ block : chr [1:72] "A" "A" "A" "A" ...
## $ irrigation: chr [1:72] "control" "control" "control" "control" ...
## $ density : chr [1:72] "low" "low" "low" "medium" ...
## $ fertilizer: chr [1:72] "N" "P" "NP" "N" ...
## - attr(*, "spec")=
## .. cols(
## .. yield = col_double(),
## .. block = col_character(),
## .. irrigation = col_character(),
## .. density = col_character(),
## .. fertilizer = col_character()
## .. )
head(crops,18)
## # A tibble: 18 x 5
## yield block irrigation density fertilizer
## <dbl> <chr> <chr> <chr> <chr>
## 1 90 A control low N
## 2 95 A control low P
## 3 107 A control low NP
## 4 92 A control medium N
## 5 89 A control medium P
## 6 92 A control medium NP
## 7 81 A control high N
## 8 92 A control high P
## 9 93 A control high NP
## 10 80 A irrigated low N
## 11 87 A irrigated low P
## 12 100 A irrigated low NP
## 13 121 A irrigated medium N
## 14 110 A irrigated medium P
## 15 119 A irrigated medium NP
## 16 78 A irrigated high N
## 17 98 A irrigated high P
## 18 122 A irrigated high NP
Running 3-way ANOVA with all interaction effects:
aov.crops = aov(yield~irrigation*density*fertilizer, data=crops)
summary(aov.crops)
## Df Sum Sq Mean Sq F value Pr(>F)
## irrigation 1 8278 8278 59.575 2.81e-10 ***
## density 2 1758 879 6.328 0.00340 **
## fertilizer 2 1977 989 7.116 0.00181 **
## irrigation:density 2 2747 1374 9.885 0.00022 ***
## irrigation:fertilizer 2 953 477 3.431 0.03956 *
## density:fertilizer 4 305 76 0.549 0.70082
## irrigation:density:fertilizer 4 235 59 0.422 0.79183
## Residuals 54 7503 139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Check residuals just in case.
par(mfrow=c(1,2),cex=.6)
plot(aov.crops,which=1:2,ask=F)