Main notes
Simplest example: 1 random effect
library(tidyverse)
# table copied from loom example in second link
looms = read.table(text="1 98 97 99 96
2 91 90 93 92
3 96 95 97 95
4 95 96 99 98") %>%
column_to_rownames("V1") %>%
t %>%
as.data.frame %>%
remove_rownames %>%
pivot_longer(1:4,names_to="loom",values_to="strength") %>%
arrange(loom)
print(looms,n=Inf)
## # A tibble: 16 x 2
## loom strength
## <chr> <int>
## 1 1 98
## 2 1 97
## 3 1 99
## 4 1 96
## 5 2 91
## 6 2 90
## 7 2 93
## 8 2 92
## 9 3 96
## 10 3 95
## 11 3 97
## 12 3 95
## 13 4 95
## 14 4 96
## 15 4 99
## 16 4 98
library(lme4)
lmer.looms = lmer(strength ~ (1 | loom), data=looms)
summary(lmer.looms)
## Linear mixed model fit by REML ['lmerMod']
## Formula: strength ~ (1 | loom)
## Data: looms
##
## REML criterion at convergence: 63.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.38018 -0.57260 -0.04342 0.82574 1.52491
##
## Random effects:
## Groups Name Variance Std.Dev.
## loom (Intercept) 6.958 2.638
## Residual 1.896 1.377
## Number of obs: 16, groups: loom, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 95.438 1.363 70.01
fixef(lmer.looms)
## (Intercept)
## 95.4375
ranef(lmer.looms)
## $loom
## (Intercept)
## 1 1.9309741
## 2 -3.6864051
## 3 0.2925718
## 4 1.4628591
##
## with conditional variances for "loom"
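These random-effect predictions (BLUPs) are shrunken versions of the per-loom deviations from the grand mean: for a balanced one-way design, each predicted effect is \((n\sigma^2_{loom}/(n\sigma^2_{loom}+\sigma^2))(\bar y_i - \bar y)\). As a sketch in base R, we can reproduce them by hand from the raw data and the (rounded) variance components in the summary above:

```r
# strength measurements for each loom, from the table above
looms.list = list(`1`=c(98,97,99,96), `2`=c(91,90,93,92),
                  `3`=c(96,95,97,95), `4`=c(95,96,99,98))
grand.mean = mean(unlist(looms.list))  # 95.4375, the fixed intercept
sig2.loom = 6.958                      # loom variance, from summary() above
sig2.res  = 1.896                      # residual variance, from summary() above
n = 4                                  # observations per loom
# shrinkage factor: between-loom variance as a fraction of the
# total variance of a loom mean
shrink = n*sig2.loom / (n*sig2.loom + sig2.res)
blups = sapply(looms.list, function(y) shrink * (mean(y) - grand.mean))
round(blups, 4)
```

These agree with the `ranef()` output above up to the rounding of the variance components.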
confint(profile(lmer.looms))
## 2.5 % 97.5 %
## .sig01 1.1293759 5.748927
## .sigma 0.9672533 2.184005
## (Intercept) 92.4393299 98.435668
par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.looms),resid(lmer.looms),main="Resid v. fitted")
qqnorm(resid(lmer.looms))
# getting equivalent ANOVA table by running anova() of fixed-effect version
anova(aov(strength ~ loom, data=looms))
## Analysis of Variance Table
##
## Response: strength
## Df Sum Sq Mean Sq F value Pr(>F)
## loom 3 89.188 29.7292 15.681 0.0001878 ***
## Residuals 12 22.750 1.8958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
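For a balanced one-way random-effects design, the variance components can be recovered from this ANOVA table by the classical method-of-moments identities \(\hat\sigma^2 = MS_E\) and \(\hat\sigma^2_{loom} = (MS_{loom} - MS_E)/n\), where \(n = 4\) is the number of observations per loom. A quick check in base R:

```r
ms.loom  = 29.7292  # mean square for loom, from the ANOVA table
ms.error = 1.8958   # residual mean square
n = 4               # observations per loom (balanced design)
sig2.res  = ms.error                  # residual variance: 1.896
sig2.loom = (ms.loom - ms.error)/n    # loom variance: 6.958
c(loom = sig2.loom, residual = sig2.res)
```

Both values match the `Random effects` section of the `lmer()` summary above.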
Example 2: mixed effect model
This example adapted from here.
This dataset was collected from an experiment on how someone’s pitch (i.e., the frequency of their voice) changed depending on politeness and scenario. Subjects were asked to imagine a certain scenario and then to read a line in either a formal or an informal register. From the original study:
In the design that we used, there’s an additional source of non-independence that needs to be accounted for: We had different items. One item, for example, was an “asking for a favor” scenario. Here, subjects had to imagine asking a professor for a favor (polite condition), or asking a peer for a favor (informal condition). Another item was an “excusing for coming too late” scenario, which was similarly divided between polite and informal. In total, there were 7 such different items.
Below are two plots showing the mean pitch for different subjects and also for different items (scenarios) that were provided.
In the original paper, the researchers decided to fit politeness and sex as fixed effects (because they represent two specific levels of interest that can be treated as having fixed, constant values over the entire population), and to fit the subjects and items as random effects (because these represent a sample of non-specific levels whose particular values are NOT of interest, and which are drawn from a group of many other possible subjects/items). They arrived at a model equation like this:
pitch ~ politeness + sex + (1 | subject) + (1 | item) + e
This is called a mixed effects model due to the presence of both random and fixed effects. The flexibility of mixed effects model structures allows you to more accurately represent the experimental design, and thus the variance structure, of the data.
Boxplots showing the breakdown of mean pitch by politeness and sex.
politeness = read_csv("https://pastebin.com/raw/SC6Likfa")
lmer.polite = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness)
summary(lmer.polite)
## Linear mixed model fit by REML ['lmerMod']
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
## Data: politeness
##
## REML criterion at convergence: 775.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2591 -0.6236 -0.0772 0.5388 3.4795
##
## Random effects:
## Groups Name Variance Std.Dev.
## scenario (Intercept) 219.5 14.81
## subject (Intercept) 615.6 24.81
## Residual 645.9 25.41
## Number of obs: 83, groups: scenario, 7; subject, 6
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 256.846 16.116 15.938
## attitudepol -19.721 5.584 -3.532
## genderM -108.516 21.013 -5.164
##
## Correlation of Fixed Effects:
## (Intr) atttdp
## attitudepol -0.173
## genderM -0.652 0.004
ranef(lmer.polite)
## $scenario
## (Intercept)
## 1 -13.506509
## 2 6.582894
## 3 11.407831
## 4 20.629415
## 5 -1.936030
## 6 -12.173903
## 7 -11.003698
##
## $subject
## (Intercept)
## F1 -13.907651
## F2 10.419110
## F3 3.488541
## M3 28.382033
## M4 5.378484
## M7 -33.760517
##
## with conditional variances for "scenario" "subject"
fixef(lmer.polite)
## (Intercept) attitudepol genderM
## 256.84627 -19.72111 -108.51635
confint(profile(lmer.polite))
## 2.5 % 97.5 %
## .sig01 6.864421 29.99058
## .sig02 11.420516 42.73985
## .sigma 21.599106 30.05092
## (Intercept) 226.084997 287.60823
## attitudepol -30.739625 -8.69669
## genderM -149.340247 -67.68308
Unfortunately, computing \(p\)-values becomes slightly more complicated, and there are different approaches. One of the easier approaches is a likelihood ratio test: we fit two versions of the model, one with the term we want to test and one without, and then use the anova() function to compare the two. For lmer() fits, make sure to set REML=FALSE in both fits (two fits with REML=TRUE cannot be compared against each other, since REML likelihoods are not comparable across models with different fixed effects).
For this example, we wish to find if politeness affects pitch, so we can compare the following models against each other:
lmer.polite.reduced = lmer(frequency~gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)
lmer.polite.full = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)
anova(lmer.polite.reduced,lmer.polite.full)
## Data: politeness
## Models:
## lmer.polite.reduced: frequency ~ gender + (1 | subject) + (1 | scenario)
## lmer.polite.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmer.polite.reduced 5 816.72 828.81 -403.36 806.72
## lmer.polite.full 6 807.10 821.61 -397.55 795.10 11.618 1 0.0006532
##
## lmer.polite.reduced
## lmer.polite.full ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This could be interpreted in the following way:
“… politeness significantly affected pitch (\(\chi^2_1=11.62\), \(p=0.00065\)), lowering it by about 19.7 Hz \(\pm\) 5.6 (standard errors) …”
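The \(\chi^2\) statistic in the anova() table is just the difference in deviances (i.e., \(-2\log\lambda\)), and the \(p\)-value is its upper tail probability under \(\chi^2_1\). We can reproduce both by hand from the table values:

```r
dev.reduced = 806.72   # deviance of the reduced model, from the anova() table
dev.full    = 795.10   # deviance of the full model
chisq = dev.reduced - dev.full   # 11.62, matching the Chisq column
pval = pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, p = pval)
```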
par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.polite),resid(lmer.polite),main="Resid v. fitted")
qqnorm(resid(lmer.polite))
Some other examples to briefly scan over
- Example using glmer() to fit a mixed effects logistic regression model to predict …
- Nature paper discussing the flexibility of mixed effects models
- Example of using lmer() to analyze the famous sleep deprivation dataset, which has a longitudinal component
Brief detour: Likelihood Ratio Test
largely adapted from source
The likelihood ratio test is a broad and extremely versatile method of testing hypotheses.
Define null hypothesis \(H_{0}: \boldsymbol{\theta} \in \Theta^{\star}\) (meaning the parameter \(\theta\) (which is allowed to be a vector) is in the restricted (i.e. nested) set \(\Theta^{\star}\)) against the alternative hypothesis \(H_{1}: \boldsymbol{\theta} \in \Theta \backslash \Theta^{\star}\) (meaning \(\theta\) is allowed to be in the more general set \(\Theta\backslash \Theta^{\star}\)).
Then, define the following likelihood ratio test statistic where \(L(\boldsymbol{\theta} \mid \boldsymbol{y})\) represents the likelihood function for \(\theta\).
\[ \lambda(\boldsymbol{y})=\frac{\max _{\boldsymbol{\theta} \in \Theta^{\star}} L(\boldsymbol{\theta} \mid \boldsymbol{y})}{\max _{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta} \mid \boldsymbol{y})} \]
We can use this to define a test where we choose some value \(a\) such that rejecting a true null hypothesis when \(\lambda(\boldsymbol{Y})\leq a\) has probability \(\alpha\). In other words, we choose some threshold critical value \(a\) that controls the type I error rate at \(\alpha\)
\[ P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=\alpha \]
If we let \(\hat{\boldsymbol\theta}\) denote the maximum likelihood estimate of \(\boldsymbol{\theta}\) and let \(\widehat{\boldsymbol{\theta}}_{0}\) denote the value of \(\boldsymbol\theta\) which maximizes the likelihood over all values of \(\boldsymbol{\theta}\) in \(\Theta^{\star}\), then we may write
\[ \lambda(\boldsymbol{y})=\frac{L(\hat{\boldsymbol{\theta}}_{0} \mid \boldsymbol{y})}{L(\hat{\boldsymbol{\theta}} \mid \boldsymbol{y})} \]
The quantity \(\widehat{\boldsymbol\theta}_{0}\) is called the restricted maximum likelihood estimate of \(\boldsymbol\theta\) under \(H_{0}\) (restricted here means maximized over the null set \(\Theta^{\star}\); this is unrelated to the REML criterion used by lmer()).
Example 1: Normal distribution hypothesis test
Suppose that \(Y_{i} \stackrel{i.i.d}{\sim}{\mathcal{N}}\left(\mu, \sigma^{2}\right)\) and consider testing \(H_{0}: \mu=\mu_{0}\) against \(H_{1}: \mu \neq \mu_{0}\). Then the likelihood is
\[ L\left(\mu, \sigma^{2} \mid \boldsymbol{y}\right)=\left(2 \pi \sigma^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right\} \]
The maximum likelihood estimate of \(\boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}}\) is \(\widehat{\boldsymbol{\theta}}=\left(\widehat{\mu}, \widehat{\sigma^{2}}\right)^{\mathrm{T}}\), where \(\widehat{\mu}=\bar{y}\) and \(\widehat{\sigma}^{2}=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} / n\).
Similarly, the restricted maximum likelihood estimate of \(\boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}}\) under \(H_{0}\) is \(\widehat{\boldsymbol{\theta}}_{0}=\left(\widehat{\mu}_{0}, \widehat{\sigma^{2}}_{0}\right)^{\mathrm{T}}\), where \(\widehat{\mu}_{0}=\mu_{0}\) and \(\widehat{\sigma}_{0}^{2}=\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} / n .\)
Thus, the generalized likelihood ratio is
\[ \begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\mu_{0}, \hat{\sigma}_{0}^{2} \mid \boldsymbol{y}\right)}{L\left(\hat{\mu}, \hat{\sigma}^{2} \mid \boldsymbol{y}\right)} &=\frac{\left(2 \pi \hat{\sigma}_{0}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}_{0}^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}\right\}}{\left(2 \pi \hat{\sigma}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}^{2}} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right\}} \\ &=\left(\frac{\hat{\sigma}^{2}}{\hat{\sigma}_{0}^{2}}\right)^{\frac{n}{2}} \exp \left\{\frac{n \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}-\frac{n \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\} \\ &=\left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \end{aligned} \]
Then, there exists some \(a\) such that we reject \(H_{0}\) if
\[ \left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \leq a \Rightarrow \frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}} \leq b \]
where \(a\) and \(b\) are constants chosen to give significance level \(\alpha\). Now, we may write
\[ \begin{aligned} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} &=\sum_{i=1}^{n}\left\{\left(y_{i}-\bar{y}\right)+\left(\bar{y}-\mu_{0}\right)\right\}^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+2\left(\bar{y}-\mu_{0}\right) \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)+n\left(\bar{y}-\mu_{0}\right)^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2} \end{aligned} \]
So we reject \(H_{0}\) if
\[ \frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2}} \leq b \]
Thus, rearranging, we reject \(H_{0}\) if
\[ 1+\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq c \Rightarrow \frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq d, \]
where \(c\) and \(d\) are constants chosen to give significance level \(\alpha\), that is we can write
\[ \alpha=P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=P\left(\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \geq d \mid H_{0}\right) \]
where \(S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}\).
To get \(d\) we need to work out the distribution of \(\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}}\) under the null hypothesis. For \(Y_{i} \sim \mathcal{N}\left(\mu, \sigma^{2}\right)\) we have
\[ \bar{Y} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right) \]
and so, under \(H_{0}\),
\[ \bar{Y} \sim \mathcal{N}\left(\mu_{0}, \frac{\sigma^{2}}{n}\right) \quad \text {and}\quad \frac{\sqrt{n}\left(\bar{Y}-\mu_{0}\right)}{\sqrt{\sigma^{2}}} \sim \mathcal{N}(0,1) . \]
By Cochran’s theorem, we have
\[ \frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}} \sim \chi_{1}^{2} \quad\text{and}\quad \frac{(n-1) S^{2}}{\sigma^{2}} \sim \chi_{n-1}^{2} \]
Now we may use the fact that if \(U\) and \(V\) are independent rvs such that \(U \sim \chi_{\nu_{1}}^{2}\) and \(V \sim \chi_{\nu_{2}}^{2}\), then \(\frac{U / \nu_{1}}{V / \nu_{2}} \sim \mathcal{F}_{\nu_{1}, \nu_{2}}\). Here, \(U=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}}\) and \(V=\frac{(n-1) S^{2}}{\sigma^{2}}\). Hence, if \(H_{0}\) is true, we have
\[ F=\frac{U / 1}{V /(n-1)}=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \sim \mathcal{F}_{1, n-1} . \]
Therefore, we reject \(H_{0}\) at a significance level \(\alpha\) if
\[ \frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}} \geq F_{1, n-1, \alpha} \]
where \(F_{1, n-1, \alpha}\) is such that \(P\left(F \geq F_{1, n-1, \alpha}\right)=\alpha\). Using the fact that an \(F\) distribution with \(1\) and \(n-1\) degrees of freedom is equivalent to the square of a \(t\)-statistic with \(n-1\) degrees of freedom, we reject \(H_{0}\) if
\[ \sqrt{\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}}} \geq t_{n-1, \frac{\alpha}{2}}, \]
that is, if
\[ \left|\frac{\bar{y}-\mu_{0}}{\sqrt{s^{2} / n}}\right| \geq t_{n-1, \frac{\alpha}{2}} \]
Of course, this is the usual two-sided \(t\)-test.
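We can verify this equivalence numerically in base R: the likelihood-ratio-derived \(F\) statistic equals the square of the usual one-sample \(t\) statistic (a sketch using hypothetical data):

```r
y = c(4.1, 5.3, 4.8, 6.0, 5.1, 4.4)  # hypothetical sample
mu0 = 4.5                            # null value to test
n = length(y)
F.stat = n * (mean(y) - mu0)^2 / var(y)          # n*(ybar - mu0)^2 / s^2
t.stat = unname(t.test(y, mu = mu0)$statistic)   # usual one-sample t statistic
all.equal(F.stat, t.stat^2)                      # TRUE: F = t^2
```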
It can be shown that all the standard tests in situations with normal distributions are generalized likelihood ratio tests.
Wilks’ theorem
Samuel S. Wilks derived (in a 3-page paper) a much more general form of this result, now known as Wilks’ theorem:
Assume that the joint distribution of \(Y_{1}, \ldots, Y_{n}\) depends on \(p\) unknown parameters and that, under \(H_{0}\), the joint distribution depends on \(p_{0}\) unknown parameters. Let \(\nu=p-p_{0}\). Then, under some regularity conditions, when the null hypothesis is true, the distribution of the statistic \(-2 \log \{\lambda(\boldsymbol{Y})\}\) converges to a \(\chi_{\nu}^{2}\) distribution as the sample size \(n \rightarrow \infty\), i.e., when \(H_{0}\) is true and \(n\) is large, \[-2 \log \{\lambda(\boldsymbol{Y})\} \underset{\text { approx. }}{\sim} \chi_{\nu}^{2}\] Thus, for large \(n\), the critical region for a test with approximate significance level \(\alpha\) is \[\mathcal{R}=\left\{\boldsymbol{y}:-2 \log \{\lambda(\boldsymbol{y})\} \geq \chi_{\nu, \alpha}^{2}\right\}\]
In other words, the statistic \(-2 \log \{\lambda(\boldsymbol{Y})\}\) from a likelihood ratio test of 2 competing nested hypotheses is approximately distributed according to a chi-square distribution with \(\nu\) degrees of freedom, where \(\nu\) is the difference in the number of parameters between the two models.
This is why the likelihood ratio test we performed on the politeness term followed a chi-square distribution with 1 degree of freedom: the politeness term adds 1 parameter to be estimated to the model.
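As a sanity check, we can simulate the null distribution of \(-2\log\lambda\) for a simplified variant of Example 1 (a normal mean with *known* variance, where the ratio reduces exactly to \(-2\log\lambda = n(\bar Y-\mu_0)^2/\sigma^2\)) and compare its tail behavior against \(\chi^2_1\):

```r
set.seed(1)
n = 50; mu0 = 0; sigma = 1
# -2 log lambda for a normal mean test with known variance
stats = replicate(5000, {
  y = rnorm(n, mean = mu0, sd = sigma)
  n * (mean(y) - mu0)^2 / sigma^2
})
# fraction of simulated statistics beyond the chi-square critical value;
# should be close to alpha = 0.05 under the null
mean(stats >= qchisq(0.95, df = 1))
```

Here the chi-square distribution is exact for any \(n\); for the unknown-variance case and for mixed models, it holds only asymptotically.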
Example 2: Poisson variable
Suppose that \(Y_{i} \sim \operatorname{Poisson}(\lambda)\) and consider testing \(H_{0}: \lambda=\lambda_{0}\) against \(H_{1}: \lambda \neq \lambda_{0}\). Now, the likelihood is
\[ L(\lambda \mid \boldsymbol{y})=\frac{\lambda^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda}}{\prod_{i=1}^{n} y_{i} !} \]
The maximum likelihood estimate of \(\lambda\) is \(\hat{\lambda}=\bar{y}\) and the restricted maximum likelihood estimate of \(\lambda\) under \(H_{0}\) is \(\hat{\lambda}_{0}=\lambda_{0}\). Thus, the generalized likelihood ratio is
\[ \begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\lambda_{0} \mid \boldsymbol{y}\right)}{L(\hat{\lambda} \mid \boldsymbol{y})} &=\frac{\lambda_{0}^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda_{0}}}{\prod_{i=1}^{n} y_{i} !} \frac{\prod_{i=1}^{n} y_{i} !}{\bar{y}^{\sum_{i=1}^{n} y_{i}} e^{-n \bar{y}}} \\ &=\left(\frac{\lambda_{0}}{\bar{y}}\right)^{\sum_{i=1}^{n} y_{i}} e^{n\left(\bar{y}-\lambda_{0}\right)} \end{aligned} \]
It follows that
\[ \begin{aligned} -2 \log \{\lambda(\boldsymbol{y})\} &=-2\left\{n \bar{y} \log \left(\frac{\lambda_{0}}{\bar{y}}\right)+n\left(\bar{y}-\lambda_{0}\right)\right\} \\ &=2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \end{aligned} \]
Here, \(p=1\) and \(p_{0}=0\), and so \(\nu=1\). Therefore, by Wilks’ theorem, when \(H_{0}\) is true and \(n\) is large,
\[ 2 n\left\{\bar{Y} \log \left(\frac{\bar{Y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{Y}\right\} \sim \chi_{1}^{2} . \]
Hence, for a test with approximate significance level \(\alpha\), we reject \(H_{0}\) if and only if
\[ 2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \geq \chi_{1, \alpha}^{2} . \]
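A quick numeric illustration of this arithmetic in base R, using hypothetical counts and testing \(\lambda_0 = 3\) (note Wilks’ approximation is rough at such a small \(n\); this only shows the computation):

```r
y = c(3, 5, 4, 6, 2)   # hypothetical Poisson counts
lambda0 = 3            # null value of the rate parameter
n = length(y); ybar = mean(y)
# -2 log lambda, from the formula above
lrt = 2 * n * (ybar * log(ybar / lambda0) + lambda0 - ybar)
pval = pchisq(lrt, df = 1, lower.tail = FALSE)
c(statistic = lrt, p = pval)
```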