Main notes

Simplest example: 1 random effect

library(tidyverse)

# table copied from loom example in second link
looms = read.table(text="1 98 97 99 96
2 91 90 93 92
3 96 95 97 95
4 95 96 99 98") %>% 
  column_to_rownames("V1") %>% 
  t %>% 
  as.data.frame %>% 
  remove_rownames %>% 
  pivot_longer(1:4,names_to="loom",values_to="strength") %>% 
  arrange(loom)

print(looms,n=Inf)
## # A tibble: 16 x 2
##    loom  strength
##    <chr>    <int>
##  1 1           98
##  2 1           97
##  3 1           99
##  4 1           96
##  5 2           91
##  6 2           90
##  7 2           93
##  8 2           92
##  9 3           96
## 10 3           95
## 11 3           97
## 12 3           95
## 13 4           95
## 14 4           96
## 15 4           99
## 16 4           98
library(lme4)

lmer.looms = lmer(strength ~ (1 | loom), data=looms)

summary(lmer.looms)
## Linear mixed model fit by REML ['lmerMod']
## Formula: strength ~ (1 | loom)
##    Data: looms
## 
## REML criterion at convergence: 63.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.38018 -0.57260 -0.04342  0.82574  1.52491 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  loom     (Intercept) 6.958    2.638   
##  Residual             1.896    1.377   
## Number of obs: 16, groups:  loom, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   95.438      1.363   70.01
fixef(lmer.looms)
## (Intercept) 
##     95.4375
ranef(lmer.looms)
## $loom
##   (Intercept)
## 1   1.9309741
## 2  -3.6864051
## 3   0.2925718
## 4   1.4628591
## 
## with conditional variances for "loom"
confint(profile(lmer.looms))
##                  2.5 %    97.5 %
## .sig01       1.1293759  5.748927
## .sigma       0.9672533  2.184005
## (Intercept) 92.4393299 98.435668
par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.looms),resid(lmer.looms),main="Resid v. fitted")
qqnorm(resid(lmer.looms))

# getting equivalent ANOVA table by running anova() of fixed-effect version
anova(aov(strength ~ loom, data=looms))
## Analysis of Variance Table
## 
## Response: strength
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## loom       3 89.188 29.7292  15.681 0.0001878 ***
## Residuals 12 22.750  1.8958                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 2: mixed effect model

This example adapted from here.

This dataset was collected from an experiment done on how someone’s pitch (i.e. frequency of sound) changed depending on politeness and scenario 1. Subjects were asked to imagine a certain scenario and then asked to read a line in either a formal or an informal register. From the original study:

In the design that we used, there’s an additional source of non-independence that needs to be accounted for: We had different items. One item, for example, was an “asking for a favor” scenario. Here, subjects had to imagine asking a professor for a favor (polite condition), or asking a peer for a favor (informal condition). Another item was an “excusing for coming too late” scenario, which was similarly divided between polite and informal. In total, there were 7 such different items.

Below are two plots showing the mean pitch for different subjects and also for different items (scenarios) that were provided.

In the original paper, the researcher decided to fit politeness and sex as fixed effects (because they represent two specific levels of interest that could bet reated as having fixed, constant values over the entire population), and to fit the subjects and items as random effects (because these represent a sample of non-specific levels whose specific values are NOT interesting, and which are drawn from a group of many other possible subjects/items). They arrived at a model equation like this:

pitch ~ politeness + sex + (1 | subject) + (1 | item) + e

This is called a mixed effects model due to the presence of both random and fixed effects. The flexibility of these mixed effect model structures allows you to more accurately represent the experimental design and thus the variance structure of the data.

Boxplots showing the breakdown of mean pitch by politeness and sex.

politeness = read_csv("https://pastebin.com/raw/SC6Likfa")

lmer.polite = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness)
summary(lmer.polite)
## Linear mixed model fit by REML ['lmerMod']
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 775.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2591 -0.6236 -0.0772  0.5388  3.4795 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 219.5    14.81   
##  subject  (Intercept) 615.6    24.81   
##  Residual             645.9    25.41   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  256.846     16.116  15.938
## attitudepol  -19.721      5.584  -3.532
## genderM     -108.516     21.013  -5.164
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.173       
## genderM     -0.652  0.004
ranef(lmer.polite)
## $scenario
##   (Intercept)
## 1  -13.506509
## 2    6.582894
## 3   11.407831
## 4   20.629415
## 5   -1.936030
## 6  -12.173903
## 7  -11.003698
## 
## $subject
##    (Intercept)
## F1  -13.907651
## F2   10.419110
## F3    3.488541
## M3   28.382033
## M4    5.378484
## M7  -33.760517
## 
## with conditional variances for "scenario" "subject"
fixef(lmer.polite)
## (Intercept) attitudepol     genderM 
##   256.84627   -19.72111  -108.51635
confint(profile(lmer.polite))
##                   2.5 %    97.5 %
## .sig01         6.864421  29.99058
## .sig02        11.420516  42.73985
## .sigma        21.599106  30.05092
## (Intercept)  226.084997 287.60823
## attitudepol  -30.739625  -8.69669
## genderM     -149.340247 -67.68308

Unfortunately, computing \(p\)-values becomes slightly more complicated, and there’s different approaches. One of the easier approaches is to use a likelihood ratio test. We fit two versions of the model, one with the term we want to test, and one without, and then use the anova() function to compare the two. For lmer() fits, make sure to set REML=FALSE in the fit (two fits with REML=TRUE cannot be compared against each other).

For this example, we wish to find if politeness affects pitch, so we can compare the following models against each other:

lmer.polite.reduced = lmer(frequency~gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)
lmer.polite.full    = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)

anova(lmer.polite.reduced,lmer.polite.full)
## Data: politeness
## Models:
## lmer.polite.reduced: frequency ~ gender + (1 | subject) + (1 | scenario)
## lmer.polite.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##                     npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## lmer.polite.reduced    5 816.72 828.81 -403.36   806.72                     
## lmer.polite.full       6 807.10 821.61 -397.55   795.10 11.618  1  0.0006532
##                        
## lmer.polite.reduced    
## lmer.polite.full    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This could be interpreted in the following way:

“… politeness significantly affected pitch (\(\chi^2_1=11.62\), \(p=0.00065\)), lowering it by about 19.7 Hz \(\pm\) 5.6 (standard errors) …”

par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.polite),resid(lmer.polite),main="Resid v. fitted")
qqnorm(resid(lmer.polite))

Some other examples to briefly scan over
  • Example using glmer() to fit a mixed effects logistic regression model to predict
  • Nature paper discussion flexibility of mixed effects models
  • Example of using lmer() to analyze the famous sleep deprivation dataset, which has a longitudinal component




Brief detour: Likelihood Ratio Test

largely adapted from source

The likelihood ratio test is a very broad and generalizable method of testing hypotheses that is extremely versatile.

Define null hypothesis \(H_{0}: \boldsymbol{\theta} \in \Theta^{\star}\) (meaning the parameter \(\theta\) (which is allowed to be a vector) is in the restricted (i.e. nested) set \(\Theta^{\star}\)) against the alternative hypothesis \(H_{1}: \boldsymbol{\theta} \in \Theta \backslash \Theta^{\star}\) (meaning \(\theta\) is allowed to be in the more general set \(\Theta\backslash \Theta^{\star}\)).

Then, define the following likelihood ratio test statistic where \(L(\boldsymbol{\theta} \mid \boldsymbol{y})\) represents the likelihood function for \(\theta\).

\[ \lambda(\boldsymbol{y})=\frac{\max _{\boldsymbol{\theta} \in \Theta^{\star}} L(\boldsymbol{\theta} \mid \boldsymbol{y})}{\max _{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta} \mid \boldsymbol{y})} \]

We can use this to define a test where we choose some value \(a\) such that rejecting a true null hypothesis when \(\lambda(\boldsymbol{Y})\leq a\) has probability \(\alpha\). In other words, we choose some threshold critical value \(a\) that controls the type I error rate at \(\alpha\)

\[ P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=\alpha \]

If we let \(\hat{\boldsymbol\theta}\) denote the maximum likelihood estimate of \(\boldsymbol{\theta}\) and let \(\widehat{\boldsymbol{\theta}}_{0}\) denote the value of \(\boldsymbol\theta\) which maximizes the likelihood over all values of \(\boldsymbol{\theta}\) in \(\Theta^{\star}\), then we may write

\[ \lambda(\boldsymbol{y})=\frac{L(\hat{\boldsymbol{\theta}}_{0} \mid \boldsymbol{y})}{L(\hat{\boldsymbol{\theta}} \mid \boldsymbol{y})} \]

The quantity \(\widehat{\boldsymbol\theta}_{0}\) is called the restricted maximum likelihood estimate of \(\boldsymbol\theta\) under \(H_{0} .\)


Example 1: Normal distribution hypothesis test

Suppose that \(Y_{i} \stackrel{i.i.d}{\sim}{\mathcal{N}}\left(\mu, \sigma^{2}\right)\) and consider testing \(H_{0}: \mu=\mu_{0}\) against \(H_{1}: \mu \neq \mu_{0}\). Then the likelihood is

\[ L\left(\mu, \sigma^{2} \boldsymbol{y}\right)=\left(2 \pi \sigma^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right\} \]

The maximum likelihood estimate of \(\boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}}\) is \(\widehat{\boldsymbol{\theta}}=\left(\widehat{\mu}, \widehat{\sigma^{2}}\right)^{\mathrm{T}}\), where \(\widehat{\mu}=\bar{y}\) and \(\widehat{\sigma}^{2}=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} / n\).

Similarly, the restricted maximum likelihood estimate of \(\boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}}\) under \(H_{0}\) is \(\widehat{\boldsymbol{\theta}}_{0}=\left(\widehat{\mu}_{0}, \widehat{\sigma^{2}}_{0}\right)^{\mathrm{T}}\), where \(\widehat{\mu}_{0}=\mu_{0}\) and \(\widehat{\sigma}_{0}^{2}=\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} / n .\)

Thus, the generalized likelihood ratio is

\[ \begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\mu_{0}, \hat{\sigma}_{0}^{2} \mid \boldsymbol{y}\right)}{L\left(\hat{\mu}, \hat{\sigma}^{2} \mid \boldsymbol{y}\right)} &=\frac{\left(2 \pi \hat{\sigma}_{0}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}_{0}^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}\right\}}{\left(2 \pi \hat{\sigma}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}^{2}} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right\}} \\ &=\left(\frac{\hat{\sigma}^{2}}{\hat{\sigma}_{0}^{2}}\right)^{\frac{n}{2}} \exp \left\{\frac{n \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}-\frac{n \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\} \\ &=\left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \end{aligned} \]

Then, there exists some \(a\) such that we reject \(H_{0}\) if

\[ \left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \leq a \Rightarrow \frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}} \leq b \]

where \(a\) and \(b\) are constants chosen to give significance level \(\alpha\). Now, we may write

\[ \begin{aligned} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} &=\sum_{i=1}^{n}\left\{\left(y_{i}-\bar{y}\right)+\left(\bar{y}-\mu_{0}\right)\right\}^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+2\left(\bar{y}-\mu_{0}\right) \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)+n\left(\bar{y}-\mu_{0}\right)^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2} \end{aligned} \]

So we reject \(H_{0}\) if

\[ \frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2}} \leq b \]

Thus, rearranging, we reject \(H_{0}\) if

\[ 1+\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq c \Rightarrow \frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq d, \]

where \(c\) and \(d\) are constants chosen to give significance level \(\alpha\), that is we can write

\[ \alpha=P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=P\left(\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \geq d \mid H_{0}\right) \]

where \(S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}\).

To get \(d\) we need to work out the distribution of \(\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}}\) under the null hypothesis. For \(Y_{i} \sim \mathcal{N}\left(\mu, \sigma^{2}\right)\) we have

\[ \bar{Y} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right) \]

and so, under \(H_{0}\),

\[ \bar{Y} \sim \mathcal{N}\left(\mu_{0}, \frac{\sigma^{2}}{n}\right) \quad \text {and}\quad \frac{\sqrt{n}\left(\bar{Y}-\mu_{0}\right)}{\sqrt{\sigma^{2}}} \sim \mathcal{N}(0,1) . \]

By Cochran’s theorem, we have

\[ \frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}} \sim \chi_{1}^{2} \quad\text{and}\quad \frac{(n-1) S^{2}}{\sigma^{2}} \sim \chi_{n-1}^{2} \]

Now we may use the fact that if \(U\) and \(V\) are independent rvs such that \(U \sim \chi_{\nu_{1}}^{2}\) and \(V \sim \chi_{\nu_{2}}^{2}\), then \(\frac{U / \nu_{1}}{V / \nu_{2}} \sim \mathcal{F}_{\nu_{1}, \nu_{2}}\). Here, \(U=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}}\) and \(V=\frac{(n-1) S^{2}}{\sigma^{2}}\). Hence, if \(H_{0}\) is true, we have

\[ F=\frac{U / 1}{V /(n-1)}=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \sim \mathcal{F}_{1, n-1} . \]

Therefore, we reject \(H_{0}\) at a significance level \(\alpha\) if

\[ \frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}} \geq F_{1, n-1, \alpha} \]

where \(F_{1, n-1, \alpha}\) is such that \(P\left(F \geq F_{1, n-1, \alpha}\right)=\alpha\). Using the fact that an \(F\) distribution with \(1\) and \(n-1\) degrees of freedom is equivalent to the square of a \(t\)-statistic with \(n-1\) degrees of freedom, we reject \(H_{0}\) if

\[ \sqrt{\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}}} \geq t_{n-1, \frac{\alpha}{2}}, \]

that is, if

\[ \left|\frac{\bar{y}-\mu_{0}}{\sqrt{s^{2} / n}}\right| \geq t_{n-1, \frac{\alpha}{2}} \]

Of course, this is the usual two-sided \(t\)-test

It can be shown that all the standard tests in situations with normal distributions are generalized likelihood ratio tests.


Wilks’ theorem

Samuel S. Wilks derived (in a 3 page paper) a much more generalized form of this, now known as Wilks’ theorem

Assume that the joint distribution of \(Y_{1}, \ldots, Y_{n}\) depends on \(p\) unknown parameters and that, under \(H_{0}\), the joint distribution depends on \(p_{0}\) unknown parameters. Let \(\nu=p-p_{0}\). Then, under some regularity conditions, when the null hypothesis is true, the distribution of the statistic \(-2 \log \{\lambda(\boldsymbol{Y})\}\) converges to a \(\chi_{\nu}^{2}\) distribution as the sample size \(n \rightarrow \infty\), i.e., when \(H_{0}\) is true and \(n\) is large, \[-2 \log \{\lambda(\boldsymbol{Y})\} \underset{\text { approx. }}{\sim} \chi_{\nu}^{2}\] Thus, for large \(n\), the critical region for a test with approximate significance level \(\alpha\) is \[\mathcal{R}=\left\{\boldsymbol{y}:-2 \log \{\lambda(\boldsymbol{y})\} \geq \chi_{\nu, \alpha}^{2}\right\}\]

In other words, the likelihood ratio test of 2 competing hypotheses is approximately distributed according to a chi-square distribution with \(\nu\) degrees of freedom, where \(\nu\) is the difference in number of parameters between the two models.

This is why the likelihood ratio test we performed on the politeness term followed a chi-square distribution with 1 degree of freedom: the politeness term adds 1 parameter to be estimated to the model.


Example 2: Poisson variable

Suppose that \(Y_{i} \sim \operatorname{Poisson}(\lambda)\) and consider testing \(H_{0}: \lambda=\lambda_{0}\) against \(H_{1}: \lambda \neq \lambda_{0}\). Now, the likelihood is

\[ L(\lambda \mid \boldsymbol{y})=\frac{\lambda^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda}}{\prod_{i=1}^{n} y_{i} !} \]

The maximum likelihood estimate of \(\lambda\) is \(\hat{\lambda}=\bar{y}\) and the restricted maximum likelihood estimate of \(\lambda\) under \(H_{0}\) is \(\hat{\lambda}_{0}=\lambda_{0}\). Thus, the generalized likelihood ratio is

\[ \begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\lambda_{0} \mid \boldsymbol{y}\right)}{L(\hat{\lambda} \mid \boldsymbol{y})} &=\frac{\lambda_{0}^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda_{0}}}{\prod_{i=1}^{n} y_{i} !} \frac{\prod_{i=1}^{n} y_{i} !}{\bar{y}^{\sum_{i=1}^{n} y_{i}} e^{-n \bar{y}}} \\ &=\left(\frac{\lambda_{0}}{\bar{y}}\right)^{\sum_{i=1}^{n} y_{i}} e^{n\left(\bar{y}-\lambda_{0}\right)} \end{aligned} \]

It follows that

\[ \begin{aligned} -2 \log \{\lambda(\boldsymbol{y})\} &=-2\left\{n \bar{y} \log \left(\frac{\lambda_{0}}{\bar{y}}\right)+n\left(\bar{y}-\lambda_{0}\right)\right\} \\ &=2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \end{aligned} \]

Here, \(p=1\) and \(p_{0}=0\), and so \(\nu=1\). Therefore, by Wilks’ theorem, when \(H_{0}\) is true and \(n\) is large,

\[ 2 n\left\{\bar{Y} \log \left(\frac{\bar{Y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{Y}\right\} \sim \chi_{1}^{2} . \]

Hence, for a test with approximate significance level \(\alpha\), we reject \(H_{0}\) if and only if

\[ 2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \geq \chi_{1, \alpha}^{2} . \]


Example 3: Model selection using LRT

  1. https://www.researchgate.net/publication/256935591_The_phonetic_profile_of_Korean_formal_and_informal_speech_registers↩︎