
Main notes

Simplest example: 1 random effect

library(tidyverse)

# table copied from loom example in second link
looms = read.table(text="1 98 97 99 96
2 91 90 93 92
3 96 95 97 95
4 95 96 99 98") %>% 
  column_to_rownames("V1") %>% 
  t %>% 
  as.data.frame %>% 
  remove_rownames %>% 
  pivot_longer(1:4,names_to="loom",values_to="strength") %>% 
  arrange(loom)

print(looms,n=Inf)
## # A tibble: 16 x 2
##    loom  strength
##    <chr>    <int>
##  1 1           98
##  2 1           97
##  3 1           99
##  4 1           96
##  5 2           91
##  6 2           90
##  7 2           93
##  8 2           92
##  9 3           96
## 10 3           95
## 11 3           97
## 12 3           95
## 13 4           95
## 14 4           96
## 15 4           99
## 16 4           98
library(lme4)

lmer.looms = lmer(strength ~ (1 | loom), data=looms)

summary(lmer.looms)
## Linear mixed model fit by REML ['lmerMod']
## Formula: strength ~ (1 | loom)
##    Data: looms
## 
## REML criterion at convergence: 63.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.38018 -0.57260 -0.04342  0.82574  1.52491 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  loom     (Intercept) 6.958    2.638   
##  Residual             1.896    1.377   
## Number of obs: 16, groups:  loom, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   95.438      1.363   70.01
fixef(lmer.looms)
## (Intercept) 
##     95.4375
ranef(lmer.looms)
## $loom
##   (Intercept)
## 1   1.9309741
## 2  -3.6864051
## 3   0.2925718
## 4   1.4628591
## 
## with conditional variances for "loom"
confint(profile(lmer.looms))
##                  2.5 %    97.5 %
## .sig01       1.1293759  5.748927
## .sigma       0.9672533  2.184005
## (Intercept) 92.4393299 98.435668
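As a quick check on where these numbers come from (exact for a balanced design like this one, using the variance estimates from summary() above): the fixed intercept is just the grand mean, and each loom's BLUP is its raw mean deviation shrunk toward zero by the factor \sigma^2_{loom} / (\sigma^2_{loom} + \sigma^2_{resid}/4).

# intercept = grand mean; BLUPs = shrunken per-loom mean deviations
mean(looms$strength)                  # 95.4375, the fixed intercept
shrink = 6.958 / (6.958 + 1.896/4)    # variance estimates from summary() above
looms %>% 
  group_by(loom) %>% 
  summarize(dev = mean(strength) - mean(looms$strength)) %>% 
  mutate(blup = shrink * dev)         # reproduces ranef(lmer.looms)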
par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.looms),resid(lmer.looms),main="Resid v. fitted")
qqnorm(resid(lmer.looms))

# getting equivalent ANOVA table by running anova() of fixed-effect version
anova(aov(strength ~ loom, data=looms))
## Analysis of Variance Table
## 
## Response: strength
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## loom       3 89.188 29.7292  15.681 0.0001878 ***
## Residuals 12 22.750  1.8958                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
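Note the connection: for a balanced one-way layout, lmer's REML variance components coincide with the classical ANOVA method-of-moments estimates. A minimal check (assuming n = 4 observations per loom, as here):

ms = anova(aov(strength ~ loom, data=looms))[["Mean Sq"]]
(ms[1] - ms[2]) / 4   # sigma^2_loom  = (29.7292 - 1.8958)/4 = 6.958
ms[2]                 # sigma^2_resid = 1.896, both matching summary(lmer.looms)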

Example 2: mixed effects model

This example is adapted from here.

This dataset comes from an experiment on how a speaker’s pitch (i.e. voice frequency) changes depending on politeness and scenario.[1] Subjects were asked to imagine a given scenario and then to read a line in either a formal or an informal register. From the original study:

In the design that we used, there’s an additional source of non-independence that needs to be accounted for: We had different items. One item, for example, was an “asking for a favor” scenario. Here, subjects had to imagine asking a professor for a favor (polite condition), or asking a peer for a favor (informal condition). Another item was an “excusing for coming too late” scenario, which was similarly divided between polite and informal. In total, there were 7 such different items.

Below are two plots showing the mean pitch for the different subjects and for the different items (scenarios).

In the original paper, the researchers decided to fit politeness and sex as fixed effects (because they represent specific levels of interest that can be treated as having fixed, constant values over the entire population), and to fit the subjects and items as random effects (because these represent a sample of levels whose particular values are not of interest and which are drawn from a larger population of possible subjects/items). They arrived at a model equation like this:

pitch ~ politeness + sex + (1 | subject) + (1 | item) + e

This is called a mixed effects model due to the presence of both fixed and random effects. The flexibility of these model structures lets you represent the experimental design, and hence the variance structure of the data, more accurately.

Boxplots showing the breakdown of mean pitch by politeness and sex.

politeness = read_csv("https://pastebin.com/raw/SC6Likfa")
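A sketch of the boxplots described above (ggplot2 loads with tidyverse; the column names frequency, attitude, and gender appear in the model output below):

ggplot(politeness, aes(x=attitude, y=frequency, fill=gender)) +
  geom_boxplot() +
  labs(x="attitude", y="pitch (Hz)")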

lmer.polite = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness)
summary(lmer.polite)
## Linear mixed model fit by REML ['lmerMod']
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 775.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2591 -0.6236 -0.0772  0.5388  3.4795 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 219.5    14.81   
##  subject  (Intercept) 615.6    24.81   
##  Residual             645.9    25.41   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  256.846     16.116  15.938
## attitudepol  -19.721      5.584  -3.532
## genderM     -108.516     21.013  -5.164
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.173       
## genderM     -0.652  0.004
ranef(lmer.polite)
## $scenario
##   (Intercept)
## 1  -13.506509
## 2    6.582894
## 3   11.407831
## 4   20.629415
## 5   -1.936030
## 6  -12.173903
## 7  -11.003698
## 
## $subject
##    (Intercept)
## F1  -13.907651
## F2   10.419110
## F3    3.488541
## M3   28.382033
## M4    5.378484
## M7  -33.760517
## 
## with conditional variances for "scenario" "subject"
fixef(lmer.polite)
## (Intercept) attitudepol     genderM 
##   256.84627   -19.72111  -108.51635
confint(profile(lmer.polite))
##                   2.5 %    97.5 %
## .sig01         6.864421  29.99058
## .sig02        11.420516  42.73985
## .sigma        21.599106  30.05092
## (Intercept)  226.084997 287.60823
## attitudepol  -30.739625  -8.69669
## genderM     -149.340247 -67.68308

Unfortunately, computing p-values becomes slightly more complicated, and there are several approaches. One of the easier approaches is a likelihood ratio test: we fit two versions of the model, one with the term we want to test and one without, and then use the anova() function to compare them. For lmer() fits, make sure to set REML=FALSE (models fit with REML=TRUE cannot be compared with a likelihood ratio test when their fixed effects differ).

For this example, we wish to test whether politeness affects pitch, so we compare the following models against each other:

lmer.polite.reduced = lmer(frequency~gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)
lmer.polite.full    = lmer(frequency~attitude+gender+(1|subject)+(1|scenario), data=politeness, REML=FALSE)

anova(lmer.polite.reduced,lmer.polite.full)
## Data: politeness
## Models:
## lmer.polite.reduced: frequency ~ gender + (1 | subject) + (1 | scenario)
## lmer.polite.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##                     npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## lmer.polite.reduced    5 816.72 828.81 -403.36   806.72                     
## lmer.polite.full       6 807.10 821.61 -397.55   795.10 11.618  1  0.0006532
##                        
## lmer.polite.reduced    
## lmer.polite.full    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This could be interpreted in the following way:

“… politeness significantly affected pitch (\chi_{1}^{2}=11.62, p=0.00065), lowering it by about 19.7 Hz ± 5.6 (standard errors) …”

par(mfrow=c(1,2),cex=.6)
plot(fitted(lmer.polite),resid(lmer.polite),main="Resid v. fitted")
qqnorm(resid(lmer.polite))

Some other examples to briefly scan over
  • Example using glmer() to fit a mixed effects logistic regression model
  • Nature paper discussing the flexibility of mixed effects models
  • Example of using lmer() to analyze the famous sleep deprivation dataset, which has a longitudinal component (a minimal version of this fit is sketched below)
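
For reference, a minimal version of that sleep deprivation fit (a sketch; the sleepstudy data ships with lme4, and the by-subject random slope on Days is what captures the longitudinal component):

# random intercept and slope per subject: reaction time over days of deprivation
lmer.sleep = lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy)
summary(lmer.sleep)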




Brief detour: Likelihood Ratio Test

largely adapted from source

The likelihood ratio test is a very general and versatile method of testing hypotheses.

Define the null hypothesis H_{0}: \boldsymbol{\theta} \in \Theta^{\star} (meaning the parameter \boldsymbol{\theta}, which may be a vector, lies in the restricted (i.e. nested) set \Theta^{\star}) against the alternative hypothesis H_{1}: \boldsymbol{\theta} \in \Theta \backslash \Theta^{\star} (meaning \boldsymbol{\theta} lies in the complement of \Theta^{\star} within the full parameter space \Theta).

Then, define the following likelihood ratio test statistic where L(\boldsymbol{\theta} \mid \boldsymbol{y}) represents the likelihood function for \theta.

\lambda(\boldsymbol{y})=\frac{\max _{\boldsymbol{\theta} \in \Theta^{\star}} L(\boldsymbol{\theta} \mid \boldsymbol{y})}{\max _{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta} \mid \boldsymbol{y})}

We can use this to define a test: choose some value a such that rejecting a true null hypothesis when \lambda(\boldsymbol{Y}) \leq a has probability \alpha. In other words, we choose a critical value a that controls the type I error rate at \alpha:

P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=\alpha

If we let \hat{\boldsymbol\theta} denote the maximum likelihood estimate of \boldsymbol{\theta} and let \widehat{\boldsymbol{\theta}}_{0} denote the value of \boldsymbol\theta which maximizes the likelihood over all values of \boldsymbol{\theta} in \Theta^{\star}, then we may write

\lambda(\boldsymbol{y})=\frac{L(\hat{\boldsymbol{\theta}}_{0} \mid \boldsymbol{y})}{L(\hat{\boldsymbol{\theta}} \mid \boldsymbol{y})}

The quantity \widehat{\boldsymbol\theta}_{0} is called the restricted maximum likelihood estimate of \boldsymbol{\theta} under H_{0}.


Example 1: Normal distribution hypothesis test

Suppose that Y_{i} \stackrel{i.i.d}{\sim}{\mathcal{N}}\left(\mu, \sigma^{2}\right) and consider testing H_{0}: \mu=\mu_{0} against H_{1}: \mu \neq \mu_{0}. Then the likelihood is

L\left(\mu, \sigma^{2} \mid \boldsymbol{y}\right)=\left(2 \pi \sigma^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right\}

The maximum likelihood estimate of \boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}} is \widehat{\boldsymbol{\theta}}=\left(\hat{\mu}, \hat{\sigma}^{2}\right)^{\mathrm{T}}, where \hat{\mu}=\bar{y} and \hat{\sigma}^{2}=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} / n.

Similarly, the restricted maximum likelihood estimate of \boldsymbol{\theta}=\left(\mu, \sigma^{2}\right)^{\mathrm{T}} under H_{0} is \widehat{\boldsymbol{\theta}}_{0}=\left(\hat{\mu}_{0}, \hat{\sigma}_{0}^{2}\right)^{\mathrm{T}}, where \hat{\mu}_{0}=\mu_{0} and \hat{\sigma}_{0}^{2}=\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} / n.

Thus, the generalized likelihood ratio is

\begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\mu_{0}, \hat{\sigma}_{0}^{2} \mid \boldsymbol{y}\right)}{L\left(\hat{\mu}, \hat{\sigma}^{2} \mid \boldsymbol{y}\right)} &=\frac{\left(2 \pi \hat{\sigma}_{0}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}_{0}^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}\right\}}{\left(2 \pi \hat{\sigma}^{2}\right)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2 \hat{\sigma}^{2}} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right\}} \\ &=\left(\frac{\hat{\sigma}^{2}}{\hat{\sigma}_{0}^{2}}\right)^{\frac{n}{2}} \exp \left\{\frac{n \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}-\frac{n \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}{2 \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\} \\ &=\left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \end{aligned}

Then, there exists some a such that we reject H_{0} if

\left\{\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}}\right\}^{\frac{n}{2}} \leq a \Rightarrow \frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2}} \leq b

where a and b are constants chosen to give significance level \alpha. Now, we may write

\begin{aligned} \sum_{i=1}^{n}\left(y_{i}-\mu_{0}\right)^{2} &=\sum_{i=1}^{n}\left\{\left(y_{i}-\bar{y}\right)+\left(\bar{y}-\mu_{0}\right)\right\}^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+2\left(\bar{y}-\mu_{0}\right) \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)+n\left(\bar{y}-\mu_{0}\right)^{2} \\ &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2} \end{aligned}

So we reject H_{0} if

\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}+n\left(\bar{y}-\mu_{0}\right)^{2}} \leq b

Thus, rearranging, we reject H_{0} if

1+\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq c \Rightarrow \frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \geq d,

where c and d are constants chosen to give significance level \alpha, that is we can write

\alpha=P\left(\lambda(\boldsymbol{Y}) \leq a \mid H_{0}\right)=P\left(\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \geq d \mid H_{0}\right)

where S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}.

To get d we need to work out the distribution of \frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} under the null hypothesis. For Y_{i} \sim \mathcal{N}\left(\mu, \sigma^{2}\right) we have

\bar{Y} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right)

and so, under H_{0},

\bar{Y} \sim \mathcal{N}\left(\mu_{0}, \frac{\sigma^{2}}{n}\right) \quad \text {and}\quad \frac{\sqrt{n}\left(\bar{Y}-\mu_{0}\right)}{\sqrt{\sigma^{2}}} \sim \mathcal{N}(0,1) .

By Cochran’s theorem, we have

\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}} \sim \chi_{1}^{2} \quad\text{and}\quad \frac{(n-1) S^{2}}{\sigma^{2}} \sim \chi_{n-1}^{2}

Now we may use the fact that if U and V are independent rvs such that U \sim \chi_{\nu_{1}}^{2} and V \sim \chi_{\nu_{2}}^{2}, then \frac{U / \nu_{1}}{V / \nu_{2}} \sim \mathcal{F}_{\nu_{1}, \nu_{2}}. Here, U=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{\sigma^{2}} and V=\frac{(n-1) S^{2}}{\sigma^{2}}. Hence, if H_{0} is true, we have

F=\frac{U / 1}{V /(n-1)}=\frac{n\left(\bar{Y}-\mu_{0}\right)^{2}}{S^{2}} \sim \mathcal{F}_{1, n-1} .

Therefore, we reject H_{0} at a significance level \alpha if

\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}} \geq F_{1, n-1, \alpha}

where F_{1, n-1, \alpha} is such that P\left(F \geq F_{1, n-1, \alpha}\right)=\alpha. Using the fact that an F-distributed variable with 1 and n-1 degrees of freedom is the square of a t-distributed variable with n-1 degrees of freedom, we reject H_{0} if

\sqrt{\frac{n\left(\bar{y}-\mu_{0}\right)^{2}}{s^{2}}} \geq t_{n-1, \frac{\alpha}{2}},

that is, if

\left|\frac{\bar{y}-\mu_{0}}{\sqrt{s^{2} / n}}\right| \geq t_{n-1, \frac{\alpha}{2}}

Of course, this is the usual two-sided t-test.
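A quick numerical check of this equivalence (a sketch with simulated data; the sample and mu0 are arbitrary):

set.seed(1)
y = rnorm(20, mean=5, sd=2)   # hypothetical sample
mu0 = 4
n = length(y)

Fstat = n * (mean(y) - mu0)^2 / var(y)    # var() uses the n-1 denominator, i.e. s^2
c(Fstat, t.test(y, mu=mu0)$statistic^2)   # identical
pf(Fstat, 1, n-1, lower.tail=FALSE)       # matches t.test(y, mu=mu0)$p.value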

It can be shown that all the standard tests in situations with normal distributions are generalized likelihood ratio tests.


Wilks’ theorem

Samuel S. Wilks derived (in a three-page paper) a much more general form of this result, now known as Wilks’ theorem.

Assume that the joint distribution of Y_{1}, \ldots, Y_{n} depends on p unknown parameters and that, under H_{0}, the joint distribution depends on p_{0} unknown parameters. Let \nu=p-p_{0}. Then, under some regularity conditions, when the null hypothesis is true, the distribution of the statistic -2 \log \{\lambda(\boldsymbol{Y})\} converges to a \chi_{\nu}^{2} distribution as the sample size n \rightarrow \infty; i.e., when H_{0} is true and n is large,

-2 \log \{\lambda(\boldsymbol{Y})\} \underset{\text{approx.}}{\sim} \chi_{\nu}^{2}

Thus, for large n, the critical region for a test with approximate significance level \alpha is

\mathcal{R}=\left\{\boldsymbol{y}:-2 \log \{\lambda(\boldsymbol{y})\} \geq \chi_{\nu, \alpha}^{2}\right\}

In other words, the likelihood ratio statistic comparing two nested models is approximately chi-square distributed with \nu degrees of freedom, where \nu is the difference in the number of parameters between the two models.

This is why the likelihood ratio test we performed on the politeness term followed a chi-square distribution with 1 degree of freedom: the politeness term adds one estimated parameter to the model.
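We can recompute that test by hand from the two ML fits above and confirm it against the anova() output:

chisq = as.numeric(2 * (logLik(lmer.polite.full) - logLik(lmer.polite.reduced)))
chisq                                  # 11.618, the -2 log lambda statistic
pchisq(chisq, df=1, lower.tail=FALSE)  # 0.00065, matching anova() above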


Example 2: Poisson variable

Suppose that Y_{i} \stackrel{i.i.d.}{\sim} \operatorname{Poisson}(\lambda) and consider testing H_{0}: \lambda=\lambda_{0} against H_{1}: \lambda \neq \lambda_{0}. Now, the likelihood is

L(\lambda \mid \boldsymbol{y})=\frac{\lambda^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda}}{\prod_{i=1}^{n} y_{i} !}

The maximum likelihood estimate of \lambda is \hat{\lambda}=\bar{y} and the restricted maximum likelihood estimate of \lambda under H_{0} is \hat{\lambda}_{0}=\lambda_{0}. Thus, the generalized likelihood ratio is

\begin{aligned} \lambda(\boldsymbol{y})=\frac{L\left(\lambda_{0} \mid \boldsymbol{y}\right)}{L(\hat{\lambda} \mid \boldsymbol{y})} &=\frac{\lambda_{0}^{\sum_{i=1}^{n} y_{i}} e^{-n \lambda_{0}}}{\prod_{i=1}^{n} y_{i} !} \frac{\prod_{i=1}^{n} y_{i} !}{\bar{y}^{\sum_{i=1}^{n} y_{i}} e^{-n \bar{y}}} \\ &=\left(\frac{\lambda_{0}}{\bar{y}}\right)^{\sum_{i=1}^{n} y_{i}} e^{n\left(\bar{y}-\lambda_{0}\right)} \end{aligned}

It follows that

\begin{aligned} -2 \log \{\lambda(\boldsymbol{y})\} &=-2\left\{n \bar{y} \log \left(\frac{\lambda_{0}}{\bar{y}}\right)+n\left(\bar{y}-\lambda_{0}\right)\right\} \\ &=2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \end{aligned}

Here, p=1 and p_{0}=0, and so \nu=1. Therefore, by Wilks’ theorem, when H_{0} is true and n is large,

2 n\left\{\bar{Y} \log \left(\frac{\bar{Y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{Y}\right\} \sim \chi_{1}^{2} .

Hence, for a test with approximate significance level \alpha, we reject H_{0} if and only if

2 n\left\{\bar{y} \log \left(\frac{\bar{y}}{\lambda_{0}}\right)+\lambda_{0}-\bar{y}\right\} \geq \chi_{1, \alpha}^{2} .
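A minimal numerical illustration (simulated data; the choice lambda0 = 1 is arbitrary):

set.seed(1)
y = rpois(50, lambda=1.4)   # hypothetical sample
lambda0 = 1

lrt = 2 * length(y) * (mean(y) * log(mean(y)/lambda0) + lambda0 - mean(y))
pchisq(lrt, df=1, lower.tail=FALSE)   # approximate p-value via Wilks' theorem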


Example 3: Model selection using LRT

  1. https://www.researchgate.net/publication/256935591_The_phonetic_profile_of_Korean_formal_and_informal_speech_registers