Recall the “hot hands” example from the Monte Carlo section. How did we meaningfully determine if the player did or did not have hot hands in that example?
First, we assumed throws were identical and independent. Then we noted that under this assumption, we can make certain predictions about the data. In particular, we noted that this assumption implied that the sequence of throws was unimportant. Thus, the throws can be arbitrarily shuffled without changing any statistical properties of the data. Then, it was fairly easy to generate new, randomly shuffled versions (i.e. permutations) of the data and compare a particular property of the original data v. the permuted data.
What we did may seem fairly ad-hoc, however this actually closely follows the standard procedure used in testing called the Neyman-Pearson framework
(below notes taken from Karl Rohe)
The above example is a hypothesis testing problem. Here is the outline for how we do these problems:
In this process, we combine our understanding of the world, our data, and our model. With all three, we make statistical inference.
The above logic performs a Monte Carlo simulation under a hypothesized reality. Suppose that you observe a statistic \(s\). Using a simulation of hypothesized reality, we can simulate a random variable \(S\) that we imagine as mimicking \(s\) under the null. Using Monte Carlo, we can compute (approximate) \(P(S \ge s)\) or \(P(S\le s)\) (i.e. one-sided tests) or any other comparison that seems reasonable given the circumstances (e.g. two-sided tests, etc). If the null model is unlikely to generate a value equal to \(s\), or more extreme, then we reject the null hypothesis.
If you have learned a little bit about hypothesis testing before this class, then the above logic might look very strange. However, if scientists had “electronic computers” when they first started doing statistics, then this is exactly what they would have done and it is exactly what you would have been taught. However, when us humans started doing statistics, it was very hard to simulate lots of \(S\) values. So, we came up with mathematical formulas to approximate the above logic. Those formulas are the things you might have previously learned. Once you understand the logic above, I hope that you can return to those formulas and find a deeper understanding for them.
A friend wants to gamble with you and proposes a coin flip game. Before playing the game with her, you want to ensure whether the coin is fair.
Let’s follow the logic above…
library(purrr)
simulate_S = function(){
  
  lots_of_random_variables = rbernoulli(n = 1000, p = 1/2)
  X = mean(lots_of_random_variables)
  
  ## or, you could write:
  # X = rbinom(n = 1, size = 1000, prob = 1/2) / 1000
  
  S = X - 1/2
  return(S)
}Then, we would find large values of \(S\) to be surprising; large and positive or large and negative. This is a “two-sided test”.
check_if_S_in_surprising_set = function(S){
  return(abs(S) >= .04)
}library(dplyr)
r = 10000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r), 
                         S_in_suprising_set = rep(NA, r)) 
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
 }
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))## # A tibble: 1 x 1
##   `mean(S_in_suprising_set)`
##                        <dbl>
## 1                     0.0109So, if the coin were fair, this would happen only about 1% of the time. So, I would say that what we’ve observed (54% heads) is pretty surprising!
What has changed in the logic above?
Here is the simulation…
check_if_S_in_surprising_set = function(S){
  return(abs(S) >= .02)  # why .02????
}
library(dplyr)
r = 10000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r), 
                         S_in_suprising_set = rep(NA, r)) 
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
 }
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))## # A tibble: 1 x 1
##   `mean(S_in_suprising_set)`
##                        <dbl>
## 1                      0.220So, if the coin were fair, then roughly 20% of experiments would create something at least as surprising as 52% heads. I don’t think this is that surprising. Does this mean that the coin is fair? No. It means that we don’t have enough evidence to suggest it is unfair. That is, we don’t “accept the null hypothsis (fair coin)”. We never “accept” the null hypothesis.
What has changed in the logic above?
Compared to the previous question, we have the same proportion (52%), but 10 times as may samples.
Here is the simulation…
simulate_S = function(){
  
  lots_of_random_variables = rbernoulli(n = 10000, p = 1/2)
  X = mean(lots_of_random_variables)
  
  ## or, you could write:
  # X = rbinom(n = 1, size = 10000, prob = 1/2) / 10000
  
  S = X - 1/2
  return(S)
}
r = 10000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r), 
                         S_in_suprising_set = rep(NA, r)) 
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
 }
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))## # A tibble: 1 x 1
##   `mean(S_in_suprising_set)`
##                        <dbl>
## 1                          0Ten times more samples makes 52%, 1000 times less likely! Whoa.
Discuss… What is a more productive variation on this question?
In 2018, 39,404 people in the US died in car accidents. In 2019, it was 38,800. So, it went down by roughly 1.8%. Is this a difference that could have happened by chance? Or, is this difference “statistically significant”?
Follow the logic of statistical testing above.
\(H_0: D_{2018}, D_{2019}\) are independent and identically distributed Poisson(\(\lambda = 39,102\))
\(S = \frac{D_{2018} - D_{2019}}{D_{2018}}.\)
Notice that under the null hypothesis, \(S\) is going to take values around zero. I think we should use a “two-sided test”, meaning that we will reject if the absolute value of \(S\) (i.e. \(|S|\)) is big. (Why two-sided? Well, if the number was going up, people would also have a strong reaction. So, whether \(S\) is large and positive or large and negative, we would take either as providing evidence against the null hypothesis.) So, we will want to evaluate
\(P(|S| > |s|)\)
\(s = \frac{d_{2019} - d_{2018}}{d_{2018}}\approx -1.5\%\)
s = (38800-39404)/(39404)
s## [1] -0.01532839library(tidyverse)
# First, write a function to simulate S
simulate_S = function(){
  D2018 = rpois(1,39102)
  D2019 = rpois(1,39102)
  S = (D2019-D2018)/(D2018)
  return(S)
}
# Second, write a function to evaluate whether S \in A.
check_if_S_in_A = function(S){
  return(abs(S) > abs(s))
}
# Now, we are going to do it lots of times.  
# Let's arrange the simulations in a data.frame with three columns
r = 10000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r), 
                         S_in_A = rep(NA, r)) 
for(i in 1:r){
  # monte_carlo$S[i] = simulate_S()
  # monte_carlo$S_in_A[i] = check_if_S_in_A(monte_carlo$X[i])
 
# I'm going to use the alterative way...
 monte_carlo[i,2] = simulate_S()
 monte_carlo[i,3] = check_if_S_in_A(monte_carlo[i,2])
}
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_A))## # A tibble: 1 x 1
##   `mean(S_in_A)`
##            <dbl>
## 1         0.0319monte_carlo %>% ggplot(aes(x = S)) + geom_histogram()See below table:
| Reject H0 | Not Reject H0 | |
|---|---|---|
| H0 True | Type I Error (\(\alpha\)) | Correct | 
| H0 False | Correct | Type II Error (\(\beta\)) | 
If the null is true and we do not reject it, or if the null is false and we reject it, we have done the right thing. If the null is true and we reject it, we have made the wrong conclusion. We say we have made a Type I error, or an \(\alpha\) error. We also call \(\alpha\) the probability of a Type I error, which can be expressed as follows: \[\alpha= P(\text{Reject}~H_0|H_0~\text{true})\]
If the null is false and we do not reject it, we have also made the wrong conclusion. We say we have made a Type II error, or a \(\beta\) error. We also call \(\beta\) the probability of a Type II error, which can be expressed as follows: \[\beta = P(\text{Not reject}~H_0|H_0~\text{false})\]
Very closely related to \(\beta\) is power: \[Power = 1-\beta= P(\text{Reject}~H_0|H_0~\text{false})\]