
*(XKCD comic)*


Exercises

You can do these exercises individually, but we recommend you work on them in a small group. Complete all exercises.


1. Gender gap in chess

Due to myriad complex biological factors (e.g. higher average height, greater skeletal muscle mass, and larger lung size, just to name a few), there continues to be a significant gender gap in performance in most sports at the top competitive level.

A similar gender gap also exists in games like chess, where it is much less clear why. Does the discrepancy mean that men are actually better at chess than women?

Dr. Wei Ji Ma refuted this idea in a recent article, arguing that participation level largely accounts for the observed differences in ratings of top chess players. Since women are so underrepresented in chess (making up between \(5\%\) and \(15\%\) of players, depending on inclusion criteria such as location, rating, and age), there is simply a much lower chance of having female players who are as good as the top male players (similar to how a longer sequence of coin flips has more chances of containing long runs than a shorter sequence does).

Using official Elo rating data for Indian players obtained from FIDE (the international governing body for chess), he performed a permutation test showing that participation level can indeed largely account for the observed gender gap. For this exercise, you will reproduce his results.

  1. Download a (slightly preprocessed) data frame of the current FIDE standard (tournament) format Elo ratings. Import it normally using read_csv (a .csv.gz file is just a csv that was compressed to save space; you can ignore this, since most functions in R, read_csv included, will automatically decompress and read it, so no extra work is needed), and check it was read correctly by inspecting the first few rows with head(). (A possible completion is sketched after the scaffold below.)
# DO THIS: uncomment the lines below and complete them
# 
# ratings.standard = read_csv(...)
# head(...)
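For reference, here is one possible completion; the file name is a placeholder for whatever path or URL your download gives you:

library(tidyverse)

# placeholder file name; substitute the actual path or URL of the download
ratings.standard = read_csv("fide_standard_ratings.csv.gz")
head(ratings.standard)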
  1. Ma points out that junior players (born on or after 2000) may have unreliable ratings. Remove these players from the data frame. Also remove non-Indian players, since we are trying to replicate his results.

    Then, make a density plot of players’ ratings, showing the male and female players as separate curves in the same plot window. Do they look very different? (A sketch of one completion follows the scaffold below.)

# DO THIS: uncomment lines below and complete them
# note & can be used to filter 2 criteria (bday and country)
# 
# ratings.standard.in = ratings.standard %>% 
#   filter(... & ...)
# 
# ggplot(ratings.standard.in, aes(x=... , color=...)) + geom_density()

# don't forget to add appropriate title/axes!!
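As a sketch of one possible completion (the column names Bday, Country, Sex, and Rating are assumptions here; check your head() output for the actual names):

# keep only non-junior (born before 2000) Indian players; column names assumed
ratings.standard.in = ratings.standard %>% 
  filter(Bday < 2000 & Country == "IND")

# overlay the male and female rating densities, with a title and axis labels
ggplot(ratings.standard.in, aes(x=Rating, color=Sex)) + 
  geom_density() + 
  labs(title="Ratings of non-junior Indian players by sex", x="Elo rating", y="Density")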
  1. Make a table showing the number of players, the mean rating, and the highest rating for each sex. (Hint: this can easily be done in one dplyr step; use group_by followed by a summarise computing 3 values. One possible version is sketched after the scaffold below.)

    What percent of these non-junior Indian players are women? How many points lower is the rating of the top female Indian player than the top male?

# DO THIS: uncomment the lines below and complete them
# 
# ratings.standard.in %>% 
#   group_by(...) %>% 
#   summarise(num  = ... ,  # for help, see dplyr.tidyverse.org/reference/n.html
#             mean = ... ,
#             max  = ... )
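One way this summary might look, again assuming the columns are named Sex and Rating:

ratings.standard.in %>% 
  group_by(Sex) %>% 
  summarise(num  = n(),           # number of players of each sex
            mean = mean(Rating),  # mean rating
            max  = max(Rating))   # highest rating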
  1. Let the null hypothesis be that there is no actual gender gap in performance, i.e. that sex does not influence ratings. If this is true, then we should be able to permute the sex column without affecting any observables.

    Write a function permutedMaxSexDiff(df,n) that, given a data frame df and a number of iterations n, will for each iteration permute the players’ sexes, recompute the difference in rating between the top male and top female players, and save the result to be returned. (The missing lines are sketched after the scaffold below.)

    For efficiency (important inside for loops), we implement this by shuffling the ratings and then computing \[\max(\text{rest of ratings vector}) - \max(\text{first } n_F \text{ elements of permuted ratings}),\] where \(n_F\) is the number of female players in the original data frame. Note this strategy is equivalent to permuting the player sexes and then computing the difference in each group: in both cases, we are just randomly choosing \(n_F\) values as one group and comparing the maxes of the two groups.

# DO THIS: uncomment function below and complete
# 
# 
# permutedMaxSexDiff = function(df,n){
#   
#   # preallocate vector of results
#   diffs = rep(NA,n)
#   
#   # find number of players to draw in each iteration
#   n.F = sum(df$Sex == "F")
#   
#   # start loop
#   for(i in 1:n){
#     
#     # permute ratings
#     permuted = sample(df$Rating)
#     
#     # find max of first n.F elements of vector
#     max.F = ...?
#     
#     # find max of rest of vector
#     max.M = ...?
#     
#     # find difference in max ratings and save in results vector
#     diffs[i] = ...?
#   }
#   
#   # return results vector
#   return(diffs)
# }
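A sketch of how the three blanks might be filled, following the shuffling strategy described above:

max.F = max(permuted[1:n.F])      # max of the first n.F shuffled ratings
max.M = max(permuted[-(1:n.F)])   # max of the remaining ratings
diffs[i] = max.M - max.F          # permuted version of the observed gap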
  1. Now, run the function with at least \(N=1000\) iterations (can you do \(10,000\)?) and visualize the results by making a density plot. Add a vertical line to denote the gender gap in the actual population of non-junior Indian chess players (top male rating minus top female rating).

    Compute a \(p\)-value for your test (hint: if you do enough iterations, this should be roughly between 0.6 and 0.7). Is this result what you would expect to observe under the null hypothesis? Are your results consistent with Ma’s? (One possible completion follows the scaffold below.)

# DO THIS: uncomment the following lines and complete them
# 
# # run the function
# diffs = permutedMaxSexDiff(ratings.standard.in, ...?)
# 
# # use enframe to turn vector into data frame for plotting
# diffs %>% enframe(name=NULL,value="diff") %>%
#   ggplot(aes(...?)) + geom_density(adjust=2) + geom_vline(...?)
# 
# # compute p-value using mean(), this works since
# # TRUE is numerically treated as 1 and FALSE as 0, and
# # mean is just sum of elements divided by length
# pval = mean(diffs >= ...?)
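As one possible completion (obs.diff and the sex codes "M" and "F" are assumptions; compute the observed gap from your own data frame):

# observed gap: top male rating minus top female rating (column names assumed)
obs.diff = max(ratings.standard.in$Rating[ratings.standard.in$Sex == "M"]) - 
  max(ratings.standard.in$Rating[ratings.standard.in$Sex == "F"])

# run the permutation function
diffs = permutedMaxSexDiff(ratings.standard.in, 10000)

# density of permuted gaps, with the observed gap as a vertical line
diffs %>% enframe(name=NULL, value="diff") %>%
  ggplot(aes(x=diff)) + geom_density(adjust=2) + geom_vline(xintercept=obs.diff)

# proportion of permuted gaps at least as extreme as the observed gap
pval = mean(diffs >= obs.diff)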

REPLACE THIS TEXT WITH YOUR RESPONSE


2. Linear regression coefficient testing

To demonstrate the incredible versatility of Monte Carlo methods, we have a second example, this time on linear regression. (In the future you will see Monte Carlo methods beyond permutation tests, but permutation tests are extremely well suited to hypothesis testing, so we will continue using them for now.)

This exercise is written to be simpler than the last one, and should take you less time.

  1. Inspect the built-in example data frame mtcars that’s loaded with every R session. Briefly read the help page by running ?mtcars if you are unfamiliar with the dataset.

    There are many variables, but today we will just focus on mpg (the response variable) and drat (a predictor variable). Make sure you know what these two variables refer to in real life (see the help page!). Create a new data frame called mtcars.small with just these two columns and print this new data frame. (A minimal sketch follows the code stub below.)

### do 2.a here
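A minimal sketch, assuming the tidyverse is loaded (mpg and drat are real mtcars columns; mtcars.small is the name the exercise asks for):

mtcars.small = mtcars %>% select(mpg, drat)  # keep only the two columns of interest
mtcars.small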
  1. Below, I’ve performed a simple linear regression, showing the coefficients. Take note of the coefficient estimate for the slope \(\beta_1\), which represents the strength of the linear relationship between the two variables (note this is NOT the same as statistical significance).

    One way you can estimate the significance of \(\beta_1\) (that is usually taught) is to use the linear regression model formulation to derive a \(t\)-statistic test for the coefficients and then compare the estimates to the null hypothesis that \(\beta_1=0\), i.e. there is NO linear relationship between the variables. This method has the advantage of possibly higher power at the cost of being more susceptible to heteroscedasticity (among other things).

    You can see from the output below the slope has an estimate of \(7.678\) and a standard error of \(1.507\). The \(t\)-statistic is then calculated as \((7.678-0)/1.507=5.096\) which has \(n-2=30\) degrees of freedom where \(n\) is the number of observations. We can then compute the \(p\)-value using 2*(1-pt(5.096,30)) and confirm it matches the table. (Don’t worry if you don’t fully understand this; however, note the process is almost identical to how you would perform an ordinary \(t\)-test!)

# DO THIS: uncomment code below and take note of slope estimate
# first we run lm with mpg regressed on drat,
# data=mtcars.small means use columns from this data frame,
# then run coef() function to extract coefficients from model
# 
# summary(lm(mpg ~ drat, data=mtcars.small))$coefficients
# 2*(1-pt(5.096042,30))
  1. Another way you can estimate the significance of \(\beta_1\) under the null hypothesis of \(\beta_1=0\) is by (you guessed it!) permuting the coordinates! If indeed \(\beta_1=0\), then permuting the \(x\) coordinates (so they get assigned randomly to the \(y\) coordinates) should have no effect (since \(\beta_1=0\) implies the relationship is actually a flat line).

    Write a function that performs \(n\) iterations of the following:

    1. Randomly permute the \(x\) coordinates (drat values).
    2. Run linear regression using lm.
    3. Save the estimate for the slope of each iteration to be returned.

    Run your function with at least \(N=1000\) iterations (can you get it to \(10,000\) or even higher?). What \(p\)-value do you get from this method? (One possible implementation is sketched after the stub below.)

    • Note: If the method in part b. is accurate and the \(p\)-value is on the order of \(10^{-5}\), you may need \(10^5=100,000\) iterations to see even a single case where the permuted slope is equal to or greater in absolute value than the actual slope. So if you run a lower \(N\) and don’t see a single case, that’s ok; it just means the \(p\)-value is too small for your simulation size to detect.
### do 2.c here
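One possible implementation sketch; the function and variable names here are illustrative, not prescribed by the exercise:

# permute drat, refit the regression, and record the slope estimate each time
permutedSlopes = function(df, n){
  
  # preallocate vector of results
  slopes = rep(NA, n)
  
  for(i in 1:n){
    df.perm = df
    df.perm$drat = sample(df$drat)                      # shuffle the x coordinates
    slopes[i] = coef(lm(mpg ~ drat, data=df.perm))[2]   # save the slope estimate
  }
  
  return(slopes)
}

slopes = permutedSlopes(mtcars.small, 10000)
obs.slope = coef(lm(mpg ~ drat, data=mtcars.small))[2]

# two-sided p-value: how often is a permuted slope at least as extreme?
pval = mean(abs(slopes) >= abs(obs.slope))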