Complete the exercises, update the “author” and “date” fields in the header, knit it, and submit both the HTML and RMD files to Canvas. Due date: Oct 27, 2021 at 11:59pm.
Suppose that we have iid random variables \(X_1,X_2,\dots\) drawn from a distribution with mean \(\mu\) and variance \(\sigma^2\). In lecture, we appealed to the CLT to say that as \(n\) gets large, the distribution of \[ \frac{ \bar{X} - \mu }{ \sqrt{ \sigma^2/n } } \] is well approximated by that of a standard normal, and we could use that fact to obtain a (approximate) 95% confidence interval for the parameter \(\mu\), since (letting \(Z\) be a standard normal RV) \[ 0.95 = \Pr[ -1.96 \le Z \le 1.96 ] \approx \Pr\left[ \bar{X} - 1.96 \sqrt{\sigma^2/n} \le \mu \le \bar{X} + 1.96 \sqrt{\sigma^2/n} \right]. \] One small problem, which we largely ignored in lecture (and in your introductory classes, most likely), is that we usually don’t know the variance \(\sigma^2\). Well, just as we can estimate the mean \(\mu = \mathbb{E} X_1\), we can estimate the variance \(\sigma^2 = \mathbb{E} (X_1 - \mu)^2\).
Well, we estimate \(\mu = \mathbb{E} X_1\) as \[ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, \] So by analogy, we should estimate \(\sigma^2\) as \[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu )^2. \] Okay, but this includes the parameter \(\mu\), which we don’t know. But we have an estimate for the mean \(\mu\)– the sample mean \(\bar{X}\), so let’s just plug in \(\bar{X}\) for \(\mu\) in the definition of the variance.
This is the “plug-in principle”, which we have mentioned previously. If a parameter shows up in another quantity you want to estimate, you just plug in your estimate for that parameter.
Following that logic, a reasonable choice of estimator for the variance is \[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X} )^2. \]
Implement a function sigma2hat
that takes a vector and returns the estimate \(\hat{\sigma^2}\) defined above.
sigma2hat <- function( data ) {
muhat <- mean(data);
# TODO: finish the function. you should return a single number that stores your estimate of sigma^2.
}
Okay, we’ve got an estimator for \(\sigma^2\) implemented in R. How good is our estimator? Well, that turns out to be a harder question than it appears at first, and we’ll have lots more to say about it in the coming lectures when we talk about prediction.
For now, let’s ask a simpler question. In lecture, we have discussed the concept of bias. Recall that if \(S\) is an estimator of parameter \(\theta\), then the bias of \(S\) is given by \[ \operatorname{bias}( S ) = \mathbb{E} S(X_1,X_2,\dots,X_n ) - \theta. \] In statistics, we like unbiased estimators– estimators whose bias is zero. In the case of our estimator \(\hat{\sigma}^2\), this means \(\mathbb{E} \hat{\sigma}^2 = \sigma^2\).
Unbiased estimators are those whose expectation is equal to the parameter that they are supposed to estimate. Said another way, an unbiased estimator is one that is correct, on average.
We could do some math to assess whether or not \(\hat{\sigma}^2\) is unbiased, but let’s use Monte Carlo, instead. We’re going to repeatedly generate data, and for each sample we generate, we will apply our estimator \(\hat{\sigma}^2\) to that data and measure the difference between our estimate and the true value of \(\sigma^2\). If we repeat that experiment many times, we can estimate the bias, using the same logic as we discussed in our lectures on Monte Carlo.
Let’s first write a function to run one instance of this experiment, then we’ll write code to repeat the experiment multiple times.
run_trial <- function( nsamp ) {
# nsamp is the sample size, n in the math displays above.
# Let's generate our data from a standard normal-- then we know the variance already! (\sigma^2 = 1)
data <- rnorm( nsamp ); # recall, mean=0,sd=1 is the default.
# TODO: write code to apply our estimator to the data
# TODO: return the difference between our estimate and the truth.
# The expectation of this quantity should be the bias, as defined above.
}
Okay, we have a function that generates one random sample and measures the difference between our estimate and the truth.
To estimate the bias of \(\hat{\sigma}^2\), we need to run this same experiment multiple times (i.e., run many Monte Carlo replicates of the experiment).
Write a function to estimate the bias of \(\hat{\sigma}^2\) by repeatedly calling run_trial
and returning the average difference between \(\hat{\sigma}^2\) and \(\sigma^2\). Your function should take two arguments, a sample size nsamp
(i.e, \(n\) above) and a number of Monte Carlo replicates NMC
(i.e, the number of times we should repeat our experiment. Your function should return a single number corresponding to your estimate of the bias.
estimate_bias <- function( nsamp, NMC ) {
# nsamp is the sample size for use in run_trial above.
# NMC is the number of Monte Carlo replicates to use.
# TODO: code goes here.
# Hint:
# create a vector in which to store the MC replicates.
# Then try a for loop with NMC iterations.
# in each iteration, call run_trial, and store the resulting
# difference in your vector.
# Then return the average of that vector.
# Feel free to refer to the code from our Monte Carlo lectures for referece.
}
Use your code to estimate the bias of our estimator \(\hat{\sigma}^2\) using a sample size \(n=20\) and based on 2,000 Monte Carlo replicates.
Hint: you should find that the bias is approximately \(-1/n\), where \(n\) is the sample size. Of course, owing to the randomness in your experiments, this will not be exact.
#TODO: write code here.
You should have found above that \(\hat{\sigma}^2\) is (slightly) negatively biased. That is, \(\hat{\sigma}^2\) underestimates the true value of the variance. (for an extended discussion of the perils of under-estimating variance, see Nassim Taleb’s book The Black Swan: The Impact of the Highly Improbable.)
It turns out that we can show that \[ \mathbb{E} \hat{\sigma}^2 = \frac{n-1}{n} \sigma^2. \]
Bonus: prove this! (This is not worth any additional points, just your own pride in accomplishment). Hint: expand the square \((X_i - \bar{X})\) inside the sum in the definition of \(\hat{\sigma}^2\), apply the definition of \(\bar{X} = n^{-1} \sum_i X_i\), and use linearity of the expectation: \(\mathbb{E} (aX_1 + b X_2) = a \mathbb{X_1} + b \mathbb{X_2}.\)
This suggests using the estimator \[ \frac{n}{n-1} \hat{\sigma}^2, \] since in that case, \[ \mathbb{E} \frac{n}{n-1} \hat{\sigma}^2 = \frac{n}{n-1} \mathbb{E} \hat{\sigma}^2 = \frac{n}{n-1} \frac{n-1}{n} \sigma^2 = \sigma^2. \] That is, our adjusted estimator is unbiased!
Repeat the above Monte Carlo experiment to estimate the bias of this new estimator. You should find that it is quite close to zero (of course, it will still not be exactly zero, because of random variation).
#TODO: define analogues of the above functions, this time for our adjusted estimator.
# An easy approach is to define
sigma2hat_adjusted <- function( data ) {
# TODO: code goes here; analogous to sigma2hat above.
}
run_trial_adjusted <- function( nsamp ) {
# nsamp is the sample size, n in the math displays above.
# TODO: code goes here; analogous to run_trial above.
}
estimate_bias_adjusted <- function( nsamp, NMC ) {
# nsamp is the sample size for use in run_trial above.
# NMC is the number of Monte Carlo replicates to use.
# TODO: code goes here, analogous to estimate_bias above.
}
So, suppose that we plug in our estimates \(\bar{X}\) for \(\mu\) and \(\hat{\sigma}^2\) (or its adjusted version) for \(\sigma^2\). Then we are saying that \[ \frac{ \bar{X} - \mu }{ \sqrt{ \hat{\sigma}^2/n }} \] is approximately normal. This is true, in that as \(n\) grows, this quantity comes to have the same CDF as a standard normal. But this ignores one important point– for smaller values of \(n\), the random variation of \(\hat{\sigma}^2\) actually matters, and the t-statistic is a better approximation. This is precisely why we use the t-statistic instead of a simple standard normal in testing situations where we don’t know \(\sigma^2\). You may have seen this in your introductory courses without fully understanding why– now you know!
This a nice example where when \(n\) is large, it doesn’t much matter what we do– everything ends up being approximately normal for \(n\) large. On the other hand, for finite \(n\) (e.g, real-world situations where \(n\) is on the order of 20 or 100), the different choices of approximation matter a great deal.
In lecture, we saw two different approaches to building confidence intervals.
One thing we didn’t discuss in lecture is how these two methods compare (except to say that… it depends).
This problem is fairly open-ended. Your job is to set up an experiment comparing these two different confidence interval methods.
First things first, we need to generate data. Define a function generate_data
, that takes two arguments, lambda
and nsamp
, where lambda
is the parameter of a Poisson and nsamp
is a positive integer specifying a number of samples. Your function should return a vector of length nsamp
, whose entries are drawn iid from Poisson with parameter given by lambda
. lambda
should default to 1.
generate_data <- function( nsamp, lambda=1 ) {
# TODO: write code here.
}
Our simulation-based method requires that we have an estimate of the mean, and our CLT-based method requires that we have estimates of the mean and variance. Write functions estimate_pois_mean
and estimate_pois_var
to estimate both the mean and variance of the Poisson given the output of generate_data
.
Hint: the Poisson has the convenient property that the expectation and variance are both equal to \(\lambda\), so these can be the same function– both can just return the sample mean, if you want, but you can also use the sample variance. Your choice!
estimate_pois_mean <- function( data ) {
# TODO: code goes here. Estimate lambda as the sample mean.
}
estimate_pois_var <- function( data ) {
# TODO: code goes here. Estimate the variance.
# One option is to be lazy and use the fact that mean=var=lambda in Poisson
# Like this: return( estimate_pois_mean( data ) );
}
Implement a function CLT_CI
that takes arguments data
(the vector of data generated by generate_data
) and alpha
(a numeric between 0 and 1, i.e, a probability, which should default to \(0.05\)), and returns a confidence interval (i.e., a vector of length 2 whose first entry is the left end of the interval and whose second entry is the right end of the interval).
This confidence interval should be based on our CLT-based approximation. That is, you should use the fact that \[ \frac{ \bar{X} - \lambda }{ \sqrt{ (\operatorname{Var X_1})/n }} \] is approximately normal.
Use your estimate_pois_var
function to estimate the variance term in the denominator (ignore the fact that this is equal to \(\lambda\)), use estimate_pois_mean
to get \(\bar{X}\) and follow the basic outline from lecture to create a \((1-\alpha)*100\)-percent two-sided confidence interval (if “two-sided” is not familiar to you, no worries– just follow the recipe from lecture and you’ll be fine!).
CLT_CI <- function( data, alpha=0.05) {
# TODO: code here.
# Step 1. use data to get Xbar and the variance estimate.
# Step 2. Use the estimated variance and choice of alpha to construct
# a two-sided confidence interval for lambda
# (sorry, Z-scores are going to show up here, but only briefly!)
# Return the CI. Be careful-- depending on how you got your CI, it
# might not be a simple vector (e.g., it might have a header).
# Feel free to use the lecture code for reference.
}
Implement a function simulation_CI
that takes arguments data
(the vector of data generated by generate_data
) and alpha
(a numeric between o and 1, i.e, a probability, which should default to \(0.05\)), and returns a confidence interval (i.e., a vector of length 2 whose first entry is the left end of the interval and whose second entry is the right end of the interval).
Your function should construct the confidence interval according to the simulation-based approach discussed in class. You are free to adapt the code from lecture in your code. You may pick the number of Monte Carlo iterates (i.e., replicates) as you wish, but we would recommend setting it to be at least a few hundred.
simulation_CI <- function( data, alpha=0.05) {
# TODO: code goes here.
# Return a vector c(L, U) with L the left end and U the right ("upper") end of the CI.
}
So, now you’ve implemented two different confidence intervals. Let’s compare them.
First of all, we need to be able to check if our CI “caught” the true parameter or not.
Write a function contained
that takes a CI (i.e., a two-vector with first entry smaller than the second) and a number (the true value of the parameter) that returns TRUE if the true value of the parameter is inside the interval and FALSE otherwise.
contained <- function( myCI, trueparam ) {
# TODO: code goes here.
# Hint: the estimation lecture notes include code to do this.
}
What does it mean for one confidence interval to be better than another? We define the coverage of a confidence interval to be the probability that the true parameter lies in the interval.
Ideally, a \(1-\alpha\) confidence interval has coverage \(1-\alpha\). Simple!
The problem arises from approximation– our CLT- and simulation-based CIs are based on approximations, and so their coverage will not be exactly \(1-\alpha\).
Implement a function called estimate_coverage
, whose arguments are given in the code block below, and which returns an estimate of the coverage (i.e., a number between 0 and 1).
CI_fn
is a function (in our code, this will be either CLT_CI
or simulation_CI
). Yes, functions can be arguments to other functions in R! Note that this argument will just be a function. Not a string.NMC
is a positive integer, the number of replicates to run.nsamp
is a positive integer, the number of samples.lambdatrue
is a positive real, the parameter of the Poisson that we will use to generate our data. It should default to 1.alpha
is a numeric between 0 and 1. 1-alpha
will be the (target) level of our CI.estimate_coverage <- function( CI_fn, NMC, nsamp, lambdatrue=1, alpha=0.05 ) {
coverages <- 0; # Count how often the CI "catches" lambdatrue
for (i in 1:NMC ) {
# Generate data: nsamp draws from Poisson( lambdatrue )
data <- rpois(n=nsamp, lambda=lambdatrue);
# Construct a confidence interval from it using the given CI function
# if lambdatrue is in the CI, increment coverages.
}
return( coverages/NMC )
}
Wow, that was a lot of coding! Here comes the payoff!
Use the code above to estimate the coverage of your two CI methods. Try changing the different arguments– \(\lambda\), \(\alpha\), nsamp
, etc. Coverages of the two methods should both be close to \(1-\alpha\), whatever you chose \(\alpha\) to be, but you will likely find that they tend to follow patterns, consistently either over- or under-shooting the target.
Take some time to explore these patterns. What do you see? There are no right or wrong answers, here. We are just asking you to see how these two different CI methods behave in different situations.
An important point that we alluded to above is that in the Poisson, the mean and variance are both just \(\lambda\). This is actually part of why you may have noticed some weird behavior in your experiments above. Unfortunately, a thorough explanation of that weird behavior will have to wait for your later theory courses.
TODO: observations go here.
The file mule_kicks.csv
, available for download (here)[https://kdlevin-uwstat.github.io/STAT340-Fall2021/hw/03/mule_kicks.csv], contains a simplified version of a very famous data set. The data consists of the number of soldiers killed by being kicked by mules or horses each year in a number of different companies in the Prussian army near the end of the 19th century.
This may seem at first to be a very silly thing to collect data about, but it is a very interesting thing to look at if you are interested in rare events. Deaths by horse kick were rare events that occurred independently of one another, and thus it is precisely the kind of process that we might expect to obey a Poisson distribution.
Download the data and read it into R by running
download.file('https://kdlevin-uwstat.github.io/STAT340-Fall2021/hw/03/mule_kicks.csv', destfile='mule_kicks.csv');
mule_kicks <- read.csv('mule_kicks.csv', header=TRUE);
head(mule_kicks);
## deaths
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
mule_kicks
contains a single column, called deaths
. Each entry is the number of soldiers killed in one corps of the Prussian army in one year. There are 14 corps in the data set, studied over 20 years, for a total of 280 death counts.
Assuming that the mule kicks data follows a Poisson distribution, produce a point estimate for the rate parameter \(\lambda\). There are no strictly right or wrong answers, here, though there are certainly better or worse ones.
#TODO: estimate the rate parameter.
lambdahat <- NA; # TODO: write code; store your estimate in lambdahat.
Using everything you know (Monte Carlo, CLT, etc.), construct a confidence interval for the rate parameter \(\lambda\). Explain in reasonable detail what you are doing and why you are constructing the confidence interval in this way (a few sentences is fine!).
TODO: Code, plots, and/or explanation go here.
Here’s a slightly more open-ended question. We assumed that the data followed a Poisson distribution. This may or may not be a reasonable assumption. Use any and all tools that you know about to assess how reasonable or unreasonable this assumption is.
Once again, there are no strictly right or wrong answers here. Explain and defend your decisions and thought processes in a reasonable way and you will receive full credit.
TODO: Code, plots and/or explanation go here.