link to source

XKCD comic


Exercises

You can do these exercises individually, but we recommend you work on them in a small group. Choose 2 of the following exercises to complete.


  1. Estimating Robbin’s constant (mean distance between points in a cube).
    1. Randomly generate 2 points \((x_1,y_1,z_1)\), \((x_2,y_2,z_2)\) uniformly in the unit cube a total of \(N\) times (at least \(1000\), but the more the better!)
      • hint: you can easily generate all the coordinates you need at once with runif(6*N), then reshape as an \(N\times 6\) matrix (one column for each coordinate component, with each row representing a pair of points) and then perform the arithmetic in the next step by using vectorized operations on the columns (i.e. using each column all at once) to improve computational efficiency.
      • if you are having difficulties with the above^ you can always use the naive way of running a for loop N times, where in each step of the loop you generate 2 points (6 coordinates total) and then perform the arithmetic in the next step.
    2. Next, compute the standard Euclidean distance between each pair of points and find the mean distance. (Bonus: plot the distribution of these distances!)
    3. Calculate your percentage error from the true value.
    4. Bonus: can you increase the accuracy of your estimation by using more points? How low can you get the error?
    5. Super bonus: Repeat the above for another 2D or 3D object of your choice (how about a triangle or a sphere?)
# Do exercise 2 here

  1. How many heads in a row should you see in \(N\) flips of a fair coin?
    1. Start by randomly flipping a fair coin (\(p=0.5\)) a total of \(N=10\) times (hint: use either rbernoulli function from purrr or rbinom with n=10 and size=1) and record how many heads (defined as a value of \(1\)) in a row you observe (this has been implemented in the function longestHeadRun function below for you.
    2. Repeat the above step \(M\) times (at least \(1000\) times, but this time, don’t use an extremely large \(M\), since we will repeat the previous step for other values of \(N\)). What is the mean length of the largest run of heads in \(10\) flips?
      • NOTE: \(N\) here is the size of each experiment (i.e. each experiment consists of \(N\) flips), whereas \(M\) is how many experiments are performed. It is common in Monte Carlo methods to have two types of parameters, one type for the properties of each experiment, and one type that determines how many experiments are done. Increasing \(N\) (number of flips in each experiment) will increase the mean-run-length, whereas increasing \(M\) (number of experiments) will increase the precision of your mean-run-length estimate for a particular number of flips.
    3. Now, repeat the above (you may use the same \(M\)) for at least 3 other values of \(N\) (again, feel free to do more if you wish!). Display your results in a table.
      • NOTE this step should be easy if you’ve written your code with good style. I recommend writing a function that does all the above for any given \(N\) and \(M\) and maybe \(p\), e.g. findMeanRun = function(N,M,p=0.5){......}. Then, for different values of \(N\) and \(M\) you can simply change the arguments given to the function, e.g. findMeanRun(10,1000) or findMeanRun(20,1000), etc, then put them in a data frame.
      • ALSO NOTE the above function syntax^ sets N and M as arguments to the function without default values, but sets 0.5 as the default value of the argument p. For a different example, see this.
    4. Validate your results against other people’s results (for example, this post). Are your results consistent with others?
    5. Bonus: run a few more values of \(N\) and plot the results, showing the mean run length vs number of flips \(N\). (bonus²: what happens if you increase \(M\)?)
    6. Super bonus if you still want MORE: Like the post referenced above, can you fit a smooth curve through the points?
# given output of rbernoulli or rbinom (a vector of 0's and 1's)
# compute the length of the longest continuous run of 1's
longestHeadRun = function(trials){
  with(rle(trials),max(c(0,lengths[values==1])))
}

# demo (output hidden for brevity)
longestHeadRun(c(0,0,0,0,0,0,0,0,0,0,0,0)) # returns 0
longestHeadRun(c(1,0,1,1,0,1,1,1,0,0,0,0)) # returns 3
# Do exercise 3 here

  1. Estimating a \(t\)-distribution with \(N-1\) degrees of freedom.
    1. Choose an arbitrary \(\mu\) and \(\sigma>0\) to use for the rest of the problem (you may choose the standard normal \(N(0,1)\) if you really wish, but where’s the fun in that?).
    2. Start by sampling \(N=2\) values from the normal distribution with mean \(\mu\) and standard deviation \(\sigma\) (note this counts as \(1\) experiment) and calculate the \(t\)-statistic of your sample. Recall the \(t\)-statistic for a sample \(X\) is defined as \[t=\frac{\overline{X}-\mu}{s/\sqrt{N}}~,~~~s=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\overline{X})^2}\] where \(\overline{X}\) is the sample mean and \(s\) is the sample standard deviation
      • NOTE: Make sure you’re actually computing the \(s\) for this sample, NOT just using \(\sigma\) here!
      • You can use the built-in mean( ) and sd( ), but if you really want to do a completely manual Monte Carlo, feel free to compute the \(t\)-statistic yourself.
      • Also NOTE: Similar to the note in exercise 3.b., \(N\) here is the size of each experiment and \(M\) is how many experiments are performed. Increasing \(N\) gives a \(t\)-distribution with a different number of degrees of freedom (namely, \(N-1\)), whereas increasing \(M\) gives a more accurate estimate of each distribution of a particular degree.
    3. Repeat the above step \(M\) times (similar to exercise 3.b., use at least \(1000\) times, but don’t use an extremely large \(M\) since we will repeat this for other values of \(N\)).
    4. You’ve just simulated drawing from a \(t\)-distribution with \(N-1=1\) degree of freedom! Now plot the resultant values in a density plot.
    5. For comparison, plot the theoretical distribution with \(1\) degree of freedom (this page may be helpful). For best results, overlay this on top of the previous plot, but you’re having trouble with this, you can also plot them side-by-side.
    6. Repeat the above steps for at least 3 other values of \(N\) (for example 3, 6, 11, but feel free to choose your own or choose more than 3!). For each \(N\), plot both your simulated distribution and the theoretical distribution.
      • NOTE: again, like the note in exercise 3.c., this should be easy if you used a function!
    7. Bonus: What do you notice about these distributions? What do they converge to and why?
# Do exercise 4 here