
[Figure: XKCD comic]


Exercise

Today, we will be using real data from the US Geological Survey (USGS) to build a model to estimate the frequency of earthquakes across the world. As usual, we encourage you to work in small groups, but this is up to you.


Background

In seismology (the study of earthquakes), the relationship between the frequency and magnitude of earthquakes (in a certain place and period of time) can be modeled by the Gutenberg-Richter law (GR law). Let \(M\) be the Richter magnitude of a seismic event, and \(N\) be the number of events with magnitude greater than or equal to \(M\). Then, the GR law states that \[N=10^{a-bM}\] or in other words \[\log_{10}(N)=a-bM\] where \(a\) and \(b\) are unknown constants that depend on the particular choice of place/period. Note that this relationship should appear as a line on a plot of \(\log_{10}(N)\) vs \(M\).
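For a quick sanity check of the formula, consider some made-up constants (these are purely illustrative, not fitted values): with \(a=8\) and \(b=1\), the law predicts \(N = 10^{8-1\cdot 5} = 1000\) events of magnitude 5 or greater, and tenfold fewer events for each additional unit of magnitude.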


Data import

This dataset contains every earthquake (on Earth) of magnitude 4.5 or greater that was detected by the USGS from the beginning of 2011 to the end of 2020. A detailed description of the columns can be found here, but the main variables we are interested in are time and mag.binned (magnitude rounded to the nearest half integer).

For convenience, much of the data cleaning and preprocessing has already been done for you; download the prepared data here and save it into the same directory as this source file. Then, uncomment and run the lines of code below to import the data.

# library(tidyverse)
# library(lubridate)
# 
# parseTime = function(times){
#   as.POSIXct(strptime(times,"%Y-%m-%d %H:%M:%OS",tz="UTC"))
# }
# 
# quakes = read_csv("quakes.csv.gz", col_types="TddddcddddccTccddddccc") %>%
#   mutate_at(c("time","updated"),parseTime)


# NOTES:
# col_types tells R precisely what each column's type is, so it doesn't have to guess (and possibly guess wrong)
# parseTime is a helper function that converts the datetime strings in this file to datetime format
# mutate_at applies parseTime to both the "time" and "updated" columns
# also, recall that .gz means the file has been compressed with gzip, which saves a lot of space,
#   but R can still read it directly as if it were an ordinary csv file (underrated feature)

For our purposes, this data needs to be summarized to obtain \(N\) vs \(M\) values for each year, where \(N\) is the number of earthquakes with magnitude at least as strong as a given value of \(M\) (see the Wikipedia page on the GR law for more details). Since this is a bit tricky, it's also been done for you below.

# quakes.count = quakes %>%
#   count(year,mag.binned,name='count') %>%
#   group_by(year) %>%
#   mutate(N=rev(cumsum(rev(count))), year=as.character(year))

# NOTES:
# count(year,mag.binned) counts how many events with that magnitude occurred each year
#   (see https://dplyr.tidyverse.org/reference/count.html for more info)
# 
# group_by(year) followed by cumsum(...) takes the cumulative sum in each year
# the rev(...rev(...)) runs this cumsum in reverse (otherwise we get ≤M instead of ≥M)
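To see what the rev(cumsum(rev(...))) trick does, here is a tiny standalone example (the counts 5, 3, 2 are made up):

# rev(cumsum(rev(c(5, 3, 2))))
# # returns 10 5 2: each entry now counts all events with magnitude >= that bin,
# # rather than the count in that bin alone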

Before moving onto the next step, inspect the data frame to make sure you completely understand what it represents and check that everything looks right. (Your first and last rows should look something like 2011, 4.5, 7430, 10636 and 2020, 7, 10, 10).

# print(quakes.count,n=Inf)


Visualization

As usual, the first step is to visualize the data. Make a scatter plot of \(\log_{10}(N)\) vs \(M\) (note this means \(M\) goes on the horizontal axis, since "\(y\) vs \(x\)" always means \(y\) on the vertical axis and \(x\) on the horizontal, NOT the other way around). Then, add a line plot on top, making sure the years are correctly grouped together and visually distinguished from each other (use something like color or linetype (or both!), or even something else; use your judgment to determine what looks best).

Note: you can either use log10(N) as the y aesthetic, OR directly use y=N and just rescale the axis to be logarithmic using scale_y_log10(). I recommend this second method since it makes the axis easier to read (see here for an example).

Ideally, it might look something like this (don't forget to add a nice title and labels!):

# ggplot(quakes.count,aes(...)) + ...
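If you're not sure where to start, here is one plausible way to fill this in, using the recommended scale_y_log10() approach (a sketch only; the aesthetic choices are up to you):

# ggplot(quakes.count, aes(x = mag.binned, y = N, color = year)) +
#   geom_point() +
#   geom_line() +
#   scale_y_log10() +
#   labs(title = "Earthquake frequency vs. magnitude, 2011-2020",
#        x = "Magnitude (M)", y = "Number of events with magnitude >= M")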


Estimation

Next, we will fit a simple linear regression to the data to estimate \(a\) and \(b\). Complete the line of code below to fit the model (don’t forget the linear relationship is NOT between \(N\) and \(M\) but rather between \(\log_{10}(N)\) and \(M\), so adjust your model formula accordingly!).

# lm.quakes = lm(...)
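For reference, one plausible completion (a sketch; recall the response should be log10(N), and the predictor is the binned magnitude column):

# lm.quakes = lm(log10(N) ~ mag.binned, data = quakes.count)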

View a summary of the model to see the coefficient estimates, \(p\)-values, and other relevant info.

# summary(lm.quakes)

From your fit, what are your estimates for the constants \(a\) and \(b\)? Pay careful attention to the signs here! (Hint: remember the GR law uses the convention \(a-bM\), whereas R assumes you are fitting intercept + slope * M, so your fitted slope is \(-b\).)

Try to avoid copy-pasting or manually typing in values. The coef() function lets you extract coefficients from a model object (e.g. lm.quakes). You can then use [i] to access the i-th value of this coefficients vector. You may also want to use unname() at the end to strip the name from the value (if you don't, the name may carry through later calculations and end up mislabeling the output).

# a = ...?
# b = ...?
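Putting the hints together, one plausible sketch:

# a = unname(coef(lm.quakes)[1])    # intercept estimates a
# b = -unname(coef(lm.quakes)[2])   # slope estimates -b, so flip the sign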

(Hint: if you did this correctly, a+b should evaluate to approximately 9.68)


Checking fit

It’s always nice to visually check your fit to make sure everything looks right. Plot your line of best fit along with the data points in the chunk below.

Hint: this time, I recommend using log10(N) as the y aesthetic, then using geom_abline() to plot your regression line with your estimated slope and intercept values (this avoids the distorted axis caused by the other method's scale_y_log10(), which can be unintuitive to work with). Note you can use your variables a and b defined previously to avoid manually typing in numerical estimates (which is almost always bad!).

# ggplot(quakes.count,aes(...)) + ...
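A sketch of this approach (assuming a and b were defined as above, so the slope passed to geom_abline() is -b):

# ggplot(quakes.count, aes(x = mag.binned, y = log10(N), color = year)) +
#   geom_point() +
#   geom_abline(intercept = a, slope = -b) +
#   labs(x = "Magnitude (M)", y = "log10(N)")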

You can also check the residuals of your fit, which is fairly convenient to do in base R. Note there is some heteroscedasticity: higher magnitude earthquakes occur much less often than lower magnitude ones, so their frequencies are harder to estimate precisely. This is expected and not a huge problem. Normality appears mostly satisfied.

# par(mfrow=c(1,2))
# plot(lm.quakes,which=1:2)


Confidence & prediction

Give \(95\%\) confidence intervals for \(a\) and \(b\) (help page).

# confint(...)
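A sketch: confint() applied to the fitted model gives intervals for the intercept and slope, which you then translate into intervals for \(a\) and \(b\).

# confint(lm.quakes)
# # the (Intercept) row is already an interval for a;
# # for b, negate the slope row and swap its endpoints (since b = -slope)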

Give a brief interpretation of these intervals.

REPLACE TEXT WITH RESPONSE

From the GR law, we can deduce that the total number of earthquakes of ANY magnitude is equal to \[N_\text{Total}=10^a\] Using your estimate for \(a\), approximately how many earthquakes are there in total every year on Earth? (Remember to think about how precisely you can estimate this and round your answer appropriately! Not all the digits you get from R are significant, so don’t present all of them!).

Use the box below to compute your answer, then respond with a short sentence below.

# perform calculations here
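One way to do this (a sketch, using the a defined earlier):

# 10^a   # estimated total events per year; round to a sensible number of digits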

REPLACE TEXT WITH RESPONSE

Using your \(95\%\) confidence interval for \(a\), give an approximate \(95\%\) confidence interval for \(N_\text{Total}\).

# perform calculations here
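Since \(10^x\) is increasing, applying it to both endpoints of the interval for \(a\) gives an approximate interval for \(N_\text{Total}\). A sketch:

# 10^confint(lm.quakes)[1, ]   # transform the endpoints of the interval for a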

REPLACE TEXT WITH RESPONSE

According to your model, how many earthquakes of magnitude 8 or greater do you expect to see on average every year?

# perform calculations here
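This is just the GR law evaluated at \(M=8\); a sketch:

# 10^(a - b * 8)   # expected number of magnitude >= 8 events per year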

REPLACE TEXT WITH RESPONSE

According to your model, how many years on average would you expect to pass between earthquakes with magnitude between 9.5 and 10?

# perform calculations here
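One approach (a sketch): the expected yearly number of events with \(9.5\le M<10\) is \(N(9.5)-N(10)\), and the average waiting time between such events is its reciprocal.

# n.per.year = 10^(a - b * 9.5) - 10^(a - b * 10)   # events per year in [9.5, 10)
# 1 / n.per.year                                    # average years between such events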

REPLACE TEXT WITH RESPONSE