Lab: Simulation and the Central Limit Theorem
Introduction
In your coursework, you learned a bit about the Central Limit Theorem.
In this lab, you will use simulation to illustrate this concept.
Part One: Summarize fake data
1. Create a new dataset containing four variables: one that comes from a Normal distribution, one from a Uniform distribution, one from a Binomial distribution, and one from an Exponential distribution.
   You should not use the default options for these distributions; e.g., your Normal data should not have a mean of 0 or a standard deviation of 1, and your Binomial data should not have a probability of 0.5.
   Your data should have 30 rows.
   (We did not learn about the Exponential distribution in lecture; I would like you to figure out the r, d, p, q functions for this new distribution yourself.)
   Feel free to make up silly names for your fake variables, and/or to add fake names or other labels to this dataset, if you are inspired.
2. Calculate the mean and standard deviation for each of your four variables.
3. Now repeat steps (1) and (2), using the same distributions, but instead make 1000 rows in your dataset.
   Comment on the means and standard deviations in this section, as compared to those in (2).
4. Make a histogram for each of your four variables, with the underlying distribution overlaid on top.
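As a rough illustration of the overlay idea, here is a minimal base R sketch for a Normal variable only. The variable name `fake_heights` and the parameters (mean 10, standard deviation 2) are made up for this example; the other three distributions are left for you to work out.

```r
set.seed(1)

# Hypothetical variable: 30 draws from a Normal with mean 10 and sd 2
fake_heights <- rnorm(30, mean = 10, sd = 2)

# Sample summaries, to compare against the true values 10 and 2
fake_summary <- c(mean = mean(fake_heights), sd = sd(fake_heights))

# freq = FALSE puts the bars on the density scale, so the curve is comparable
hist(fake_heights, freq = FALSE, main = "Normal(10, 2) sample", xlab = "value")

# dnorm() is the "d" (density) member of the r/d/p/q family for the Normal
curve(dnorm(x, mean = 10, sd = 2), add = TRUE, lwd = 2)
```

The same pattern should carry over to other continuous distributions by swapping in the matching density function. A discrete variable such as the Binomial may look better with points or bars than with a smooth curve.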
Part Two: Generating sample means
1. Write a function called sample_mean. This function should take as input a vector vec and an integer n. It should take a random sample of size n from vec, then calculate and return the mean of that subsample.
2. Write a function called many_sample_means. This function should take as input a vector vec, an integer n, and an integer reps. It should perform the sample_mean process many times (reps) and return a vector of the results.
3. Write a function called sample_means_ns. This function should take as input a vector vec, an integer reps, and a vector ns. It should perform the many_sample_means process for each of the values in the ns vector. It should return a data frame with the results.
For example, if ns <- c(5, 50, 500) and reps = 2, you would return something like:
## # A tibble: 6 x 2
##   sample_mean     n
##         <dbl> <dbl>
## 1        3.48     5
## 2        1.32    50
## 3        1.72   500
## 4        3.24     5
## 5        4.40    50
## 6        7.53   500
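The three functions above can be assembled from a few base R building blocks. The sketch below only shows how those pieces behave on a made-up vector; `toy_vec`, `many_maxes`, and `toy_results` are hypothetical names, and the maximum is used as a stand-in statistic so that this is not a solution to the lab.

```r
set.seed(42)

# Hypothetical toy vector, not part of the lab's dataset
toy_vec <- c(2, 4, 6, 8, 10)

# sample() draws `size` elements, without replacement unless replace = TRUE
one_draw <- sample(toy_vec, size = 3)

# replicate() evaluates an expression repeatedly and collects the results;
# the maximum is used here only as a stand-in statistic
many_maxes <- replicate(4, max(sample(toy_vec, size = 3)))

# data.frame() pairs each result with a label, one row per result
toy_results <- data.frame(stat = many_maxes, n = 3)
```

Whether you sample with or without replacement is a design choice worth stating in your writeup, since it changes how the results behave when n is close to the length of vec.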
Include the following in your final R Markdown to show your functions work:
Part Three: Putting it all together
For any two of the four variables in your fake dataset from Part One, do the following:
1. Use your many_sample_means function with reps = 1000 and n = 10.
   - Make histograms of each of your results (no overlay required).
   - Calculate the mean and standard deviation of each of your results.
2. Use your many_sample_means function with reps = 1000 and n = 500.
   - Make histograms of each of your results (no overlay required).
   - Calculate the mean and standard deviation of each of your results.
3. Comment on the differences or similarities between (1) and (2).
4. Use your sample_means_ns function to try a variety of values of n.
   Calculate the standard deviation of the results for each value of n. Make a plot that shows how the standard deviation of the sample means changes with n.
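For comparison with your simulated plot, theory says the standard deviation of a mean of n independent draws is the population standard deviation divided by the square root of n. The sketch below plots that reference curve; the values of `sigma` and `ns` are assumptions chosen for illustration, not values from the lab.

```r
# Assumed values for illustration only
sigma <- 2                      # hypothetical population standard deviation
ns <- c(5, 10, 50, 100, 500)    # hypothetical sample sizes

# Standard deviation of the sample mean for n independent draws
theoretical_sd <- sigma / sqrt(ns)

# A log-scaled x axis makes the steady decline easier to see
plot(ns, theoretical_sd, type = "b", log = "x",
     xlab = "n (log scale)", ylab = "SD of sample means")
```

If your functions sample without replacement from a vector of length N, the simulated values will probably fall below this curve once n is a large fraction of N, because the finite population correction sqrt((N - n) / (N - 1)) applies.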
Part Four: Appreciate the CLT
You have been told that the amount of time you have to wait for a bus from Cal Poly to Downtown SLO is distributed exponential(0.02); that is, the true average wait time is 1/0.02 = 50 minutes.
You think this might be a lie. In the last 30 days, you have waited for the bus for 55 minutes on average.
If the bus system is telling the truth about the exponential(0.02) distribution, how unlucky were you this month?
1. Simulate 10000 random values from the exponential(0.02) distribution. Use your many_sample_means function on these values, with n = 30 and reps = 1000. How many times did your sample mean exceed 55?
2. Use the Central Limit Theorem to assume that a sample mean of exponentially distributed values is Normally distributed, with mean \(50\) and standard deviation \(50/\sqrt{n}\). Find the probability that a sample mean exceeds 55.
3. Comment on (1) and (2). Were the answers similar? Do you believe that bus wait times really are distributed exponential(0.02)?
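The Normal approximation in step (2) can be set up as below. This is a sketch under the lab's stated assumptions (mean 50, standard deviation 50/sqrt(n), n = 30); the variable names are illustrative, and it relies on the fact that an exponential distribution with rate 0.02 has both mean and standard deviation equal to 1/0.02.

```r
rate <- 0.02
n <- 30

mu <- 1 / rate                 # mean of exponential(0.02)
se <- (1 / rate) / sqrt(n)     # sd of the sample mean under the CLT

# Standardized distance of the observed mean (55) from the claimed mean
z <- (55 - mu) / se

# Upper-tail probability P(sample mean > 55) under the Normal approximation
p_exceed <- pnorm(55, mean = mu, sd = se, lower.tail = FALSE)
```

To compare with step (1), divide the number of simulated sample means above 55 by reps, which gives a proportion on the same scale as `p_exceed`.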
Challenge 1:
The Central Limit Theorem works for the sample mean. Does it work for any other summary statistics?
Try out at least two other statistics. Some suggestions:
- the median
- the variance
- the midhinge
- the maximum
Write a very brief argument, using simulation and visualization, about whether or not the CLT works on the statistic in question.
Upload your writeup separately.
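Of the suggested statistics, the midhinge has no built-in R function. It is the average of the first and third quartiles, and one possible helper is sketched here. The name `midhinge` and the use of R's default quantile method are choices made for this sketch; other quartile conventions can give slightly different values.

```r
# Midhinge: the average of the first and third quartiles.
# Uses quantile()'s default method (type 7); this is an assumption of the sketch.
midhinge <- function(x) {
  quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  mean(quartiles)
}

# Small check on a symmetric vector: the midhinge matches the median
check_value <- midhinge(1:9)
```

A helper like this can be dropped in wherever mean() appears in your simulation code, so you can reuse the same workflow for each statistic you test.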
Challenge 2:
Put your three functions (sample_mean, many_sample_means, sample_means_ns) into an R package in a GitHub repo. You may copy the twelvedays package and make changes to that infrastructure to make this easier.