In this module, you’ll review a few basic hypothesis tests, and learn how to make R do the calculations for you.
If your memory of hypothesis testing is fresh, you may be able to skip the review parts of these sections. You are only expected to know the basic idea behind each test, not every detail.
A t-test is used when your hypotheses involve one or two mean values, such as
\[ H_0: \mu_1 = \mu_2 \]
\[ H_a: \mu_1 > \mu_2 \]
Functions: t.test() in base R, or t_test() in the infer package.
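For example, the Welch test output below can be produced by a call along these lines (a sketch, assuming the penguins data from the palmerpenguins package is loaded):

library(palmerpenguins)

# Welch two-sample t-test: does mean bill length differ by sex?
t.test(bill_length_mm ~ sex, data = penguins)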
##
## Welch Two Sample t-test
##
## data: bill_length_mm by sex
## t = -6.6725, df = 329.29, p-value = 1.066e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.865676 -2.649908
## sample estimates:
## mean in group female   mean in group male
##             42.09697             45.85476
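The infer version looks something like the following (again a sketch; note that no order argument is supplied, which is what triggers the warning below):

library(infer)

# Same comparison with infer's t_test(), which returns a tidy tibble
penguins %>%
  t_test(response = bill_length_mm, explanatory = sex)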
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "female" - "male", or divided in the order "female" / "male" for ratio-
## based statistics. To specify this order yourself, supply `order = c("female",
## "male")`.
## # A tibble: 1 x 6
##   statistic  t_df  p_value alternative lower_ci upper_ci
##       <dbl> <dbl>    <dbl> <chr>          <dbl>    <dbl>
## 1     -6.67  329. 1.07e-10 two.sided      -4.87    -2.65
State (in words) the null and alternative hypotheses for the test in the code above.
State your conclusion.
A Chi-Square Test is used when your hypotheses involve counts or percents.
Similarly, one option is chisq.test() in base R, which needs a two-way table as input. The other option is chisq_test() in infer, which takes a data frame and variables as input. Be careful, though: the variables must be categorical to be appropriate for a Chi-Square test.
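The base R output below likely came from code along these lines (a sketch: the chunk that builds my_table isn't shown, so this construction of a wide count table of mtcars cylinders by gears is an assumption):

library(tidyverse)

# A two-way table as a data frame; the first column holds the cyl labels
my_table <- mtcars %>%
  count(cyl, gear) %>%
  pivot_wider(names_from = gear, values_from = n, values_fill = 0)

# Drop the label column so only the counts are tested
chisq.test(my_table[, -1])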
## Warning in chisq.test(my_table[, -1]): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: my_table[, -1]
## X-squared = 18.036, df = 4, p-value = 0.001214
mtcars %>%
  mutate(
    cyl = factor(cyl),
    gear = factor(gear)
  ) %>%
  chisq_test(
    response = gear,
    explanatory = cyl
  )
## Warning in stats::chisq.test(table(x), ...): Chi-squared approximation may be
## incorrect
## # A tibble: 1 x 3
##   statistic chisq_df p_value
##       <dbl>    <int>   <dbl>
## 1      18.0        4 0.00121
Why did we include the [,-1] in the first code chunk?
Why did we include the mutate() step in the second code chunk?
What happens if you swap the response and explanatory variable in the second code chunk?
What do you conclude from this test?
The tests above, and others like them, assume a distribution for your test statistic.
We assume that a difference of sample means is approximately Normal (hence the t-test), because of the Central Limit Theorem. There is also underlying math showing that the test statistic for the Chi-Square test has - you guessed it! - a Chi-Square distribution.
These are called parametric tests.
However, sometimes we are not confident that the assumptions needed to justify such a distribution are met, or perhaps we are interested in a test statistic that does not have an easy-to-derive distribution. In these cases, we might want to use a nonparametric test.
(Bootstrapping is a form of nonparametric analysis!)
The permutation test relies on randomly shuffling (permuting) the data to determine how “extreme” the original data is.
(Stop at 8 minutes in.)
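To make the idea concrete before the exercise, here is a minimal hand-rolled sketch in base R (an illustration on mtcars, not the exercise below): shuffle the group labels many times, recompute the statistic for each shuffle, and see how often a shuffled result is as extreme as the observed one.

# Observed difference in mean mpg between transmission types
obs_diff <- with(mtcars, mean(mpg[am == 1]) - mean(mpg[am == 0]))

# Recompute the difference after randomly shuffling the am labels
set.seed(1)
perm_diffs <- replicate(1000, {
  shuffled <- sample(mtcars$am)
  mean(mtcars$mpg[shuffled == 1]) - mean(mtcars$mpg[shuffled == 0])
})

# Approximate two-sided p-value: proportion of shuffles at least as extreme
mean(abs(perm_diffs) >= abs(obs_diff))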
## Warning: Removed 11 rows containing missing values.
my_test <- penguins %>%
  ____(response = bill_length_mm, explanatory = sex) %>%
  ____(null = "____") %>%
  ____(reps = 1000, type = "____") %>%
  ____(stat = "____", order = c("male", "female"))

____(my_test)
Fill in the yellow blanks with the appropriate functions from the infer package.
Fill in the pink blanks with the appropriate arguments.