In this lesson, you will learn to:
Reshape your dataset by pivoting longer or wider
Use mutating joins and filtering joins to combine different datasets with related information.
Decompose a complicated problem into small, manageable steps.
Please read the Decomposition and Hidden Assumptions sections only - stop when you reach the beginning of the Turn it into a recipe (let’s make an algorithm) section.
(A few more tricks for data cleaning/wrangling, if you’re interested.)
Question 1: Create a new dataset called cereals_3
, that has three columns:
The name of the cereal
A column called “Nutrient” with values protein
, fat
, or fiber
.
A column called “Amount” with the corresponding amount of the nutrient.
Question 2: Why didn’t we have to add a rowid
to pivot wider in this case?
Cereals in this dataset are placed on shelf 1, 2, or 3. We would like to know if these cereal placements correspond to different nutritional values; for example, perhaps sugary cereals made for children are on a lower shelf.
Create a new dataset called cereals_4
, that has four columns:
The name of the manufacturer
The mean amount of sugar in cereals on shelf 1.
The mean amount of sugar in cereals on shelf 2.
The mean amount of sugar in cereals on shelf 3.
The following code creates three datasets:
data1 <- data.frame(
professor = c("Bodwin", "Glanz", "Carlton", "Sun"),
undergrad_school = c("Harvard", "Cal Poly", "Cal Poly", "Harvard"),
grad_school = c("UNC", "Boston University", "UCLA", "Stanford")
)
data2 <- data.frame(
professor = c("Bodwin", "Glanz", "Carlton"),
Stat_331 = c(TRUE, TRUE, TRUE),
Stat_330 = c(FALSE, TRUE, TRUE),
Stat_431 = c(TRUE, TRUE, FALSE)
)
data3 <- data.frame(
course = c("Stat_331", "Stat_330", "Stat_431"),
num_sections = c(8, 3, 1)
)
Here is what they look like once created:
## professor undergrad_school grad_school
## 1 Bodwin Harvard UNC
## 2 Glanz Cal Poly Boston University
## 3 Carlton Cal Poly UCLA
## 4 Sun Harvard Stanford
## professor Stat_331 Stat_330 Stat_431
## 1 Bodwin TRUE FALSE TRUE
## 2 Glanz TRUE TRUE TRUE
## 3 Carlton TRUE TRUE FALSE
## course num_sections
## 1 Stat_331 8
## 2 Stat_330 3
## 3 Stat_431 1
These datasets contain information about four Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.
We’d like to use all three datasets to figure out the best assignment of courses.
Question 1: Warm-up
## Joining, by = "professor"
## professor undergrad_school grad_school Stat_331 Stat_330 Stat_431
## 1 Bodwin Harvard UNC TRUE FALSE TRUE
## 2 Glanz Cal Poly Boston University TRUE TRUE TRUE
## 3 Carlton Cal Poly UCLA TRUE TRUE FALSE
## 4 Sun Harvard Stanford NA NA NA
## Joining, by = "professor"
## professor undergrad_school grad_school Stat_331 Stat_330 Stat_431
## 1 Bodwin Harvard UNC TRUE FALSE TRUE
## 2 Glanz Cal Poly Boston University TRUE TRUE TRUE
## 3 Carlton Cal Poly UCLA TRUE TRUE FALSE
## Joining, by = "course"
## # A tibble: 9 x 4
## professor course can_teach num_sections
## <chr> <chr> <lgl> <dbl>
## 1 Bodwin Stat_331 TRUE 8
## 2 Bodwin Stat_330 FALSE 3
## 3 Bodwin Stat_431 TRUE 1
## 4 Glanz Stat_331 TRUE 8
## 5 Glanz Stat_330 TRUE 3
## 6 Glanz Stat_431 TRUE 1
## 7 Carlton Stat_331 TRUE 8
## 8 Carlton Stat_330 TRUE 3
## 9 Carlton Stat_431 FALSE 1
Question 2: Solve the puzzle
We would like to assign teachers for next year, with the following guidelines:
Each professor should be given 4 sections.
At least half of all 331 sections should be taught by Cal Poly graduates.
More than half of all 330 sections should be taught by UCLA graduates.
Each professor should teach at least 2 different classes.
Combine the three datasets into a single data frame that helps you find the answer to the question:
How many sections of Stat 331 will Dr. Glanz teach?