Reshaping and Joining Data Frames

In this lesson, you will learn to:

Reshape your dataset by pivoting longer or wider
Use mutating joins and filtering joins to combine different datasets with related information.
Decompose a complicated problem into small, manageable steps.

Time Estimates:

Videos: 20 min

Readings: 20-60 min

Activities: 60 min

Check-ins: 3

Reshaping Data

Decomposition

Required Reading: Computational Thinking

Please read the Decomposition and Hidden Assumptions sections only - stop when you reach the beginning of the Turn it into a recipe (let’s make an algorithm) section.

Tidy Data and Reshaping

Required Video: Reshaping Data (Pivoting)

Recommended Reading: R4DS 12.1-12.3: Tidy Data

Optional Reading: R4DS 12.4-12.5: Separate/Unite and Missing Values

(A few more tricks for data cleaning/wrangling, if you’re interested.)

Check-In 1: Pivoting

Question 1: Create a new dataset called cereals_3, that has three columns:

The name of the cereal
A column called “Nutrient” with values protein, fat, or fiber.
A column called “Amount” with the corresponding amount of the nutrient.

Question 2: Why didn’t we have to add a rowid to pivot wider in this case?

Check-In 2: Decomposition

Cereals in this dataset are placed on shelf 1, 2, or 3. We would like to know if these cereal placements correspond to different nutritional values; for example, perhaps sugary cereals made for children are on a lower shelf.

Create a new dataset called cereals_4, that has four columns:

The name of the manufacturer
The mean amount of sugar in cereals on shelf 1.
The mean amount of sugar in cereals on shelf 2.
The mean amount of sugar in cereals on shelf 3.

Joining data

Required Reading: Mutating Joins

Recommended Reading: R4DS Chapter 13: Relational Data

Check-In 3: Practice with Joins

The following code creates three datasets:

data1 <- data.frame(
  professor = c("Bodwin", "Glanz", "Carlton", "Sun"),
  undergrad_school = c("Harvard", "Cal Poly", "Cal Poly", "Harvard"),
  grad_school = c("UNC", "Boston University", "UCLA", "Stanford")
)

data2 <- data.frame(
  professor = c("Bodwin", "Glanz", "Carlton"),
  Stat_331 = c(TRUE, TRUE, TRUE),
  Stat_330 = c(FALSE, TRUE, TRUE),
  Stat_431 = c(TRUE, TRUE, FALSE)
)

data3 <- data.frame(
  course = c("Stat_331", "Stat_330", "Stat_431"),
  num_sections = c(8, 3, 1)
)

Here is what they look like once created:

##   professor undergrad_school       grad_school
## 1    Bodwin          Harvard               UNC
## 2     Glanz         Cal Poly Boston University
## 3   Carlton         Cal Poly              UCLA
## 4       Sun          Harvard          Stanford

##   professor Stat_331 Stat_330 Stat_431
## 1    Bodwin     TRUE    FALSE     TRUE
## 2     Glanz     TRUE     TRUE     TRUE
## 3   Carlton     TRUE     TRUE    FALSE

##     course num_sections
## 1 Stat_331            8
## 2 Stat_330            3
## 3 Stat_431            1

These datasets contain information about four Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.

We’d like to use all three datasets to figure out the best assignment of courses.

Question 1: Warm-up

Combine datasets 1 and 2 to make this:

## Joining, by = "professor"

##   professor undergrad_school       grad_school Stat_331 Stat_330 Stat_431
## 1    Bodwin          Harvard               UNC     TRUE    FALSE     TRUE
## 2     Glanz         Cal Poly Boston University     TRUE     TRUE     TRUE
## 3   Carlton         Cal Poly              UCLA     TRUE     TRUE    FALSE
## 4       Sun          Harvard          Stanford       NA       NA       NA

Combine datasets 1 and 2 to make this:

## Joining, by = "professor"

##   professor undergrad_school       grad_school Stat_331 Stat_330 Stat_431
## 1    Bodwin          Harvard               UNC     TRUE    FALSE     TRUE
## 2     Glanz         Cal Poly Boston University     TRUE     TRUE     TRUE
## 3   Carlton         Cal Poly              UCLA     TRUE     TRUE    FALSE

Combine datasets 2 and 3 to make this:

## Joining, by = "course"

## # A tibble: 9 x 4
##   professor course   can_teach num_sections
##   <chr>     <chr>    <lgl>            <dbl>
## 1 Bodwin    Stat_331 TRUE                 8
## 2 Bodwin    Stat_330 FALSE                3
## 3 Bodwin    Stat_431 TRUE                 1
## 4 Glanz     Stat_331 TRUE                 8
## 5 Glanz     Stat_330 TRUE                 3
## 6 Glanz     Stat_431 TRUE                 1
## 7 Carlton   Stat_331 TRUE                 8
## 8 Carlton   Stat_330 TRUE                 3
## 9 Carlton   Stat_431 FALSE                1

Question 2: Solve the puzzle

We would like to assign teachers for next year, with the following guidelines:

Each professor should be given 4 sections.
At least half of all 331 sections should be taught by Cal Poly graduates.
More than half of all 330 sections should be taught by UCLA graduates.
Each professor should teach at least 2 different classes.

Combine the three datasets into a single data frame that helps you find the answer to the question:

How many sections of Stat 331 will Dr. Glanz teach?