In this lesson, you will learn to:
Write regular expressions to match strings.
Use special regular expression characters for general matching.
Use stringr functions to analyze text with regular expressions.
Question 1:
Recall that the regular expression [abc] matches the characters a, b, or c.
What does [^abc] match?
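As a reminder of how a plain character class behaves, here is a minimal sketch. The input string "a cab" is made up for illustration. Running the same call with [^abc] is a quick way to check your answer.

```r
library(stringr)

# "a cab" is a made-up example string, not part of the exercise.
# [abc] matches any single character that is a, b, or c.
str_extract_all("a cab", "[abc]")[[1]]
# returns "a" "c" "a" "b"
```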
Question 2:
When it is not inside square brackets, the ^ symbol means “start of string”.
What will be returned by the following?
Question 3:
The $ symbol in a regular expression means “end of string”.
What will be returned by the following?
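The two anchors can be sketched side by side on a made-up vector. The name fruits and its contents are assumptions for illustration, not the code from Questions 2 and 3.

```r
library(stringr)

# Hypothetical example data
fruits <- c("apple", "banana", "pear")

# ^a matches only when the string starts with "a"
str_detect(fruits, "^a")
# returns TRUE FALSE FALSE

# a$ matches only when the string ends with "a"
str_detect(fruits, "a$")
# returns FALSE TRUE FALSE
```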
What will each of the following return?
my_str <- "The Dursleys of Number 4 Privet Drive were happy to say that they were perfectly normal, thank you very much."
str_extract_all(my_str, ".*")
str_extract_all(my_str, "\\w")
str_extract_all(my_str, "\\s")
str_extract_all(my_str, "[:alpha:]+")
str_extract_all(my_str, "[:alpha:]*\\.")
str_extract_all(my_str, "[wv]er[ey]")
my_str <- "The Dursleys of Number 4 Privet Drive were happy to say that they were perfectly normal, thank you very much."
str_extract_all(my_str, "[:digit:] ([A-Z][a-z]*)+")
str_extract_all(my_str, "(?<=[:digit:] )[:alpha:]+")
str_extract_all(my_str, "[:digit:].*Drive")
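my_str %>%
  str_split(" ") %>%   # split on spaces; returns a list of length one
  unlist() %>%         # flatten to a character vector of words
  str_extract("^[A-Z]")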
The file hamlet_speech.txt, posted on the course site, contains the text of a famous speech from the play “Hamlet” by Shakespeare. Download this file and save it somewhere reasonable. Read it into R with:
Answer the following:
How many words are in the speech? (Hint: str_count())
How many times does Hamlet reference death or dying?
How many sentences are in the speech?
What is the longest word in the speech?
What is the only capitalized word that does not start a sentence or line?
Hint: Right now, your object is a vector of type character, where each element is a line of the speech. You may want to use str_c() (with appropriate arguments) to turn this into a single string. You may also want to turn it into a vector where each element is one word.
Or you may want to do all three! Different tasks will be easier with different object structures.
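The reshaping described in the hint might look like the sketch below. It uses a made-up two-line vector rather than the speech itself, and the names lines_vec, one_string, and words are assumptions for illustration.

```r
library(stringr)

# Hypothetical stand-in for a character vector with one element per line
lines_vec <- c("the quick brown", "fox jumps")

# Collapse all lines into a single string, separated by spaces
one_string <- str_c(lines_vec, collapse = " ")

# Split the single string into a vector with one word per element
words <- str_split(one_string, "\\s+")[[1]]

length(words)
# returns 5
```

Punctuation stays attached to words under this split, so a real text may need extra cleaning before you count words or measure their length.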