14 strings, factors, dates, and times

This chapter discusses some of the types of data other than numeric and logical, in particular strings, factors, and dates/times.

For the time being, please consider this as a supplement to R4DS, 2e, Chapter 14.

14.1 strings

Strings are sets of characters which may include “123” as well as “why *DID* the chicken cross the road?” Samples of text, from names to novels, are the most interesting type of string.

Among the tools that are used in examining texts are searches (do these tweets include language associated with hate speech?), validity checks (does the string correspond to a valid zip code?), and reformatting (to lower case so that BOB, Bob, and bob are all coded as identical). These ideas are simple, but quickly become challenging when, for example, the strings in which we are interested include characters that R usually interprets as code - such as commas, quotes, and slashes. See the section on string basics (14.2) for how to “escape” these characters, for example, how to treat a hashtag (#) as just a character as opposed to the beginning of a comment. These rules are codified as regular expressions (regex, sometimes regexp). Regex are not unique to R, but are shared with other languages as well.

In R, particularly in the tidyverse package, regex are typically implicit, represented within commands that are part of the stringr package and that typically begin with str_. For example, str_detect returns a set of logical values:

donuts <- c("glazed", "cakes", "Pink sprinkled",
            "cream filled",
            "day-old frosted", "chocolates")
donuts %>% 
    str_detect(" ")

## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE

Most of the str_ functions are straightforward, but remember that str_sub provides a subset, not a substitution; to change a string, use str_replace. As Hadley points out in R4DS 2e, the autocomplete function in R_studio is very handy for helping you explore the different functions - in your console, type str_ … then scroll through the possibilities.

donuts %>% str_sub(1,5)

## [1] "glaze" "cakes" "Pink " "cream" "day-o" "choco"

donuts %>% str_replace(" ","_")

## [1] "glazed"          "cakes"           "Pink_sprinkled"  "cream_filled"   
## [5] "day-old_frosted" "chocolates"

As you work with texts, simple problems sometimes require sophisticated codes. The regex that are used to solve these problems quickly become dense and challenging.

One tool that can help you is the str_view command, which returns highlighted text showing corresponding passages. For example:

slashMovieTitles <- c("Face/off", 
                     "8 1/2", 
                     "F/X",
                     "Frost/Nixon",
                     "Victor/Victoria")
slashMovieTitles %>% str_view ("/")

## [1] │ Face</>off
## [2] │ 8 1</>2
## [3] │ F</>X
## [4] │ Frost</>Nixon
## [5] │ Victor</>Victoria

Regex statements are dense statements that allow us to work efficiently, for example, with special characters (like backslashes), repeated characters (zzz), and sets of characters [AEIOU], and characters at the beginning (^) or end ($) of strings. See R4DS 2e Chapter 15 for more.

A particularly useful function in stringr is str_split, which can be used to quickly break a text into discrete words. Note that using a space and the explicit “word” boundary give different results.

str_split(donuts, " ", simplify = TRUE)

##      [,1]         [,2]       
## [1,] "glazed"     ""         
## [2,] "cakes"      ""         
## [3,] "Pink"       "sprinkled"
## [4,] "cream"      "filled"   
## [5,] "day-old"    "frosted"  
## [6,] "chocolates" ""

str_split(donuts, boundary ("word"), simplify = TRUE)

##      [,1]         [,2]        [,3]     
## [1,] "glazed"     ""          ""       
## [2,] "cakes"      ""          ""       
## [3,] "Pink"       "sprinkled" ""       
## [4,] "cream"      "filled"    ""       
## [5,] "day"        "old"       "frosted"
## [6,] "chocolates" ""          ""

The output of str_split is generally a list (more on that soon), but here the lists are simplified into tibbles. In the tidyverse, str_split is typically one of the first steps in preparing text. The tidytext package (https://www.tidytextmining.com/), which is discussed at length in the computational social science course, builds on this foundation and is a powerful set of tools for all sorts of problems in formal text analysis.

14.2 factors

Conditions (experimental vs control), categories (male or female), types (scorpio, “hates astrology”) and other nominal measures are categorical variables or factors. In the tidyverse, the r package for dealing with this type of measure is forcats, one of the core parts of the tidyverse.

Here’s an example of a categorical variable. Why is it set up like this, and what does it do?

# Example of a factor
eyes <- factor(x = c("blue", "green",
                     "green", "zombieRed"), 
               levels = c("blue", "brown",
                          "green"))
eyes

## [1] blue  green green <NA> 
## Levels: blue brown green

In base R, string variables (“donut”, “anti-Brexit”, and “yellow”) are generally treated as factors by default. In the tidyverse, string variables are treated as strings until they are explicitly declared as factors.

The syntax for working with factors-as-categories is given in Chapter 16 of R4DS 2e. I will not duplicate that here, but I will point out that factors are represented internally in R as numbers, and converting (coercing) factors to other data types can occasionally lead to nasty surprises. Sections 16.4 and 16.5 describe how factors can be cleanly reordered and modified.

14.2.1 types of babies

In the babynames data, baby’s gender is a categorical variable, which is treated (because tidyverse) as a character or string. Here, we make it into a factor. We create two other factors as well.

# adding third level for non-binary babies
sexlevels <- c("M", "F", "O")
babynames2 <- babynames %>% 
    mutate(sex = factor(sex,
               levels =  sexlevels)) %>% 
    mutate(beginVowel = case_when(
        substr(name,1,1) %in%
            c("A","E","I","O","U") ~ "Vowel",
        TRUE ~ "Consonant")) %>% 
    mutate(beginVowel = factor(beginVowel)) %>% 
    mutate (century = case_when(
        year < 1900 ~ "19th",
        year < 2000 ~ "20th",
        year > 1999 ~ "21st")) %>% 
    mutate(century = factor(century))

Use the syntax above to create types of names for different generations (boomers, gen x, Millenials, gen z). Use https://www.kasasa.com/articles/generations/gen-x-gen-y-gen-z to determine your groupings.

Say something interesting about the data - names, genders, etc. Plot this.

14.2.2 types of grown-ups

If you would instead like to examine survey data, the forcats package includes a set of categorical variables.

Using the discussion in Chapter 15 of R4DS as your guide, examine the relationship between two or more of these categorical variables. Again, plot these

gss_cat

## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # ℹ 21,473 more rows

14.3 dates

The challenges of combining time-demarcated data (Chapter 17 of R4DS 2e) are significant. For dates, a variety of different formats (3-April, October 23, 1943, 10/12/92) must be made sense of. Sometimes we are concerned with durations (how many days, etc.); on other occasions, we are concerned with characteristics of particular dates (as in figuring out the day of the week on which you were born). And don’t forget about leap years.

In R, the lubridate package (a non-core part of the tidyverse, i.e., one that you must load separately) helps to handle dates and times smoothly. It anticipates many of the problems we might encounter in extracting date and time information from strings. Lubridate generally works well to simplify files with dates and times, and can be used to help in data munging. For example, in my analyses of the Corona data, dates and times were reported in four different ways. The code below decodes these transparently and combines them into a common date/time format .

2/3/20 6 PM

2/3/20 18:00

2/3/20 18:00:00

2020-02-03 18:00:00

# not run
#coronaData2 <- coronaData %>% mutate
#   (`FixedDate = 
#           parse_date_time(`Last Update`,
#                           c('mdy hp','mdy HM',
#                             'mdy HMS','ymd HMS')))

14.4 times

Working with temporal data is often challenging. The existence of, for example, 12 versus 24 hour clocks, time zones, and daylight savings, can make a simple question about duration quite challenging.

Imagine that Fred was born in Singapore at the exact moment of Y2K. He now lives in NYC. How many hours has he been alive as of right now? How would you solve this?

# find timezones for Singapore and NYC
# a = get datetime for Y2K in Singapore in UTC
# b = get datetime for now in NYC in UTC
# compute difference and express in sensible metric