25 exercises
25.2 now look at your data in the spreadsheet
In this spreadsheet, what are the means for GPA and SAT? What are the standard deviations? What is the difference between ‘sort range’ and ‘sort sheet’ in Google Sheets?
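(If you'd like to cross-check your spreadsheet answers in R, here is a minimal sketch. It assumes the same data has been loaded into R as hsdata, with numeric columns gpa and sat — the names that appear in the output later in this chapter.)
# cross-check the spreadsheet summaries in R (assumes `hsdata` with `gpa`, `sat`)
library(dplyr)

hsdata |>
  summarise(mean_gpa = mean(gpa, na.rm = TRUE),
            sd_gpa   = sd(gpa, na.rm = TRUE),
            mean_sat = mean(sat, na.rm = TRUE),
            sd_sat   = sd(sat, na.rm = TRUE))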
What do you think about the randomNames package? Do the names appear “representative” (of what)? If you wanted to generate a more representative sample of names, how might you proceed?
25.3 using AI to help us here
Our intuitions about the idiosyncratic nature of names in the randomNames package are supported (though not necessarily confirmed) by MS Copilot, which notes that “the randomNames package in R samples from a uniform distribution, which doesn’t reflect real-world name frequencies.”
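To see this behavior for yourself, you can generate a handful of names directly. This is a minimal sketch that assumes the randomNames package is installed; the which.names argument selects first names only.
# draw a few first names from randomNames; names are drawn from its internal
# lists without regard to real-world frequency
library(randomNames)

set.seed(33458)
randomNames(10, which.names = "first")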
For first names, one solution is to use the babynames data, and then to weight the probability of including each name in our sample by its actual frequency in (some) population:
library(dplyr)
library(babynames)

# load babynames, filter for a specific year range, and total the counts
names_data <- babynames %>%
  filter(year >= 2000) %>%
  group_by(name) %>%
  summarise(freq = sum(n)) %>%
  ungroup()
# Sample names with probability proportional to frequency;
# nrows is the number of rows in our HS data (defined earlier)
set.seed(33458)
sampled_first_names <- sample(names_data$name, size = nrows,
                              replace = TRUE, prob = names_data$freq) |>
  as_tibble() |>
  rename(NewFirstName = 1)
slice(sampled_first_names, 1:10)
## # A tibble: 10 × 1
## NewFirstName
## <chr>
## 1 Tea
## 2 Emma
## 3 Olivia
## 4 Noemi
## 5 Ruger
## 6 Madelyn
## 7 Jeffrey
## 8 Caleb
## 9 Audrey
## 10 Elizabeth
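One quick way to check that the weighting worked is to tally the sampled first names: heavily weighted names (Emma, Olivia, Elizabeth, and so on) should show up many times, while rare names appear only once or twice. A short sketch, reusing the objects created above:
# count how often each sampled first name appears, most frequent first
sampled_first_names |>
  count(NewFirstName, sort = TRUE) |>
  slice(1:10)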
25.3.1 for last names, it’s a little bit trickier
For last names, we’ll again weight by frequency, this time using Census data. There’s no R package that has these, but the Census Bureau provides the data in a zip file, described at https://www2.census.gov/topics/genealogy/2010surnames/surnames.pdf. Two files are available: one with the top 1000 names, and a (much longer) one with all names that occur more than 100 times. We’ll use the latter. Why do we do this?
Our approach to downloading a zip file directly into R is as follows: first, we specify a temporary file and download the zip file into it. Then we unzip it, saving the contents in a subdirectory called ‘data’. Finally, we read the CSV into R:
# download the Census surname zip file to a temporary location
temp_zip_file <- tempfile(fileext = ".zip")
sourcefile <- "https://www2.census.gov/topics/genealogy/2010surnames/names.zip"
download.file(sourcefile, destfile = temp_zip_file, mode = "wb")

# extract the CSV into the data/ subdirectory, read it in, and sort by rank
unzip(temp_zip_file, files = "Names_2010Census.csv", exdir = "data")
surnames <- read.csv("data/Names_2010Census.csv") |>
  arrange(rank)
slice(surnames, 1:10)
## name rank count prop100k cum_prop100k pctwhite pctblack pctapi pctaian pct2prace pcthispanic
## 1 ALL OTHER NAMES 0 29312001 9936.97 9936.97 66.65 8.53 7.97 0.86 2.32 13.67
## 2 SMITH 1 2442977 828.19 828.19 70.9 23.11 0.5 0.89 2.19 2.4
## 3 JOHNSON 2 1932812 655.24 1483.42 58.97 34.63 0.54 0.94 2.56 2.36
## 4 WILLIAMS 3 1625252 550.97 2034.39 45.75 47.68 0.46 0.82 2.81 2.49
## 5 BROWN 4 1437026 487.16 2521.56 57.95 35.6 0.51 0.87 2.55 2.52
## 6 JONES 5 1425470 483.24 3004.80 55.19 38.48 0.44 1 2.61 2.29
## 7 GARCIA 6 1166120 395.32 3400.12 5.38 0.45 1.41 0.47 0.26 92.03
## 8 MILLER 7 1161437 393.74 3793.86 84.11 10.76 0.54 0.66 1.77 2.17
## 9 DAVIS 8 1116357 378.45 4172.31 62.2 31.6 0.49 0.82 2.45 2.44
## 10 RODRIGUEZ 9 1094924 371.19 4543.50 4.75 0.54 0.57 0.18 0.18 93.77
25.3.2 cleaning these up
In the last name data, there’s a problem. About 10% of people in the data have their (presumably rare) surnames collapsed into a single “ALL OTHER NAMES” row. We’ll just filter these out.
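Where does the “about 10%” figure come from? The prop100k column is a count per 100,000 people, so the catch-all row’s share can be read directly from the surnames object we built above (a small sketch):
# share of people whose surnames were collapsed into the catch-all row;
# prop100k is per 100,000, so dividing by 100,000 gives a proportion
surnames |>
  filter(name == "ALL OTHER NAMES") |>
  summarise(share = prop100k / 100000)   # roughly 0.099, i.e. about 10%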
Finally, we combine the new NewFirstName and NewLastName with our original HS data. We also replace the SHOUTING (uppercase) in the NewLastName variable with Title Case:
Did we do this right? Any other issues? How might you have done this better?
# re-read the surname file, this time dropping the catch-all row
surnames <- read.csv("data/Names_2010Census.csv") |>
  filter(name != "ALL OTHER NAMES")
slice(surnames, 1:10)
## name rank count prop100k cum_prop100k pctwhite pctblack pctapi pctaian pct2prace pcthispanic
## 1 SMITH 1 2442977 828.19 828.19 70.9 23.11 0.5 0.89 2.19 2.4
## 2 JOHNSON 2 1932812 655.24 1483.42 58.97 34.63 0.54 0.94 2.56 2.36
## 3 WILLIAMS 3 1625252 550.97 2034.39 45.75 47.68 0.46 0.82 2.81 2.49
## 4 BROWN 4 1437026 487.16 2521.56 57.95 35.6 0.51 0.87 2.55 2.52
## 5 JONES 5 1425470 483.24 3004.80 55.19 38.48 0.44 1 2.61 2.29
## 6 GARCIA 6 1166120 395.32 3400.12 5.38 0.45 1.41 0.47 0.26 92.03
## 7 MILLER 7 1161437 393.74 3793.86 84.11 10.76 0.54 0.66 1.77 2.17
## 8 DAVIS 8 1116357 378.45 4172.31 62.2 31.6 0.49 0.82 2.45 2.44
## 9 RODRIGUEZ 9 1094924 371.19 4543.50 4.75 0.54 0.57 0.18 0.18 93.77
## 10 MARTINEZ 10 1060159 359.40 4902.90 5.28 0.49 0.6 0.51 0.22 92.91
# then pull out a weighted sample of last names
sampled_last_names <- sample(surnames$name, size = nrows,
                             replace = TRUE, prob = surnames$count) |>
  as_tibble() |>
  rename(NewLastName = 1)
# attach the sampled names to the HS data and convert last names to Title Case
hsdata2 <- hsdata |>
  bind_cols(sampled_first_names) |>
  bind_cols(sampled_last_names) |>
  mutate(NewLastName = stringr::str_to_title(NewLastName))
slice(hsdata2, 1:10)
## # A tibble: 10 × 6
## FirstName LastName gpa sat NewFirstName NewLastName
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Aaliyah Guttadore 0.72 510 Tea Macdonald
## 2 Aaqil Beltran 2.72 400 Emma Tornberg
## 3 Aaqil Robbe 0.75 400 Olivia Pedraja
## 4 Aaron Taylor 3.93 470 Noemi Carter
## 5 Aaron Baca 3.5 610 Ruger Roberts
## 6 Aaron Johnston 2 360 Madelyn Alicea
## 7 Aasim Warner 4.43 600 Jeffrey Tucker
## 8 Aasima Kim 3.41 430 Caleb Anderson
## 9 Aasima Lee 4.46 440 Audrey Medina
## 10 Aasiya Dillard 3.57 450 Elizabeth Vazquez
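As a final sanity check (and a partial answer to “did we do this right?”), it’s worth confirming that no rows were gained or lost and that every student received a new first and last name. A small sketch using the objects above:
# the joined data should have exactly one row per student, with no missing names
nrow(hsdata2) == nrow(hsdata)

hsdata2 |>
  summarise(missing_first = sum(is.na(NewFirstName)),
            missing_last  = sum(is.na(NewLastName)))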