• preface
    • the role of the liberal arts in data science
    • some features of the text
    • the book is for you
  • Part I: Introduction
  • 1 what is “data science for the liberal arts”?
    • 1.1 the incompleteness of the data science Venn diagram
      • 1.1.1 additional domains
      • 1.1.2 an additional dimension
    • 1.2 the importance of data science for society
      • 1.2.1 intelligence, artificial intelligence (AI), and careers
      • 1.2.2 the challenge of TMI
    • 1.3 discussion: what are your objectives in data science?
  • 2 getting started
    • 2.1 are you already a programmer and statistician?
    • 2.2 spreadsheets: some best practices
    • 2.3 setting up your laptop: some basic tools
    • 2.4 a (modified) 15-minute rule
    • 2.5 installing R and RStudio desktop
  • 3 what R stands for …
    • 3.0.1 base R and packages
    • 3.1 cha-cha-cha-changes
    • 3.2 some technical characteristics
    • 3.3 finding help
  • 4 exploring R world
    • 4.1 go to the movies
    • 4.2 go into the clouds
    • 4.3 open the box
    • 4.4 go to (data)camp
    • 4.5 learn to knit
    • 4.6 read (parts of) another introductory book
    • 4.7 older approaches
      • 4.7.1 using Swirl
      • 4.7.2 reading/watching Roger Peng’s text and/or videos
  • Part II: Towards data literacy
  • 5 now draw the rest of the owl
    • 5.1 time for hands-on experience
      • 5.1.1 consider loading the tidyverse
      • 5.1.2 now explore the babynames package
      • 5.1.3 or the (ggplot2)movies package
    • 5.2 assignment
  • 6 literate programming
    • 6.1 projects are directories
    • 6.2 scripts are files of code
    • 6.3 R markdown / Quarto documents combine scripts with comments and (once knit) results
    • 6.4 some elements of coding style
    • 6.5 what to do when you are stuck
  • 7 principles of data visualization
    • 7.1 some opening thoughts
    • 7.2 some early graphs
    • 7.3 Tukey and EDA
    • 7.4 approaches to graphs
    • 7.5 Tufte: first principles
    • 7.6 the politics of data visualization
      • 7.6.1 poor design leads to an uninformed or misinformed world
      • 7.6.2 poor design can be a tool to deceive
    • 7.7 the psychology of data visualization
      • 7.7.1 the power of animation
      • 7.7.2 telling the truth when the truth is unclear
      • 7.7.3 visualizing uncertainty
    • 7.8 exercises
      • Smartphone sales
      • Stand your ground
      • Big businesses
    • 7.9 further reading and resources
    • 7.10 notes for next revision
  • 8 visualization in R with ggplot
    • 8.1 a picture > (words, numbers)?
    • 8.2 read Wickham’s opening chapter
    • 8.3 explore
  • 9 on probability and statistics
    • 9.1 on probability
    • 9.2 the rules of probability
      • 9.2.1 keeping conditional probabilities straight
    • 9.3 Bayes’ theorem
    • 9.4 continuous probability distributions
    • 9.5 dangerous equations
  • 10 reproducibility and the replication crisis
    • 10.1 answers to the reproducibility crisis
      • 10.1.1 partial answer 1: Tweak or abandon NHST
      • 10.1.2 partial answer 2: Keep a log of every step of every analysis in R markdown or Jupyter notebooks
      • 10.1.3 partial answer 3: Pre-registration of your research
    • 10.2 further readings
    • 10.3 notes for next revision
  • Part III: Towards data proficiency
  • 11 wrangling and tidying
    • 11.1 the structure of the tidyverse
    • 11.2 where should we eat?
      • 11.2.1 tall and wide formats
      • 11.2.2 exercises
    • 11.3 more on tidy coding
  • 12 finding, exploring, cleaning, and combining data
    • 12.1 Florida educational data
      • 12.1.1 a digression: Slash, Windows, and the world
      • 12.1.2 getting data from our machine into R
      • 12.1.3 which ones are “high schools”?
      • 12.1.4 can we compute district (county) means from these data?
      • 12.1.5 estimating school enrollments
    • 12.2 combining datasets
      • 12.2.1 challenges in joining datasets
      • 12.2.2 some approaches to fixing the data
      • 12.2.3 estimating the relationship between economic disadvantage and graduation rates
    • 12.3 recap: on joining files
  • 13 applied data science
    • 13.1 public health and COVID
      • 13.1.1 COVID data in 2025
      • 13.1.2 a brief digression on causality
      • 13.1.3 the excess mortality package
    • 13.2 other datasets in and beyond R
      • 13.2.1 make/extract/combine your own data
      • 13.2.2 keep it manageable
  • 14 strings, factors, dates, and times
    • 14.1 strings
    • 14.2 factors
      • 14.2.1 types of babies
      • 14.2.2 types of grown-ups
    • 14.3 dates
    • 14.4 times
  • 15 lists
  • 16 loops, functions, and beyond
    • 16.1 loops
    • 16.2 from loop to apply to purrr::map
    • 16.3 some examples of functions
      • 16.3.1 preliminaries
      • 16.3.2 the function
      • 16.3.3 applying the function
    • 16.4 how many bottles of what?
  • 17 from correlation to multiple regression
    • 17.1 bivariate analysis: Galton’s height data
      • 17.1.1 correlations based on small samples are unstable: A Monte Carlo demonstration
      • 17.1.2 from correlation to regression
    • 17.2 multivariate data
  • 18 cross-validation
    • 18.1 revisiting the affairs data
    • 18.2 avoiding capitalizing on chance
      • 18.2.1 splitting the data into training and test subsamples
    • 18.3 an example of cross-validated linear regression
      • 18.3.1 applying logistic regression analysis to the training data
  • 19 prediction and classification
    • 19.1 from regression to classification: selection of a threshold
      • 19.1.1 applying the model to the test data
      • 19.1.2 changing our decision threshold
      • 19.1.3 more confusion
      • 19.1.4 ROCs and AUC
    • 19.2 another approach to classification: k-nearest neighbor
      • 19.2.1 application: the affairs data
      • 19.2.2 from one doppelganger to many
      • 19.2.3 the Bayesian classifier
      • 19.2.4 back to the affairs data
      • 19.2.5 avoiding capitalizing on chance (again)
      • 19.2.6 the multinomial case
  • 20 machine learning: chihuahuas vs muffins, and other distinctions and ideas
    • 20.1 supervised versus unsupervised
    • 20.2 prediction versus classification
    • 20.3 understanding versus prediction
    • 20.4 bias versus variability
      • 20.4.1 resampling: beyond test, training, and validation samples
    • 20.5 compensatory versus non-compensatory problems
    • 20.6 a postscript: The Tidymodels packages
  • 21 working with text
    • 21.1 overview of key topics in text analysis
    • 21.2 a case study
      • 21.2.1 federal workers
      • 21.2.2 finding Reddit data
      • 21.2.3 some initial observations
      • 21.2.4 preprocessing
      • 21.2.5 comparing the words and stems
      • 21.2.6 construction of differential word clouds
      • 21.2.7 bigrams
      • 21.2.8 categories of words
      • 21.2.9 code for assessing LIWC effect sizes between two groups
    • 21.3 exercise: what would you do next?
  • 22 an introduction to networks
    • 22.1 introduction
      • 22.1.1 a simple example: networks and balance theory
    • 22.2 key network concepts
      • 22.2.1 centrality
      • 22.2.2 components and communities
      • 22.2.3 another example: it’s a small world
      • 22.2.4 static and dynamic networks
    • 22.3 some additional sources
  • 23 case study: the network structure of computational social science
    • 23.0.1 from citation network to structural network
    • 23.0.2 looking at the whole citation network
    • 23.0.3 the structural network: Centrality and community structure
    • 23.0.4 visualizing the network
    • 23.0.5 exploring the communities
  • 24 some ethical concerns for the data scientist
    • 24.1 ethics and personality harvesting
    • 24.2 the law of unintended consequences
    • 24.3 your privacy is my concern
    • 24.4 who should hold the digital keys?
    • 24.5 contact-tracing and COVID-19
    • 24.6 the digital divide
    • 24.7 still more case studies
    • 24.8 some potential remedies
    • 24.9 technology, change, and risk
  • 25 exercises
    • 25.1 generating correlated data (GPA and SAT) and putting it in a Google Sheet
      • 25.1.1 now look at your data in the spreadsheet
      • 25.1.2 using AI to help us here
      • 25.1.3 for last names, it’s a little bit trickier
      • 25.1.4 cleaning these up
    • 25.2 categorical probability and Venn diagrams
  • References

Data science for the liberal arts

Kevin Lanning

2026-02-09