preface
the role of the liberal arts in data science
some features of the text
the book is for you
Part I: Introduction
1
what is “data science for the liberal arts?”
1.1
the incompleteness of the data science Venn diagram
1.1.1
additional domains
1.1.2
an additional dimension
1.2
the importance of data science for society
1.2.1
intelligence, artificial intelligence (AI), and careers
1.2.2
the challenge of TMI
1.3
discussion: what are your objectives in data science?
2
getting started
2.1
are you already a programmer and statistician?
2.2
spreadsheets - some best practices
2.3
setting up your laptop: some basic tools
2.4
a (modified) 15-minute rule
2.5
installing R and RStudio desktop
3
what R stands for …
3.0.1
base R and packages
3.1
cha-cha-cha-changes
3.2
some technical characteristics
3.3
finding help
4
exploring R world
4.1
go to the movies
4.2
go into the clouds
4.3
open the box
4.4
go to (data)camp
4.5
learn to knit
4.6
read (parts of) another introductory book
4.7
older approaches
4.7.1
using Swirl
4.7.2
reading/watching Roger Peng’s text and/or videos
Part II: Towards data literacy
5
now draw the rest of the owl
5.1
time for hands-on experience
5.1.1
consider loading the tidyverse
5.1.2
now explore the babynames package
5.1.3
or the (ggplot2)movies package
5.2
assignment
6
literate programming
6.1
projects are directories
6.2
scripts are files of code
6.3
R markdown / Quarto documents combine scripts with comments and (once knit) results
6.4
some elements of coding style
6.5
what to do when you are stuck
7
principles of data visualization
7.1
some opening thoughts
7.2
some early graphs
7.3
Tukey and EDA
7.4
approaches to graphs
7.5
Tufte: first principles
7.6
the politics of data visualization
7.6.1
poor design leads to an uninformed or misinformed world
7.6.2
poor design can be a tool to deceive
7.7
the psychology of data visualization
7.7.1
the power of animation
7.7.2
telling the truth when the truth is unclear
7.7.3
visualizing uncertainty
7.8
exercises
Smartphone sales
Stand your ground
Big businesses
7.9
further reading and resources
7.10
notes for next revision
8
visualization in R with ggplot
8.1
a picture > (words, numbers)?
8.2
Read Wickham’s opening chapter
8.3
explore
9
on probability and statistics
9.1
on probability
9.2
the rules of probability
9.2.1
keeping conditional probabilities straight
9.3
Bayes’ theorem
9.4
continuous probability distributions
9.5
dangerous equations
10
reproducibility and the replication crisis
10.1
answers to the reproducibility crisis
10.1.1
partial answer 1: Tweak or abandon NHST
10.1.2
partial answer 2: Keep a log of every step of every analysis in R markdown or Jupyter notebooks
10.1.3
partial answer 3: Pre-registration of your research
10.2
further readings
10.3
notes for next revision
Part III: Towards data proficiency
11
wrangling and tidying
11.1
the structure of the tidyverse
11.2
where should we eat?
11.2.1
tall and wide formats
11.2.2
exercises
11.3
more on tidy coding
12
finding, exploring, cleaning, and combining data
12.1
florida educational data
12.1.1
a digression: Slash, Windows and the world
12.1.2
getting data from our machine into R
12.1.3
which ones are “high schools”?
12.1.4
can we compute district (county) means from these data?
12.1.5
estimating school enrollments
12.2
combining datasets
12.2.1
challenges in joining datasets
12.2.2
some approaches to fixing the data
12.2.3
estimating the relationship between economic disadvantage and graduation rates
12.3
recap / on joining files
13
applied data science
13.1
public health and covid
13.1.1
COVID data in 2025
13.1.2
a brief digression on causality
13.1.3
the excess mortality package
13.2
other datasets in and beyond R
13.2.1
make/extract/combine your own data
13.2.2
keep it manageable
14
strings, factors, dates, and times
14.1
strings
14.2
factors
14.2.1
types of babies
14.2.2
types of grown-ups
14.3
dates
14.4
times
15
lists
16
loops, functions, and beyond
16.1
loops
16.2
from loop to apply to purrr::map
16.3
some examples of functions
16.3.1
preliminaries
16.3.2
the function
16.3.3
applying the function
16.4
how many bottles of what?
17
from correlation to multiple regression
17.1
bivariate analysis: Galton’s height data
17.1.1
correlations based on small samples are unstable: A Monte Carlo demonstration
17.1.2
from correlation to regression
17.2
multivariate data
18
cross-validation
18.1
revisiting the affairs data
18.2
avoiding capitalizing on chance
18.2.1
splitting the data into training and test subsamples
18.3
an example of cross-validated linear regression
18.3.1
applying logistic regression analysis to the training data
19
prediction and classification
19.1
from regression to classification: selection of a threshold
19.1.1
applying the model to the test data
19.1.2
changing our decision threshold
19.1.3
more confusion
19.1.4
ROCs and AUC
19.2
another approach to classification: k-nearest neighbor
19.2.1
application: the affairs data
19.2.2
from one doppelganger to many
19.2.3
the Bayesian classifier
19.2.4
back to the affairs data
19.2.5
avoiding capitalization on chance (again)
19.2.6
the multinomial case
20
machine learning: chihuahuas vs muffins, and other distinctions and ideas
20.1
supervised versus unsupervised
20.2
prediction versus classification
20.3
understanding versus prediction
20.4
bias versus variability
20.4.1
resampling: beyond test, training, and validation samples
20.5
compensatory versus non-compensatory problems
20.6
a postscript: The Tidymodels packages
21
working with text
21.1
overview of key topics in text analysis
21.2
a case study
21.2.1
federal workers
21.2.2
finding Reddit data
21.2.3
some initial observations
21.2.4
preprocessing
21.2.5
comparing the words and stems
21.2.6
construction of differential word clouds
21.2.7
bigrams
21.2.8
categories of words
21.2.9
Code for assessing LIWC effect sizes between 2 groups
21.3
exercise: what would you do next?
22
an introduction to networks
22.1
introduction
22.1.1
a simple example: networks and balance theory
22.2
key network concepts
22.2.1
centrality
22.2.2
components and communities
22.2.3
another example: it’s a small world
22.2.4
static and dynamic networks
22.3
some additional sources
23
case study: the network structure of computational social science
23.0.1
from citation network to structural network
23.0.2
looking at the whole citation network
23.0.3
the structural network: Centrality and community structure
23.0.4
visualizing the network
23.0.5
exploring the communities
24
some ethical concerns for the data scientist
24.1
ethics and personality harvesting
24.2
the law of unintended consequences
24.3
your privacy is my concern
24.4
who should hold the digital keys?
24.5
contact-tracing and COVID-19
24.6
the digital divide
24.7
still more case studies
24.8
some potential remedies
24.9
technology, change, and risk
25
exercises
25.1
generating correlated data (GPA and SAT) and putting it in a Google Sheet
25.1.1
now look at your data in the spreadsheet
25.1.2
using AI to help us here
25.1.3
for last names, it’s a little bit trickier
25.1.4
cleaning these up
25.2
categorical probability and Venn diagrams
References
Data science for the liberal arts