7 on probability and statistics
We previously considered Anscombe (1973) and his quartet, and how visualizing data is valuable. This week, we move to a brief discussion of principles of statistics.
7.1 on probability
Discrete probability is used to understand the likelihood of categorical events. We can think of initial estimates of probability as subjective or personal. For some events (what is the probability this plane will crash?), an estimate of probability can be drawn from a base rate or relative frequency (e.g., p(this plane will crash) = number of flights with crashes / number of flights).
For other events (e.g., what is the probability that a US President will resign or be impeached before completing their term of office?), it may be hard to arrive at a suitable base rate. Here, a number of subjective beliefs or principles may be combined to arrive at a subjective or personal probability. In a sense, all probability estimates begin with a personal belief such as this, in part because the choice of the most informative base rate is often not self-evident - in the plane crash example, we might consider crash rates among all planes, all jets, and all carriers, or particular planes (Boeing 737 Max jets), particular carriers (United), or the intersection of some or all of these as well as other variables (United 737 Max airliners flying out of LaGuardia at night). There is no single answer to the plane crash estimate, in other words.
Similarly, a baseball manager, in considering whether a pinch hitter might be brought in to bat at a crucial spot in a game, might consider an omnibus batting average (effectively a relative frequency of hits/opportunities), batting average at night, against this pitcher, etc.
In general, there is not a correct answer to this “problem of the reference class” in part because a more precise reference group (737 Max planes, batting against a particular pitcher) is inherently based on a smaller sample of data, and is therefore less stable, than a broader, but coarser reference group upon which a probability estimate might also be based (Lanning 1987).
The personal origins of probability estimates should become less important as we are exposed to data and revise our estimates in accordance with Bayes’ theorem. But over the last 50 years, a substantial body of evidence has demonstrated that, under at least some circumstances, we don’t make estimates of probability in this way.
7.2 the rules of probability
Here’s an introduction to the principles of probability. These are presented, with examples and code, in an R markdown document at Harvard’s datasciencelabs repository.
I. For any event A, 0 <= P(A) <= 1
II. Let S be the sample space, or set of all possible outcomes. Then P(S) = 1, and P(not S) = 0.
III. If P(A and B) = 0, then P(A or B) = P(A) + P(B).
IV. P(A|B) = P(A and B)/ P(B)
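These four rules can be verified by brute-force enumeration on a small sample space. A minimal Python sketch, using one roll of a fair die (the die and the events A through D are my own illustrative choices, not from the text above):

```python
from fractions import Fraction

# Sample space for one roll of a fair die
S = {1, 2, 3, 4, 5, 6}

def p(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}          # the roll is even
B = {1, 2, 3}          # the roll is 3 or less
C, D = {1, 2}, {5, 6}  # two disjoint events

assert 0 <= p(A) <= 1                      # I:  bounded by 0 and 1
assert p(S) == 1 and p(set()) == 0         # II: P(S) = 1, P(not S) = 0
assert p(C | D) == p(C) + p(D)             # III: addition rule for disjoint events
assert p(A & B) / p(B) == Fraction(1, 3)   # IV: P(A|B) = P(A and B)/P(B)
```

The last line says: of the rolls {1, 2, 3}, only {2} is even, so P(even | 3 or less) = 1/3.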
Principle III applies for mutually exclusive events, such as A = you are in class this morning, B = you are at the beach this morning. For mutually exclusive (disjoint, disjunctive) events, the union is the sum of the two events. This is called the addition rule for disjoint events.
A different rule applies for events that are mutually independent, such as (A = I toss a coin and it lands on ‘Heads’) and (B = it will rain tomorrow). What we mean by independent is that our estimates of the probability of one don’t change based on the state of the other - your estimate of the likelihood of rain shouldn’t depend on my coin flip. Here, you multiply rather than add:
If P (A|B) = P (A), then P (A and B) = P(A) P(B).
In words - if the probability of A given B equals the probability of A, then the probability of both A and B equals the probability of A times the probability of B.
This multiplication rule is handy for estimating the probability of an outcome that happens following a chain of independent events, such as the probability that the next eight times I toss a coin it will land on “tails” every time:
P (TTTTTTTT) = P(T) P(T) P(T) P(T) P(T) P(T) P(T) P(T) = (.5)^8 = 1/256.
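This arithmetic can be checked directly; a minimal Python sketch (the `fractions` module keeps the result exact rather than a rounded decimal):

```python
from fractions import Fraction

p_tails = Fraction(1, 2)

# Multiplication rule for a chain of eight independent coin tosses:
# P(TTTTTTTT) = P(T) * P(T) * ... * P(T)
p_eight_tails = p_tails ** 8

print(p_eight_tails)         # 1/256
print(float(p_eight_tails))  # 0.00390625
```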
Many sets of events are neither disjoint nor independent, so we need more general ways of thinking about pairs of events. For most of us, Venn diagrams are useful to think about combining probabilities. The union or P(A U B) describes the probability that A, B, or both of these will occur. Here, you will use the general addition rule:
P(A or B) = P(A) + P(B) - P(A and B)
(the probability of A or B is the probability of A plus the probability of B minus the probability of both A and B).
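To see the general addition rule in action, consider drawing one card from a standard deck and asking for P(heart or face card). This example is my own, and it is checked by enumerating the whole deck:

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = set(product(ranks, suits))  # 52 cards

hearts = {c for c in deck if c[1] == 'hearts'}        # 13 cards
faces = {c for c in deck if c[0] in {'J', 'Q', 'K'}}  # 12 cards

def p(event):
    return Fraction(len(event), len(deck))

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
lhs = p(hearts | faces)
rhs = p(hearts) + p(faces) - p(hearts & faces)
print(lhs, rhs)  # 11/26 11/26
```

Without subtracting the overlap (the three face cards that are also hearts), we would double-count them: 13/52 + 12/52 = 25/52, not the correct 22/52 = 11/26.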
For the intersection or P(A ∩ B), we need to consider conditional probabilities. Think of the probability of two events sequentially: First, what’s the probability of A? Second, what’s the probability of B, given that A has occurred? Multiply these to get the likelihood of A and B:
P(A and B) = P(A) P(B|A).
Example: The probability of you and your roommate both getting COVID equals the probability of your getting COVID times the probability that your roommate gets it, given that you have it.
This is the general multiplication rule. In this abstract example, the order is irrelevant. To estimate the likelihood of A and B, we could as easily take the probability of B, and multiply it by the conditional probability of A given B
P(A and B) = P(B) P(A|B).
Use the COVID example again. What are A and B here? Does it still make sense? When might P (B|A) make more sense than P (A|B)?
We are often interested in estimating conditional probabilities, in which case we’ll use the same equation, but solve instead for P (A|B). This leads us back to principle IV:
IV. P(A|B) = P (A and B)/ P(B)
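Principle IV can also be read in terms of counts: among the cases where B occurred, what fraction also had A? A small sketch with invented counts (the numbers here are hypothetical, chosen only for illustration):

```python
from fractions import Fraction

n_total = 100    # total observations (invented)
n_B = 40         # observations where B occurred
n_A_and_B = 12   # observations where both A and B occurred

p_B = Fraction(n_B, n_total)
p_A_and_B = Fraction(n_A_and_B, n_total)

# Principle IV: P(A|B) = P(A and B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 3/10 -- equivalently, n_A_and_B / n_B = 12/40
```

Note that the total sample size cancels: conditioning on B simply shrinks the sample space to the 40 cases where B occurred.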
7.2.1 keeping conditional probabilities straight
In general, P (B|A) and P (A|B) are not equivalent (see the “did we just drive by a cop?” exercise at the end of this chapter). This is clear when we sketch them out using Venn diagrams - which should often be asymmetrical.
7.2.2 Bayes’ theorem
Bayes’ theorem is a tool for estimating one conditional probability from another. It’s often expressed in terms of Hypotheses (H) and Data (D). Assume that we are interested in the probability of a hypothesis in light of some data, or P(H|D).
This depends on the prior probability of the hypothesis, or how likely it was before the data were collected, or P(H). It also depends upon how likely the data are given two states of the world, that is, the probability of the data given that the hypothesis is true, or P (D|H), and the probability of the data given that the hypothesis is not true, or P (D|not H).
It may look complicated at first:
\[ P (H|D) = \frac{P(H)*P(D|H)}{P(H)* P(D|H) + (1-P(H))* P(D|not H)} \] But it is also both powerful and relevant to everyday life. We live in an uncertain world, and constantly form hypotheses about it. We form these hypotheses as questions…
Do pens get trashed on the whiteboard because of a soap film?
Am I allergic to shellfish?
Does studying at the last minute help my exam performance?
Does she like me?
Is my car safe in that parking lot?
Let’s look at the soap-film hypothesis as an example. I begin with an initial P(H), my prior probability (often a base rate) that soap could dry out pens. Say it’s .3:
P(H) = soap dries out the pen = .3
I try a new pen on the board, write with it for a few minutes, and ruin it. I think that, if my hypothesis is true, this outcome is likely (P (D|H) = .7). If my hypothesis is false, it is unlikely (P (D|not H) = .2). What is my posterior probability, or the revised estimate of the probability of the hypothesis in light of data?
\[ P (H|D) = \frac{(.3)(.7)}{(.3)(.7) + (.7)(.2)} \]
My new subjective probability is .6. Additional problems like this are in the exercises below.
| Initial hypothesis | P (H) | New data | P (D|H) | P (D|not H) | P (H|D) |
|---|---|---|---|---|---|
| Pens don’t last because of soap film on the whiteboard | .3 | A pen is quickly trashed | .7 | .2 | .6 |
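The whiteboard calculation generalizes to a short function. A minimal Python sketch (the function name `posterior` is my own); the numbers reproduce the pen example from the text:

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """Bayes' theorem: P(H|D) from a prior P(H) and two likelihoods."""
    numerator = p_h * p_d_given_h
    return numerator / (numerator + (1 - p_h) * p_d_given_not_h)

# The soap-film example: prior = .3, P(D|H) = .7, P(D|not H) = .2
print(round(posterior(0.3, 0.7, 0.2), 3))  # 0.6
```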
Bayes’ theorem is a normative model of probability - a model of how we should revise probability estimates - not a descriptive model. That is, the research suggests that people don’t typically revise probability estimates in this way, even though we arguably should - this is illustrated in some of the problems in the final set of exercises on “decision making under uncertainty.”
7.3 continuous probability distributions
We can also use probability with continuous variables such as systolic blood pressure (the first number in a blood-pressure reading), which has a mean of approximately 120 and a standard deviation of 15. Continuous probability distributions are handy tools for thinking about the meaning of scores, particularly when we express scores in standard deviations from the mean (z scores). More to the point, this way of thinking about probability is widely used in questions of scientific inference, as, for example, in testing hypotheses such as “the average systolic blood pressure among a group of people studying at a coffee shop (hence caffeinated) will be significantly greater than that of the population as a whole.”
This is part of the logic of Null Hypothesis Significance Testing (NHST) - if the result in my coffee shop sample is sufficiently high, then I say that I have rejected the null hypothesis, and found data which are consistent with the hypothesis of interest.
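Python’s standard library can turn a blood-pressure reading into a z score and a tail probability. A sketch using the mean (120) and standard deviation (15) from the text; the cutoff of 135 is my own illustrative choice:

```python
from statistics import NormalDist

# Model systolic blood pressure as normal with mean 120, SD 15
bp = NormalDist(mu=120, sigma=15)

x = 135
z = (x - bp.mean) / bp.stdev  # z score: SDs above the population mean
p_above = 1 - bp.cdf(x)      # probability of a reading above x

print(z)                  # 1.0
print(round(p_above, 3))  # 0.159
```

A reading one standard deviation above the mean is not unusual: about 16% of the population scores at least that high, which is why NHST demands a much smaller tail probability before rejecting the null.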
7.4 dangerous equations
Just as Tufte (2001) demonstrated that poor data visualizations can be dangerous, leading, for example, to the loss of life in the Challenger disaster, Wainer (2007) shows that a lack of statistical literacy is also “dangerous.”
Wainer cites three specific examples of important, yet widely misunderstood, statistical laws. The first of these is de Moivre’s equation, which shows that the variability of a sample mean decreases with the square root of sample size. Because the variability of a sample mean decreases with the size of that sample, small samples tend to produce extreme scores. For example, the counties with the highest and lowest rates of kidney cancer (or most other unexplained health measures) will be sparsely populated, typically rural places.
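In symbols, de Moivre’s equation gives the standard error of a sample mean in terms of the population standard deviation and the sample size n:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

Quadrupling the sample size halves the variability of the mean, which is why small counties can land at both extremes of the kidney-cancer distribution.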
For Wainer, a second form of statistical illiteracy is the failure to understand the complex interdependencies that arise in multiple regression analysis, in particular, how coefficients may change or even reverse in sign when new variables are added as predictors.
Wainer’s third example of statistical illiteracy is the failure to appreciate regression to the mean. I consider this to be the most dangerous form of statistical illiteracy, in part because regression effects contribute to an overestimate of the effectiveness of punishment and an under-appreciation of the effectiveness of positive reinforcement as tools for behavior change (Hastie and Dawes 2010).
7.5 exercises
7.5.1 did we just drive by a cop?
In 2024, the Florida Highway Patrol won a national competition for “best looking cruiser.” The winning car was a Dodge Charger.

Not all FHP cruisers are Dodge Chargers, but some are. Assume that there are 8 million registered cars in Florida, that all cars (including all FHP cruisers) are registered, and that 80,000 of these (or 1% of all cars) are Dodge Chargers.
On the basis of the above information, if you see a Dodge Charger on the road, can you compute the probability that it is an FHP cruiser (i.e., P(FHP cruiser | Dodge Charger))?
If you can compute this, what is the probability? If you cannot, what is the minimum additional information you would need to compute P(FHP cruiser | Dodge Charger)?
Provide a reasonable estimate of this additional value, then compute P(FHP cruiser | Dodge Charger).
Working with your own numbers, what is P(Dodge Charger | FHP cruiser)?
How confident are you in these results? Are there any additional assumptions that you might make that would make you more confident about your results?
Sketch out a Venn Diagram that accurately reflects the relationships you described here.
Use R to generate this, using a package such as Venn or VennDiagram.
Look at your figure. In general, if P (A|B) < P (B|A), what must be true of the relationship of P (A) to P (B)?
7.5.2 practice applying Bayes’ Theorem
Here are some more problems like the ‘whiteboard/soap film’ hypothesis described above. Complete the following table using Bayes’ Theorem:
| Initial hypothesis | P (H) | New data | P (D|H) | P (D|not H) | P (H|D) |
|---|---|---|---|---|---|
| I am allergic to shellfish | .02 | Rash after eating clams | .9 | .01 | ? |
| Studying at the last minute helps my exam performance | .6 | I failed | .2 | .7 | ? |
| She likes me | .3 | She gave me a present | .6 | .3 | ? |
| My car is safe in that parking lot | .3 | I just heard a car alarm | .4 | .4 | ? |
| add your own (better!) example here | (??) | (??) | (??) | (??) | ? |
7.5.3 decision making under uncertainty
Try to solve each of the following. You may or may not have all of the information you need. Remember the 15 minute rule.
- A cab was involved in a hit-and-run accident at night and two cab companies, the Green and the Blue, operate in the neighborhood in which the accident occurred. Of the cabs in the neighborhood, 85% are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the accuracy of the witness under the same circumstances that existed on the night of the accident and the witness correctly identified each of the two colors of cabs 80% of the time. What is the probability that the cab was Blue?
- Imagine two giant jars are each filled with thousands of jelly beans. In the first jar, 70% of the jelly beans are red and the rest are blue. In the second jar, 70% are blue and the rest are red. Suppose one jar is chosen, at random, and 12 jelly beans are taken from it: 8 blue jelly beans and 4 red jelly beans. What are the chances that the 12 jelly beans were taken from the jar with mostly red jelly beans?
- You are manager of a baseball team. It is the bottom of the ninth inning, there are two outs, and you are losing by one run. You will lose the game if the next batter makes an out. But because there are base runners, you will win the game if the batter gets a hit. You can choose one of two batters: Smith has an overall batting average of .320 in 400 times at bat, but has batted only .250 in 20 plate appearances against this pitcher. Jones has an overall batting average of only .250 in 400 times at bat, but has batted .320 in 20 plate appearances against this pitcher. Who should you choose?
- James grew up in a Bohemian family. His father was a musician, and his mother was a painter. They lived together for 40 years and never got married. James was a very talented child with a special gift for comedy, but he turned into a rebellious troublemaker in his youth. He dropped out of college after two years and traveled to Asia to learn crafts. James is now 35 years old. Of 100 people like James, how many are Republicans? How many are Artists? How many are Republican Artists?
- “Steve is very shy and withdrawn, invariably helpful, but with little interest in people, or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail.” What is the probability that Steve is a Farmer? a Salesman? an Airline Pilot? a Librarian? a Physician?