Probability, Uncertainty, Distributions, and Bayesian Thinking
The world rarely hands us certainty. A weather forecast says "70% chance of rain." A doctor says a test is "95% accurate." A friend says their cousin made a fortune on a stock, so it must be a good bet. To live and decide well, you need a language for uncertainty — and that language is probability. This chapter builds that language from the ground up, then shows you the single most useful skill in all of statistics: how to update what you believe when new evidence shows up.
In the foundations chapter you met the idea of summarizing data — averages, spread, the shape of a pile of numbers. Here we go one step further. Instead of describing what already happened, we ask: how likely is something, and how should I change my mind as I learn more?
11.1 Distributions: the shape of a pile of data
Before probability, we need a picture. When you collect many measurements — heights, salaries, test scores, daily sales — they form a distribution.
- Distribution
- The pattern of how often each value (or range of values) shows up in your data.
- Histogram
- A bar chart of a distribution. You sort values into buckets and the bar height shows how many landed in each bucket.
The histogram is the foundational picture of statistics. Always look at it first, because the same average can hide wildly different shapes.
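To make the bucketing concrete, here is a minimal Python sketch of a text histogram. The `histogram` helper, the bucket width of 10, and the test scores are all invented for illustration; they do not come from any real dataset.

```python
from collections import Counter

def histogram(values, bucket_width):
    """Count how many values land in each bucket of the given width."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical test scores, invented for illustration.
scores = [52, 58, 61, 64, 67, 68, 71, 73, 74, 75, 77, 78, 82, 85, 93]
buckets = histogram(scores, bucket_width=10)

# Draw one row per bucket; the bar length is the count in that bucket.
for start, count in buckets.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

The tallest bar sits over the 70s, which is something the average alone would not tell you.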
The normal distribution (the bell curve)
The most famous shape is the normal distribution — symmetric, with one peak in the middle. Most values cluster near the center; very few sit far out at the edges.
[Figure: Normal distribution (bell curve). Vertical axis: count. Horizontal axis: value, marked at -3 SD, -2 SD, -1 SD, mean, +1 SD, +2 SD, +3 SD.]
The bell curve has a beautiful, reliable rule called the 68–95–99.7 rule. "SD" means standard deviation — the typical distance of a value from the average (covered in the foundations chapter).
- About 68% of values fall within 1 SD of the mean.
- About 95% fall within 2 SD.
- About 99.7% fall within 3 SD.
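The three percentages can be checked with Python's standard library. This is a small sketch using `statistics.NormalDist`; the helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

Because the rule is stated in units of SD, the same shares hold for any normal distribution, whatever its mean and spread.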
Skew: when the bell tips over
Skew is lopsidedness. A distribution is right-skewed when it has a long tail stretching to the right, and left-skewed when the tail stretches left.
| Shape | What it looks like | Real-world example |
|---|---|---|
| Right-skewed | Long tail to the right; mean > median | Income, wealth, wait times, house prices |
| Left-skewed | Long tail to the left; mean < median | An easy exam where most people score high |
| Symmetric (normal) | Balanced both sides; mean ≈ median | Adult heights, measurement errors |
| Bimodal | Two separate peaks | Two hidden groups mixed together |
A bimodal distribution — two peaks — is a useful warning sign. It often means two different groups got mixed together (for example, the heights of men and women combined). When you see two humps, your instinct should be: "I should probably split this data into two groups."
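The mean-versus-median pattern in the skew table is easy to see in a few lines of Python. The incomes below are made up for illustration: nine ordinary salaries and one very large one.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands, invented for illustration.
incomes = [28, 31, 33, 35, 38, 40, 42, 45, 50, 400]

income_mean = mean(incomes)      # pulled upward by the single large value
income_median = median(incomes)  # the middle of the sorted list

print(f"mean = {income_mean}, median = {income_median}")
```

One long-tail value drags the mean to 74.2 while the median stays at 39, which is why the median is often the safer summary for right-skewed data.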
11.2 Probability: a number for "how likely"
Probability is a number between 0 and 1 that measures how likely an event is. 0 means impossible, 1 means certain, 0.5 means "as likely as not."
Three rules cover most everyday probability:
- All outcomes sum to 1. If there's a 30% chance of rain, there's a 70% chance of no rain. The possibilities must add up to 100%.
- "Either/or" events: add. The chance of rolling a 1 or a 2 on a die is 1/6 + 1/6 = 2/6 (when the outcomes can't both happen at once).
- "Both" independent events: multiply. The chance of flipping two coins and getting heads and then heads again is 0.5 × 0.5 = 0.25.
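The three rules can be written out as exact arithmetic. The sketch below uses Python's `fractions.Fraction` so that nothing is lost to rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Rule 1: all outcomes sum to 1, so P(no rain) = 1 - P(rain).
p_rain = Fraction(3, 10)
p_no_rain = 1 - p_rain

# Rule 2: "either/or" events that cannot both happen add.
p_one = Fraction(1, 6)
p_two = Fraction(1, 6)
p_one_or_two = p_one + p_two

# Rule 3: independent "both" events multiply.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

print(p_no_rain, p_one_or_two, p_two_heads)
```

Note that `Fraction` reduces 2/6 to its simplest form, 1/3.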
Independence and the Gambler's Fallacy
- Independent events
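- Two events are independent if one happening doesn't change the odds of the other.
Coin flips are the classic case. The gambler's fallacy is the belief that after a run of heads, tails is somehow "due." It isn't: the coin has no memory, so the next flip is still 50/50.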
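One way to see what independence means for coin flips is simply to count. The sketch below, which assumes a fair coin, lists every possible run of six flips, keeps only the runs that open with five heads, and asks how often the sixth flip is heads.

```python
from fractions import Fraction
from itertools import product

# Every equally likely sequence of six fair coin flips.
sequences = list(product("HT", repeat=6))

# Keep only the sequences that open with five heads in a row.
after_streak = [s for s in sequences if s[:5] == ("H",) * 5]

# Among those, how often is the sixth flip heads?
sixth_heads = [s for s in after_streak if s[5] == "H"]
p_heads_after_streak = Fraction(len(sixth_heads), len(after_streak))

print(p_heads_after_streak)
```

The result is exactly 1/2. A streak tells you nothing about the next flip of a fair coin.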
Conditional probability: when knowing one thing changes the odds
Not all events are independent. Often, knowing that one thing happened changes how likely something else is. This is conditional probability, written P(A|B) — read aloud as "the probability of A given B."
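A hedged illustration with two dice: conditioning just means throwing away the outcomes where B did not happen, then counting A among what is left. The `probability` helper and the two event functions are hypothetical names chosen for this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
rolls = list(product(range(1, 7), repeat=2))

def probability(event, given=lambda roll: True):
    """P(event | given), found by counting outcomes that satisfy the condition."""
    conditioned = [r for r in rolls if given(r)]
    matching = [r for r in conditioned if event(r)]
    return Fraction(len(matching), len(conditioned))

def total_at_least_10(roll):
    return sum(roll) >= 10

def first_die_is_6(roll):
    return roll[0] == 6

p_a = probability(total_at_least_10)                          # P(A)
p_a_given_b = probability(total_at_least_10, first_die_is_6)  # P(A|B)

print(p_a, p_a_given_b)
```

Knowing the first die shows a 6 lifts the chance of a total of 10 or more from 1/6 to 1/2, so these two events are not independent.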
Conditional probability is the doorway to the most important idea in this chapter.
11.3 Bayes' theorem: updating beliefs with evidence
This is the conceptual heart of the whole subject, so we'll go slowly. Bayes' theorem is simply a rule for updating a belief when new evidence arrives. Named after Thomas Bayes and developed by Pierre-Simon Laplace, it answers: "Given what I knew before, and given this new clue, what should I believe now?"
- Prior
- Your belief before seeing the new evidence — often just the base rate, "how common is this thing in general?"
- Posterior
- Your updated belief after folding in the evidence.
The rare-disease test (work through this carefully)
This is the single example that makes Bayesian thinking click. It catches almost everyone — including most doctors.
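Suppose a disease affects 1 in 1,000 people. A test for it catches essentially every real case, but it also flags 5% of healthy people by mistake. You test positive. What is the chance you are actually sick?
Most people answer "about 95%." The real answer is around 2%. Here's why, and the trick is to think in terms of actual people, not percentages.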
Imagine 100,000 people get tested.
100,000 people tested
|
|-- 100 actually SICK (1 in 1,000)
|     `-- ~100 test positive (true positives)
|
`-- 99,900 actually HEALTHY
      `-- 5% test positive = ~4,995 false positives
Total positives = 100 + 4,995 = 5,095
Truly sick among positives = 100 / 5,095 ≈ 2%
The disease is so rare that the tiny sick group produces only about 100 true positives, while the enormous healthy group — even with just a 5% error rate — produces nearly 5,000 false alarms. So a positive result is far more likely to be a false alarm than a real case. Your posterior belief (~2%) is much lower than the test's "95% accuracy" suggested, because your prior (1 in 1,000) was very low.
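The head count above is Bayes' theorem in disguise. As a rough sketch, the same calculation in Python looks like this. The function and parameter names are illustrative, and `sensitivity=1.0` encodes the assumption from the tree that essentially every sick person tests positive.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(sick | positive test), computed with Bayes' theorem."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

posterior = posterior_given_positive(
    prior=1 / 1000,            # base rate: 1 in 1,000 people are sick
    sensitivity=1.0,           # assumption: the test catches every real case
    false_positive_rate=0.05,  # 5% of healthy people test positive anyway
)

print(f"P(sick | positive) = {posterior:.1%}")
```

Try raising the prior to 1 in 10: the same test then gives a posterior of roughly 69%, which shows how much the base rate drives the answer.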
This isn't just about medicine. It's how a detective weighs a clue, how a spam filter judges an email, and how you should react to a single dramatic headline. Even strong evidence for a claim that started out very unlikely still leaves you fairly uncertain.
11.4 From sample back to population
We almost never measure everyone. We measure a sample (a subset) and use it to talk about the whole population (everyone we care about).
Think of tasting soup: one spoonful can tell you about the whole pot, but only if the pot is well stirred. "Well stirred" means random sampling: every member of the population has a known chance of being picked, so the sample mirrors the whole.
When the spoonful lies: sampling bias
- Sampling bias
- When your method of choosing the sample systematically over- or under-represents some group.
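A toy example shows why a bigger sample does not rescue a biased method. Everything below is invented for illustration: a population of 1,000 people in two groups, and a hypothetical web-only survey that can reach just one of them.

```python
from statistics import mean

# Hypothetical population: (group, hours of daily screen time).
population = [("online", 6.0)] * 700 + [("offline", 2.0)] * 300
true_mean = mean([hours for _, hours in population])

# A web-only survey reaches only the "online" group, however many it samples.
reachable = [hours for group, hours in population if group == "online"]
web_sample = reachable[:500]
biased_mean = mean(web_sample)

print(f"true mean = {true_mean}, web survey mean = {biased_mean}")
```

The survey reports 6.0 hours against a true average of 4.8, and sampling all 700 reachable people instead of 500 would give exactly the same wrong answer.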
Survivorship bias: counting only the survivors
Survivorship bias is everywhere: studying only successful founders to learn "the habits of success" ignores the identical-habit founders who failed. The losers aren't in the room.
The sampling distribution and the Central Limit Theorem
Here's a subtle but powerful idea. If you took many different samples and computed the average of each, those averages would themselves form a distribution — the sampling distribution.
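This leads to one of the most remarkable results in statistics, the Central Limit Theorem (CLT): when each sample is large enough, the sampling distribution of the average is approximately a bell curve, even if the underlying data are skewed or otherwise far from normal. The common "n ≥ 30" threshold is a rule of thumb, not a law.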
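The CLT's claim, that averages of many samples pile up in a bell shape even when the raw data do not, can be explored with a short simulation. This is only a sketch: the exponential distribution, the sample size of 50, the 2,000 repetitions, and the fixed seed are all arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Average of n draws from a right-skewed (exponential) distribution."""
    return mean([rng.expovariate(1.0) for _ in range(n)])

# Take 2,000 samples of 50 values each and record each sample's average.
sample_means = [sample_mean(50) for _ in range(2000)]

center = mean(sample_means)
spread = stdev(sample_means)

# For a bell curve, roughly 68% of values sit within 1 SD of the center.
inside = [m for m in sample_means if abs(m - center) <= spread]
share_within_1sd = len(inside) / len(sample_means)

print(f"center = {center:.3f}, spread = {spread:.3f}, "
      f"share within 1 SD = {share_within_1sd:.1%}")
```

The individual draws are strongly right-skewed, yet the share of sample averages within one SD of their center should land close to the 68% a bell curve predicts.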
Standard deviation vs standard error — don't confuse them
These two are mixed up constantly, even by professionals.
- Standard deviation (SD)
- The spread of the individual data points — how far typical people sit from the average.
- Standard error (SE)
- The spread of the sample averages — how much your estimate of the average wobbles from sample to sample.
SD describes the spread of people; SE describes the spread of averages-of-people. Crucially, SE shrinks as your sample grows (roughly SD divided by the square root of n). Bigger samples give steadier, more trustworthy estimates of the average — even though the underlying people are just as varied as ever.
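The "SD divided by the square root of n" relationship is quick to tabulate. In this sketch the SD of 10 and the sample sizes are hypothetical, and `standard_error` is an illustrative helper name.

```python
from math import sqrt

def standard_error(sd, n):
    """Approximate standard error of a sample average: SD / sqrt(n)."""
    return sd / sqrt(n)

sd = 10  # hypothetical spread of individual values; it does not change with n
errors = {n: standard_error(sd, n) for n in (25, 100, 400)}

for n, se in errors.items():
    print(f"n = {n:>3}: SD = {sd}, SE = {se}")
```

Each fourfold increase in sample size only halves the standard error, which is why precision gets expensive quickly.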
11.5 Inference: estimates, confidence, and significance
Once you accept that a sample is a noisy peek at a population, you face two jobs: estimate a quantity (with a range around it), or decide whether an effect is real.
Confidence intervals: report a range, not a point
- Confidence interval (CI)
- A range built by a method that captures the true value a stated percentage of the time (e.g., 95%).
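A common textbook recipe for a 95% interval around an average is the sample mean plus or minus about 1.96 standard errors. The sketch below assumes that normal approximation (reasonable for large samples; small samples usually call for a t-based interval instead), and the summary numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a population average."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    margin = z * sample_sd / sqrt(n)           # z times the standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical survey summary: 100 people, mean 50, SD 10.
low, high = confidence_interval(sample_mean=50, sample_sd=10, n=100)

print(f"95% CI: {low:.2f} to {high:.2f}")
```

With these numbers the interval runs from about 48.04 to 51.96. Quadrupling the sample to 400 would halve its width.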
Hypothesis testing and the p-value
The other job is deciding whether an effect is real. The standard machinery starts with a null hypothesis.
- Null hypothesis
- The default assumption that there is "no effect" or "no difference."
We then ask how surprising the observed data would be if the null hypothesis were true. That measure of surprise is the p-value.
- p-value
- The probability of seeing data at least this extreme if the null hypothesis were true.
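For a coin, this definition can be computed exactly. In the sketch below the null hypothesis is "the coin is fair," the observed data are a hypothetical 9 heads in 10 flips, and "at least this extreme" is read one-sidedly as "9 or more heads."

```python
from fractions import Fraction
from math import comb

def p_value_at_least(heads, flips):
    """Chance of `heads` or more heads in `flips` tosses of a fair coin."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return Fraction(favourable, 2 ** flips)

p_value = p_value_at_least(9, 10)

print(p_value, float(p_value))
```

The result is 11/1024, about 0.011: a fair coin would produce a result this lopsided roughly once in a hundred tries. That is a statement about the data under the null hypothesis, not about the chance the coin is fair.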
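The p-value is one of the most misunderstood numbers in all of science. The American Statistical Association issued a formal statement in 2016 to address its widespread misuse. Here is what a p-value is not: it is not the probability that the null hypothesis is true, it is not the probability that the result arose by chance alone, and it says nothing about how large or important an effect is.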
Statistical significance — and its limits
Statistical significance means the p-value fell below a pre-chosen threshold, called alpha (α), commonly 0.05. Below it, we "reject the null."
| | Effect is really there | No real effect |
|---|---|---|
| Test says "yes" | Correct (good) | Type I error (false positive) |
| Test says "no" | Type II error (false negative) | Correct (good) |
Power is the probability that a test detects a real effect when there is one; a Type II error is what happens when it misses. Larger samples generally give more power, because they make a real effect easier to spot above the noise.
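To see how sample size feeds power, here is a simplified sketch. It assumes a one-sided z-test with a known SD, which is a textbook idealization rather than a general recipe, and the effect of 2 units against an SD of 10 is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect, sd, n, alpha=0.05):
    """Chance a one-sided z-test rejects the null when the true effect is `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # rejection threshold under the null
    shift = effect / (sd / sqrt(n))           # true effect measured in standard errors
    return 1 - NormalDist().cdf(z_crit - shift)

powers = {n: power_one_sided_z(effect=2, sd=10, n=n) for n in (10, 40, 160)}

for n, power in powers.items():
    print(f"n = {n:>3}: power = {power:.0%}")
```

The same real effect is usually missed with 10 observations and usually caught with 160. When the true effect is zero, the function returns alpha itself, which is the Type I error rate.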
- Effect size
- How big the effect is — separate from whether it's statistically significant.
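The gap between effect size and significance shows up clearly when the same tiny effect is tested at two sample sizes. As before, this sketch assumes a one-sided z-test with a known SD, and the numbers (an effect of 0.1 units against an SD of 10) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(observed_effect, sd, n):
    """p-value for a one-sided z-test of 'no effect', assuming a known SD."""
    z = observed_effect / (sd / sqrt(n))
    return 1 - NormalDist().cdf(z)

# The same tiny effect at two very different sample sizes.
p_small_sample = one_sided_p_value(0.1, sd=10, n=100)
p_huge_sample = one_sided_p_value(0.1, sd=10, n=250_000)

print(f"n = 100:     p = {p_small_sample:.3f}")
print(f"n = 250,000: p = {p_huge_sample:.2e}")
```

The effect is identical in both runs, one hundredth of an SD. Only the sample size changed, yet the second result is overwhelmingly "significant." Whether an effect that small matters is a separate question that the p-value cannot answer.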
11.6 Quick reference: the uncertainty toolkit
| Concept | One-line meaning | Watch out for |
|---|---|---|
| Normal distribution | Symmetric bell curve; 68–95–99.7 rule | Not all data is normal |
| Skew | Lopsided data; right-skew → mean > median | Use median for skewed data |
| Probability | Likelihood from 0 to 1 | Add for "or," multiply for independent "and" |
| Independence | One event doesn't affect the other | Gambler's fallacy |
| Conditional P(A|B) | Odds of A given B happened | The condition changes the odds |
| Bayes / prior | Updated belief = prior belief adjusted by the evidence | Base-rate neglect |
| Sampling bias | Sample misrepresents the population | Size doesn't fix bias |
| CLT | Sample averages tend toward a bell curve | "n≥30" is a heuristic, not a law |
| SD vs SE | Spread of people vs wobble of the average | Constantly confused |
| Confidence interval | Plausible range for the true value | It's a property of the method |
| p-value | How surprising the data is if null is true | Not the chance the hypothesis is true |
| Effect size | How big the effect is | Significance ≠ importance |
11.7 Putting it together
Uncertainty isn't a flaw in your knowledge to be ashamed of — it's a quantity to measure and reason with. The skilled thinker doesn't flip between "I'm certain" and "I have no idea." Instead they hold a belief with a confidence level, and they move that confidence up or down as evidence arrives.
Three habits carry most of the value of this chapter:
- Plot the data and look at its shape before trusting any single number. Skew and bimodality hide inside innocent-looking averages.
- Start from the base rate when judging any test, alarm, or surprising result. A scary positive against a rare condition is usually a false alarm — that's just Bayes' theorem at work.
- Report ranges, not points; effect sizes, not just significance. Treat statistics as a tool for deciding under uncertainty, never as a machine that prints "truth."
In the next chapter we turn from single variables and beliefs to relationships between variables: correlation, the famous trap of mistaking it for causation, and the fallacies and chart tricks that fool even careful readers.