Inference, Correlation vs Causation, and How Statistics Mislead (Advanced)
In the last two chapters you learned to describe data (means, spreads, distributions) and to reason about chance (probability, Bayes). This chapter is the payoff. It is about two things that decide whether a number deserves your trust:
- Inference — using a small sample to make honest claims about a large population, and saying how sure you are.
- Causation — knowing when "these two things move together" actually means "one causes the other," and when it is a trap.
We close with the "defense" layer: the named paradoxes and chart tricks that fool smart people — including the experts who produce the statistics. By the end you should be able to read a headline, a study, or a dashboard and ask the three or four questions that separate real signal from noise.
12.1 Inference: the leap from spoonful to soup
We almost never measure everyone. We measure a sample and try to say something true about the whole population.
- Population
- Everyone (or everything) you actually care about — e.g. all adult voters in a country.
- Sample
- The subset you actually measured — e.g. 1,200 voters you phoned.
- Inference
- The reasoned leap from "what I saw in the sample" to "what is probably true in the population."
The sampling distribution — the idea that makes inference possible
Here is the mental move that everything rests on. Imagine you don't take one sample — you take a thousand. A thousand different pollsters each survey 1,000 people. Each gets a slightly different average. If you collected all 1,000 of those averages and drew a histogram, you'd get a new distribution: the sampling distribution of the mean — the spread of the answer itself, across many possible samples.
You only ever take one sample in real life. But knowing how that imaginary cloud of answers would behave is what lets you say "our single answer is probably within this much of the truth."
ONE population
|
| draw many samples of size n
v
[sample 1] -> mean = 51.2
[sample 2] -> mean = 49.8
[sample 3] -> mean = 50.5
... ...
|
v
Histogram of all those means
= SAMPLING DISTRIBUTION
(narrow + bell-shaped if n is large)
The Central Limit Theorem (CLT)
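The Central Limit Theorem says something almost magical: if you average a large-enough sample of independent observations, the distribution of that average across repeated samples is approximately a normal (bell) curve, even if the original data is not bell-shaped at all (as long as its spread is finite).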
This is why so much of statistics assumes a bell curve: we usually work with averages, and averages go normal. It's the engine behind confidence intervals and most significance tests.
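A small simulation can make the "thousand pollsters" picture concrete. This is only a sketch under stated assumptions: the population is a made-up, strongly skewed (exponential) one with mean 1, and the helper names `sample_mean`, `sampling_distribution`, and `skewness` are invented for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population with mean 1."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

def sampling_distribution(n, repeats=5000):
    """Means from many imaginary repeat studies, each of size n."""
    return [sample_mean(n) for _ in range(repeats)]

def skewness(values):
    """Rough lopsidedness measure: 0 for a symmetric bell, positive for a right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

means_small = sampling_distribution(2)    # tiny samples: means still skewed
means_large = sampling_distribution(100)  # larger samples: means close to a bell

print("skew of means, n=2:    ", round(skewness(means_small), 2))
print("skew of means, n=100:  ", round(skewness(means_large), 2))
print("spread of means, n=100:", round(statistics.pstdev(means_large), 3))
```

The skew of the averages should fall sharply as n grows, and their spread for n=100 should land near 1 ÷ √100 = 0.1, even though no single observation comes from a bell curve.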
Standard error: SD's cousin that everyone confuses
This distinction is constantly mangled, so go slowly.
| | Standard deviation (SD) | Standard error (SE) |
|---|---|---|
| Measures the spread of… | individual data points (people) | sample averages (averages-of-people) |
| Answers… | "How far does a typical person sit from average?" | "How much would my estimate of the average wobble if I redid the study?" |
| Effect of a bigger sample | Doesn't shrink — it describes the population | Shrinks (≈ SD ÷ √n) — bigger samples give steadier estimates |
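The last row of the table can be checked with a few lines of code. This sketch assumes a hypothetical population of heights (mean 170 cm, SD 10 cm); the function name `sd_and_se` is made up for the example.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed for a repeatable run

def sd_and_se(n):
    """Return (sample SD, standard error of the mean) for n simulated heights."""
    # Hypothetical population: mean 170 cm, SD 10 cm.
    heights = [rng.gauss(170, 10) for _ in range(n)]
    sd = statistics.stdev(heights)  # spread of individual people
    se = sd / math.sqrt(n)          # wobble of the estimated average
    return sd, se

for n in (25, 400, 10000):
    sd, se = sd_and_se(n)
    print(f"n={n:>6}  SD={sd:5.2f}  SE={se:5.2f}")
```

The SD column should hover near 10 at every sample size, while the SE column keeps shrinking as n grows.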
12.2 Confidence intervals: report a lane, not a dot
A single number ("52% support") hides how shaky it is. A confidence interval (CI) reports a range plus a reliability level.
- Confidence interval
- A range built by a method that captures the true value a stated percentage of the time (commonly 95%). You'll often see it as "52% ± 3%", i.e. 49%–55%.
In the news, a poll's "margin of error" is just half the width of a confidence interval. A poll with "52% ± 3%" is telling you the race could honestly be anywhere from 49% to 55% — which often means "too close to call," no matter how the headline reads.
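To see where a "± 3%" might come from, here is a minimal sketch using the common normal-approximation formula for a proportion. The sample size of 1,000 is an assumption for illustration, and `proportion_ci` is a hypothetical helper, not a library function.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    margin = z * se                          # the poll's "margin of error"
    return p_hat - margin, p_hat + margin, margin

# Assumed poll: 52% support among 1,000 respondents.
low, high, margin = proportion_ci(0.52, 1000)
print(f"52% ± {margin:.1%}  ->  {low:.1%} to {high:.1%}")
```

With these inputs the margin comes out at roughly three percentage points, and the interval straddles 50%. This approximation gets unreliable for small samples or proportions near 0% or 100%.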
12.3 Hypothesis testing and the p-value (handle with care)
The second goal of inference is deciding whether an effect is real or just luck of the draw. The framework is deliberately conservative, like a courtroom.
- Null hypothesis
- The boring default: "no effect, no difference." The new drug does nothing; the two groups are the same.
- Hypothesis test
- You assume the null is true, then ask: "How surprising is the data I actually got?"
- p-value
- The probability of seeing data at least this extreme if the null were true. Small p = the data is surprising under "nothing's happening," so maybe something is happening.
By convention we "reject the null" when p falls below a threshold called alpha (α), usually 0.05. That 0.05 was popularized by the statistician Ronald Fisher; it is a habit, not a law of nature.
The p-value is the single most misunderstood number in science. The American Statistical Association issued a formal 2016 statement just to correct the abuse. Memorize what it is not:
- It is not the probability that your hypothesis is true.
- It is not the probability the result happened "by chance."
- It is not a measure of how big or important the effect is.
- p > 0.05 does not prove the null. "No evidence of an effect" is absence of evidence, not evidence of absence.
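It can help to compute one p-value by hand to see what it actually is. The sketch below uses an assumed scenario (60 heads in 100 coin flips, with "the coin is fair" as the null) and a hypothetical helper `two_sided_binomial_p` that adds up exact binomial probabilities.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """P(a result at least as far from flips/2 as the one observed | fair coin)."""
    distance = abs(heads - flips / 2)
    # Count every outcome that is at least this far from an even split.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme / 2 ** flips  # each of the 2**flips sequences is equally likely

p = two_sided_binomial_p(60, 100)
print(f"p-value for 60 heads in 100 flips: {p:.4f}")
```

The result sits just above 0.05. Notice what was computed: how often a fair coin produces data this lopsided. Nothing in the calculation says how likely it is that the coin is fair.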
Statistical vs practical significance
This is where huge datasets quietly mislead. With a million users, almost any difference becomes "statistically significant" — because a giant sample shrinks the standard error to nearly nothing.
That is why you must always pair significance with effect size — a number for how big the effect is (the 0.1 kg, the 3% lift, the $40 difference), separate from whether it cleared a p-threshold.
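The sketch below shows the same tiny difference tested at two sample sizes. All the numbers are assumptions for illustration (a 0.1 kg difference, an SD of 15 kg, a simple two-sample z-test with known SD), and `two_sample_z` is a made-up helper.

```python
import math

def two_sample_z(diff, sd, n_per_group):
    """z statistic and two-sided p-value for a difference in means
    between two equal-sized groups sharing a known SD."""
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p

diff, sd = 0.1, 15.0     # hypothetical: 0.1 kg difference, SD of 15 kg
effect_size = diff / sd  # difference expressed in SD units (Cohen's d)

for n in (100, 1_000_000):
    z, p = two_sample_z(diff, sd, n)
    print(f"n={n:>9,}  z={z:5.2f}  p={p:.2g}  effect size d={effect_size:.4f}")
```

The effect size is identical in both rows and tiny in both. Only the p-value changes, because the larger sample shrinks the standard error.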
Two ways to be wrong
| | Null is actually true | Null is actually false |
|---|---|---|
| You reject the null | Type I error (false positive — cried wolf) | Correct! (you found a real effect) |
| You keep the null | Correct! (no effect, you said none) | Type II error (false negative — missed the wolf) |
Power is the ability to catch a real effect when it exists (avoiding the false negative). Bigger samples mean more power. The α=0.05 threshold is like a smoke alarm's sensitivity dial: turn it too sensitive and you get false alarms (Type I); turn it too dull and you miss real fires (Type II).
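Both error rates can be estimated by simulating many imaginary studies. This is a sketch under simplifying assumptions: a z-test of "mean = 0" with a known SD of 1, an assumed true effect of 0.3 when the null is false, and a hypothetical helper `rejection_rate`.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed for a repeatable run
Z_CRIT = 1.96           # two-sided cutoff for alpha = 0.05

def rejection_rate(true_mean, n, trials=2000):
    """Share of simulated studies that reject 'mean = 0' (known SD = 1)."""
    rejections = 0
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.gauss(true_mean, 1) for _ in range(n))
        z = sample_mean / (1 / math.sqrt(n))  # estimate divided by its standard error
        if abs(z) > Z_CRIT:
            rejections += 1
    return rejections / trials

type_i = rejection_rate(true_mean=0.0, n=50)        # null is true: false alarms
power_small = rejection_rate(true_mean=0.3, n=20)   # real effect, small sample
power_large = rejection_rate(true_mean=0.3, n=200)  # real effect, big sample

print(f"Type I rate (null true):      {type_i:.1%}")
print(f"Power with n=20  (effect 0.3): {power_small:.1%}")
print(f"Power with n=200 (effect 0.3): {power_large:.1%}")
```

The false-alarm rate should land near 5% by construction, while power climbs steeply with sample size. A small study that misses the effect is making a Type II error, not finding "no effect."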
12.4 The big one: correlation is not causation
If you remember one sentence from this entire guide, make it this. Correlation measures how two things move together; it says nothing about why.
- Correlation (r)
- A number from −1 to +1. +1 = perfect lockstep upward; 0 = no straight-line link; −1 = perfect opposite. Height and weight sit around +0.7. It only captures straight-line association.
- Causation
- One thing actually makes the other happen. Far harder to prove.
- Confounding variable (lurking variable)
- A hidden third factor that influences both X and Y, creating a link between them that isn't causal. In the classic ice-cream-and-drownings example, hot weather is the confounder.
What it LOOKS like: ice cream --> drownings
What's REALLY going on:
hot weather
/ \
v v
ice cream drownings
(the two are linked only THROUGH the hidden cause)
- Spurious correlation
- A correlation that is pure coincidence or driven entirely by a confounder, with no real connection.
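A confounder is easy to fake in a simulation. In the sketch below every number is invented: temperature drives both ice-cream sales and drownings, and neither series has any effect on the other. The helper `pearson_r` is written out by hand so the example needs nothing beyond the standard library.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed for a repeatable run

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical daily data: temperature drives both series.
temps = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [20 * t + rng.gauss(0, 80) for t in temps]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temps]

r_all = pearson_r(ice_cream, drownings)

# Hold the confounder roughly fixed: keep only days between 24 and 26 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 24 <= t <= 26]
r_band = pearson_r([i for i, _ in band], [d for _, d in band])

print(f"correlation across all days:        {r_all:.2f}")
print(f"correlation at similar temperature: {r_band:.2f}")
```

Across all days the two series look strongly linked. Once temperature is held roughly constant, the link should mostly disappear, which is the signature of a confounded relationship.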
Regression: a line, still not a cause
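Regression fits the straight line that best predicts Y from X. Its slope tells you how much Y changes, on average, for each one-unit increase in X. That is a statement about association, not proof of an effect.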
R² ("R-squared") is the fraction of the variation in Y the line explains, from 0 to 1. R² = 0.4 means the model accounts for 40% of the ups and downs in Y; the other 60% is unexplained. A high R² still does not mean causation.
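Here is a minimal sketch of how the slope and R² are computed by ordinary least squares. The data (hours studied versus exam score) is made up for illustration, and `fit_line` is a hypothetical helper rather than a library call.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-squared = 1 - (unexplained variation / total variation)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 60, 68, 70]

slope, intercept, r2 = fit_line(hours, scores)
print(f"score ≈ {intercept:.1f} + {slope:.2f} × hours,  R² = {r2:.2f}")
```

Even with a high R², nothing in this arithmetic knows whether studying raises scores or whether stronger students simply choose to study more.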
How you actually earn the word "cause"
The gold standard is the randomized controlled trial (RCT): you randomly assign who gets the treatment and who doesn't.
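A simulation can show why random assignment matters. The sketch assumes an invented world where a treatment does nothing at all, but a hidden trait ("fitness") improves outcomes. The function `mean_gap` and all numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(4)  # fixed seed for a repeatable run

def mean_gap(assign):
    """Mean outcome of treated minus untreated when the treatment does nothing."""
    treated, untreated = [], []
    for _ in range(20000):
        fitness = rng.gauss(0, 1)                  # hidden confounder
        outcome = 2.0 * fitness + rng.gauss(0, 1)  # treatment has zero true effect
        (treated if assign(fitness) else untreated).append(outcome)
    return statistics.fmean(treated) - statistics.fmean(untreated)

# Observational study: fitter people choose the treatment themselves.
observational = mean_gap(lambda fitness: fitness > 0)
# Randomized trial: a coin flip decides, ignoring fitness entirely.
randomized = mean_gap(lambda fitness: rng.random() < 0.5)

print(f"apparent effect, self-selected: {observational:.2f}")
print(f"apparent effect, randomized:    {randomized:.2f}")
```

The self-selected comparison should show a large "effect" that is entirely due to fitness. With coin-flip assignment the two groups have about the same fitness on average, so the gap should shrink to roughly zero, which is the truth here.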
12.5 The defense layer: paradoxes that fool experts
These named traps recur everywhere. Learn to spot them by name.
Base-rate neglect — and Bayes to the rescue
The base rate is how common something is before any test or evidence. Ignoring it produces wildly wrong conclusions, and even doctors fall for it.
To see why, take a condition with a base rate of 1 in 1,000 and a test with a 5% false-positive rate, and imagine 1,000 people being tested. About 1 is truly sick and tests positive. But of the 999 healthy people, 5% (about 50) also test positive: false alarms. So ~51 people test positive and only 1 is sick: 1 ÷ 51 ≈ 2%. The huge healthy group produces far more false positives than the tiny sick group produces true ones.
1,000 people
|--- 1 sick -> ~1 positive (true)
|--- 999 healthy -> ~50 positive (false, 5% of 999)
------------------------
~51 positives, only 1 real
chance you're sick given a + test = 1/51 ≈ 2%
This is Bayes' theorem (Chapter 11) in action: your posterior belief combines the evidence with the base rate. Skip the base rate and you'll badly overestimate.
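The head-count above can be written as a one-line Bayes calculation. This sketch assumes the test catches every sick person (sensitivity of 100%), matching the "1 sick, 1 positive" count; `posterior_sick` is a made-up function name.

```python
def posterior_sick(base_rate, sensitivity, false_positive_rate):
    """P(sick | positive test) via Bayes' theorem."""
    true_pos = base_rate * sensitivity                 # sick and flagged
    false_pos = (1 - base_rate) * false_positive_rate  # healthy but flagged
    return true_pos / (true_pos + false_pos)

# Assumed inputs: 1-in-1,000 base rate, perfect sensitivity, 5% false positives.
p = posterior_sick(base_rate=1 / 1000, sensitivity=1.0, false_positive_rate=0.05)
print(f"P(sick | positive) = {p:.1%}")
```

Try raising `base_rate` to 0.1: the same test suddenly becomes far more convincing, because the evidence has not changed but the starting point has.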
Simpson's paradox — when subgroups all say the opposite of the whole
A trend that holds in every subgroup can reverse when you lump them together.
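The reversal is easiest to believe when you can check the arithmetic. The counts below follow the often-cited kidney-stone treatment comparison, but treat them as illustrative rather than as a citation. The names `results`, `rate`, and `overall` are invented for the sketch.

```python
# (successes, patients) per treatment and stone size; illustrative counts.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate as a fraction."""
    return successes / total

def overall(treatment):
    """Success rate with both subgroups lumped together."""
    groups = results[treatment].values()
    return rate(sum(s for s, _ in groups), sum(n for _, n in groups))

for size in ("small", "large"):
    a = rate(*results["A"][size])
    b = rate(*results["B"][size])
    print(f"{size:>5} stones:  A {a:.0%}  vs  B {b:.0%}")

print(f"  all stones:  A {overall('A'):.0%}  vs  B {overall('B'):.0%}")
```

Treatment A wins in both subgroups yet loses overall, because A was used mostly on the harder (large-stone) cases. The mix of cases acts as the confounder.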
Regression to the mean — the invisible force behind false cause
Regression to the mean says: an extreme measurement tends to be followed by a more ordinary one — for no reason except that extremes are partly luck, and luck doesn't repeat.
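You can watch regression to the mean happen in a toy model. The sketch assumes each observed score is stable skill plus one-off luck, with invented numbers throughout.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed for a repeatable run

# Hypothetical model: observed score = stable skill + one-off luck.
skills = [rng.gauss(100, 10) for _ in range(10000)]
round1 = [s + rng.gauss(0, 10) for s in skills]
round2 = [s + rng.gauss(0, 10) for s in skills]  # same skill, fresh luck

cutoff = sorted(round1)[-1000]  # threshold for the top 10% in round 1
top = [i for i, score in enumerate(round1) if score >= cutoff]

top_round1 = statistics.fmean(round1[i] for i in top)
top_round2 = statistics.fmean(round2[i] for i in top)

print(f"top performers, round 1: {top_round1:.1f}")
print(f"same people, round 2:    {top_round2:.1f}")
```

The top group should score noticeably lower the second time, yet still above the overall average of 100. Nothing was done to them between rounds. Any intervention applied after round 1 would wrongly get the blame (or credit) for the change.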
Survivorship bias — studying only the winners
You analyze the survivors you can see and forget the ones who vanished.
The same trap hides in business advice: studying only successful founders ("they all dropped out of college!") while ignoring the identical-looking people who did the same thing and failed silently.
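A toy simulation shows how survivorship manufactures skill out of noise. Everything here is assumed for illustration: 5,000 hypothetical funds with no skill at all, five years of random returns, and a rule that funds which lost money overall shut down and drop out of the data.

```python
import random
import statistics

rng = random.Random(6)  # fixed seed for a repeatable run

# Hypothetical: 5,000 funds with zero skill; each yearly return is pure noise.
funds = [[rng.gauss(0.0, 0.15) for _ in range(5)] for _ in range(5000)]

def total_return(yearly):
    """Compound a list of yearly returns into one overall return."""
    value = 1.0
    for r in yearly:
        value *= 1 + r
    return value - 1

all_returns = [total_return(f) for f in funds]
survivors = [r for r in all_returns if r > 0]  # assume money-losing funds closed

print(f"average return, every fund:     {statistics.fmean(all_returns):+.1%}")
print(f"average return, survivors only: {statistics.fmean(survivors):+.1%}")
print(f"funds still visible: {len(survivors)} of {len(all_returns)}")
```

The full population should average out near zero, while the survivors look like talented managers. The filter creates the success story, not the strategy.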
12.6 How charts lie
Our brains read a slope or shape faster than we read the labels — so a chart can mislead before you've processed a single number. The classics:
| Trick | What it does |
|---|---|
| Truncated y-axis | The vertical axis doesn't start at zero, so a tiny 1% difference looks like a cliff. |
| Cherry-picked time window | Shows only the months that suit the story, hiding the longer trend. |
| Dual axes | Two unrelated lines on two different scales, slid until they appear to track — manufacturing a fake correlation. |
| Area / 3-D effects | Doubling a circle's radius quadruples its area, so "twice as big" looks four times as big. |
| Color emphasis | One bar in alarm-red to draw the eye and steer the conclusion. |
This goes back to Darrell Huff's 1954 classic How to Lie with Statistics, still the best short tour of statistical and visual deception.
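Two of the tricks in the table reduce to simple arithmetic. The sketch below uses made-up bar values (50.0 and 50.5, a 1% difference) and a hypothetical helper `visual_ratio` to compare what the data says with what the eye sees.

```python
import math

def visual_ratio(a, b, axis_start=0.0):
    """How many times taller bar b is drawn than bar a, given where the y-axis starts."""
    return (b - axis_start) / (a - axis_start)

honest = visual_ratio(50.0, 50.5)                      # axis starts at zero
truncated = visual_ratio(50.0, 50.5, axis_start=49.5)  # axis starts just below the bars

def circle_area(radius):
    """Area of a circle, which is what the eye compares in bubble-style charts."""
    return math.pi * radius ** 2

area_ratio = circle_area(2.0) / circle_area(1.0)  # radius doubled

print(f"true ratio of the bars:       {honest:.2f}x")
print(f"drawn ratio, truncated axis:  {truncated:.2f}x")
print(f"area ratio when radius is 2x: {area_ratio:.1f}x")
```

On the truncated axis, a 1% difference is drawn as one bar twice the height of the other. That is why checking where the axis starts is worth the two seconds it takes.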
12.7 A field guide to the other named traps
- Gambler's fallacy
- "Red is due after five blacks." Independent events have no memory: the next spin has the same odds as every spin before it. (Chapter 11.)
- Texas sharpshooter / p-hacking
- Firing at a barn, then painting the target around the tightest cluster. In data: testing dozens of things and reporting only the ones that came out significant. Test enough things and a false positive becomes almost certain.
- HARKing
- "Hypothesizing After the Results are Known" — pretending the thing you found by accident was what you set out to test. Cheating, dressed as discovery.
- Prosecutor's fallacy
- Confusing "the chance of this evidence if the person were innocent" with "the chance the person is innocent given the evidence." A 1-in-a-million DNA match in a city of a million still has, on the base rate, about one other matching innocent person. (Bayes again.)
- Relative vs absolute risk
- "Doubles your risk!" can mean going from 1-in-a-million to 2-in-a-million — a true statement that's practically meaningless. Always ask for the absolute numbers.
- Anecdote vs data
- "My uncle smoked till 90" is one vivid story, not evidence against a trend measured across millions.
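The p-hacking entry above has simple arithmetic behind it. This sketch assumes the tests are independent and that every null hypothesis is true; `chance_of_false_positive` is a made-up helper.

```python
def chance_of_false_positive(tests, alpha=0.05):
    """P(at least one 'significant' result) across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests -> {chance_of_false_positive(k):.0%} chance of at least one false positive")
```

With twenty tests the odds of a spurious "discovery" are already better than even. That is why the number of comparisons tried matters as much as the one that was reported.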
12.8 Putting it together: a checklist for reading any statistic
When a number lands in front of you — a study, a headline, a dashboard, an ad — run it through this:
- Plot it / picture it first. Anscombe's quartet (four datasets with nearly identical means, SDs, and correlations that look completely different when graphed) shows that summary numbers can hide skew, outliers, and curves. Always visualize.
- Mean or median? For skewed things (income, prices, wait times), demand the median. Remember Bill Gates walking into a bar: the mean wealth rockets up, the median barely moves. The mean lies about the typical person.
- Where's the uncertainty? A point estimate with no CI or error bar is half an answer.
- Significant or meaningful? Get the effect size. Big samples make trivial effects "significant."
- Correlation or causation? Name a confounder before believing cause. Was it an RCT, or just observation?
- Check the subgroups. Could Simpson's paradox be flipping the aggregate?
- Who's missing? Survivorship and sampling bias — whose data never made it into the dataset?
- Interrogate the chart. Axes, window, denominator, source.
- Out of how many? Always find the base rate and the denominator. Relative risk without absolute numbers is theater.
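The mean-versus-median item on the checklist is worth seeing once with numbers. The figures below are invented: ten hypothetical bar patrons, then one newcomer with an illustrative net worth of $100 billion.

```python
import statistics

# Hypothetical net worths (in dollars) of ten bar patrons.
patrons = [40_000, 55_000, 60_000, 72_000, 80_000,
           85_000, 90_000, 110_000, 130_000, 150_000]
# One extreme newcomer walks in; the figure is illustrative only.
with_billionaire = patrons + [100_000_000_000]

for label, group in (("before", patrons), ("after", with_billionaire)):
    mean = statistics.fmean(group)
    median = statistics.median(group)
    print(f"{label:>6}:  mean = ${mean:,.0f}   median = ${median:,.0f}")
```

The mean jumps into the billions while the median moves by a few thousand dollars. For skewed quantities, the median is the better description of a typical member of the group.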
Master this chapter and you've gained something rare: the ability to be appropriately skeptical without being cynical. You can hold "the evidence leans this way" and "I might be wrong, here's how much" in your head at the same time. That balance — updating your beliefs with new evidence instead of flipping between blind trust and blanket dismissal — is the whole point of data literacy.