Inference, Correlation vs Causation, and How Statistics Mislead (Advanced)

By Pritesh Yadav · 19 min read

In the last two chapters you learned to describe data (means, spreads, distributions) and to reason about chance (probability, Bayes). This chapter is the payoff. It is about two things that decide whether a number deserves your trust:

  1. Inference — using a small sample to make honest claims about a large population, and saying how sure you are.
  2. Causation — knowing when "these two things move together" actually means "one causes the other," and when it is a trap.

We close with the "defense" layer: the named paradoxes and chart tricks that fool smart people — including the experts who produce the statistics. By the end you should be able to read a headline, a study, or a dashboard and ask the three or four questions that separate real signal from noise.

Key takeaway: A statistic is not a fact. It is a measurement made under conditions, with uncertainty, by someone who chose what to count. Your job is to interrogate the conditions, the uncertainty, and the choices — not just read the number.

12.1 Inference: the leap from spoonful to soup

We almost never measure everyone. We measure a sample and try to say something true about the whole population.

Population
Everyone (or everything) you actually care about — e.g. all adult voters in a country.
Sample
The subset you actually measured — e.g. 1,200 voters you phoned.
Inference
The reasoned leap from "what I saw in the sample" to "what is probably true in the population."
Analogy: Tasting soup. You don't drink the whole pot to know if it needs salt; one well-stirred spoonful tells you what you need to know. But if the pot isn't stirred, the spoonful lies. Stirring is randomness; it is what makes a small taste trustworthy.

The sampling distribution — the idea that makes inference possible

Here is the mental move that everything rests on. Imagine you don't take one sample — you take a thousand. A thousand different pollsters each survey 1,000 people. Each gets a slightly different average. If you collected all 1,000 of those averages and drew a histogram, you'd get a new distribution: the sampling distribution of the mean — the spread of the answer itself, across many possible samples.

You only ever take one sample in real life. But knowing how that imaginary cloud of answers would behave is what lets you say "our single answer is probably within this much of the truth."

ONE population
      |
      |  draw many samples of size n
      v
[sample 1] -> mean = 51.2
[sample 2] -> mean = 49.8
[sample 3] -> mean = 50.5
   ...            ...
      |
      v
Histogram of all those means
= SAMPLING DISTRIBUTION
(narrow + bell-shaped if n is large)
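
The thought experiment can be run directly. The following is a minimal Python sketch, not a method from this chapter: the population (100,000 bell-shaped values centred on 50) and the helper name `sample_mean` are invented for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: made-up values centred near 50.
population = [random.gauss(50, 10) for _ in range(100_000)]

def sample_mean(n):
    """Mean of one random sample of size n drawn from the population."""
    return statistics.fmean(random.sample(population, n))

# 1,000 imaginary pollsters, each surveying 1,000 people.
means = [sample_mean(1_000) for _ in range(1_000)]

spread_of_people = statistics.pstdev(population)  # how much individuals vary
spread_of_means = statistics.stdev(means)         # how much the *answer* varies

print(f"SD of individuals:  {spread_of_people:.2f}")
print(f"SD of sample means: {spread_of_means:.2f}")
```

The second number should come out far smaller than the first: individual values scatter widely, but the averages of 1,000 of them cluster tightly around the true mean.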

The Central Limit Theorem (CLT)

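The Central Limit Theorem says something almost magical: if you average a large-enough random sample, the distribution of that average across repeated samples is approximately a normal (bell) curve, even if the original data is not bell-shaped at all. (The fine print: the observations must be independent and the data's spread must be finite.)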

Analogy: Roll one die — the outcomes 1–6 are flat, equally likely (no bell). Now roll ten dice and take the average. Do that over and over and the averages pile up into a bell curve around 3.5. The lumpy raw data smooths into a bell once you average.

This is why so much of statistics assumes a bell curve: we usually work with averages, and averages go normal. It's the engine behind confidence intervals and most significance tests.

Common mistake: "n ≥ 30 always makes it normal." This is a rough rule of thumb, not a law. For data that is heavily lopsided — income, insurance claims, website session times — you may need hundreds of observations before the sample mean looks bell-shaped. The more skewed (lopsided) the data, the bigger the sample the CLT needs. Don't recite "30" and stop thinking.
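
Both the dice analogy and the warning about lopsided data can be checked by simulation. This sketch is illustrative only: the exponential distribution stands in for "lopsided data" as an assumption, and `skewness` and `means_of_samples` are hypothetical helper names. Skewness near 0 means a symmetric, bell-like shape; a positive value means a long right tail.

```python
import random
import statistics

random.seed(1)

def skewness(xs):
    """Skewness: about 0 for a symmetric shape, positive for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def means_of_samples(draw, n, repeats=2_000):
    """Draw `repeats` samples of size n and return the mean of each one."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(repeats)]

def roll_die():
    return random.randint(1, 6)        # flat and symmetric

def lopsided():
    return random.expovariate(1.0)     # assumed stand-in for skewed data

dice_means = means_of_samples(roll_die, 10)
skew_by_n = {n: skewness(means_of_samples(lopsided, n)) for n in (5, 500)}

print(f"mean of ten-dice averages: {statistics.fmean(dice_means):.2f}")
for n, skew in skew_by_n.items():
    print(f"skewed data, n={n}: skewness of sample means = {skew:.2f}")
```

The ten-dice averages centre on 3.5 with almost no skew, while the averages of the skewed data still carry a visible right tail at n = 5 and only approach symmetry at the larger sample size.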

Standard error: SD's cousin that everyone confuses

This distinction is constantly mangled, so go slowly.

                          | Standard deviation (SD)                           | Standard error (SE)
Measures the spread of…   | individual data points (people)                   | sample averages (averages-of-people)
Answers…                  | "How far does a typical person sit from average?" | "How much would my estimate of the average wobble if I redid the study?"
Effect of a bigger sample | Doesn't shrink; it describes the population       | Shrinks (≈ SD ÷ √n); bigger samples give steadier estimates
Example: Heights of 10,000 men have an SD of about 7 cm. That's a real fact about people and won't change if you measure more men. But the SE of your estimate of the average height from a sample of 100 is much smaller (7 ÷ √100 = 0.7 cm), and it gets smaller still if you sample 10,000 (7 ÷ √10,000 = 0.07 cm). SD describes the world; SE describes how confident you are about your snapshot of it. Notice the √n: to halve your error you need four times the sample.
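
The √n relationship is easy to tabulate. A small sketch using the 7 cm SD from the example; the function name `standard_error` is just a label chosen here.

```python
import math

sd_cm = 7.0  # spread of individual heights, taken from the example

def standard_error(sd, n):
    """SE of a sample mean: SD divided by the square root of the sample size."""
    return sd / math.sqrt(n)

for n in (100, 400, 10_000):
    print(f"n = {n:>6}: SE = {standard_error(sd_cm, n):.2f} cm")
```

Going from 100 to 400 people (four times the sample) halves the SE from 0.70 cm to 0.35 cm; it takes 10,000 people to get down to 0.07 cm.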

12.2 Confidence intervals: report a lane, not a dot

A single number ("52% support") hides how shaky it is. A confidence interval (CI) reports a range plus a reliability level.

Confidence interval
A range built by a method that captures the true value a stated percentage of the time (commonly 95%). You'll often see it as "52% ± 3%", i.e. 49%–55%.
Analogy: Throwing a lasso at a fence post in the dark. You can't see if any single throw landed around the post. But you know your technique catches the post 95 times out of 100. The CI is the lasso method, not a guarantee about this one throw.
Common mistake: Reading "95% CI" as "there's a 95% chance the true value is inside this particular range." That is wrong. The true value is fixed; it's either in this interval or it isn't. The 95% describes the procedure across many hypothetical samples — 95% of intervals built this way would catch the truth. Subtle, but the whole logic of inference depends on it.
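
The "95% describes the procedure" idea can be made concrete by running the procedure many times against a truth we invent. This is a sketch under stated assumptions: a true support of 52%, polls of 1,000 people, and the simple normal-approximation interval, none of which the chapter prescribes.

```python
import math
import random

random.seed(2)

TRUE_SUPPORT = 0.52  # hypothetical truth, known only because we made it up
N = 1_000            # people per simulated poll

def one_poll_interval():
    """Run one simulated poll and return its 95% interval (low, high)."""
    yes = sum(random.random() < TRUE_SUPPORT for _ in range(N))
    p_hat = yes / N
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - margin, p_hat + margin

intervals = [one_poll_interval() for _ in range(2_000)]
coverage = sum(low <= TRUE_SUPPORT <= high for low, high in intervals) / len(intervals)

print(f"share of intervals that caught the truth: {coverage:.1%}")
```

Each individual interval either contains 52% or it doesn't; what hovers near 95% is the share of all 2,000 intervals that do.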

In the news, a poll's "margin of error" is just half the width of a confidence interval. A poll with "52% ± 3%" is telling you the race could honestly be anywhere from 49% to 55% — which often means "too close to call," no matter how the headline reads.
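
As a rough illustration of where a margin like "± 3%" comes from, here is the textbook normal-approximation formula applied to a poll showing 52% support. The sample size of 1,200 is an assumption borrowed from the phone-survey example earlier, and `proportion_ci` is a hypothetical helper name.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a sample proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_ci(0.52, 1_200)
print(f"52% +/- {margin:.1%}  ->  {low:.1%} to {high:.1%}")
```

With these inputs the margin comes out a little under 3 points, and the interval straddles 50%, which is exactly the "too close to call" situation.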

Best practice: Always report (and demand) a range, not just a point. An estimate without its uncertainty is half an answer. Error bars and CIs aren't decoration — they are the result.

12.3 Hypothesis testing and the p-value (handle with care)

The second goal of inference is deciding whether an effect is real or just luck of the draw. The framework is deliberately conservative, like a courtroom.

Null hypothesis
The boring default: "no effect, no difference." The new drug does nothing; the two groups are the same.
Hypothesis test
You assume the null is true, then ask: "How surprising is the data I actually got?"
p-value
The probability of seeing data at least this extreme if the null were true. Small p = the data is surprising under "nothing's happening," so maybe something is happening.
Analogy: Courtroom. The defendant is presumed innocent (the null). You don't have to prove innocence; you need strong evidence to overturn it. A small p-value is strong evidence against innocence — but "strong evidence" is not the same as "certainly guilty."

By convention we "reject the null" when p falls below a threshold called alpha (α), usually 0.05. That 0.05 was popularized by the statistician Ronald Fisher; it is a habit, not a law of nature.

The p-value is the single most misunderstood number in science. The American Statistical Association issued a formal 2016 statement just to correct the abuse. Memorize what it is not:

Common mistake — what a p-value is NOT:
  • It is not the probability that your hypothesis is true.
  • It is not the probability the result happened "by chance."
  • It is not a measure of how big or important the effect is.
  • p > 0.05 does not prove the null. "No evidence of an effect" is absence of evidence, not evidence of absence.
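
To see what a p-value does measure, it helps to compute one from scratch. A minimal sketch with an invented scenario: a coin lands heads 61 times in 100 flips, and the null hypothesis is that the coin is fair. The function name is hypothetical; the calculation is the exact binomial one.

```python
from math import comb

def binomial_two_sided_p(heads, flips):
    """Probability, under a fair coin, of a result at least this far from flips/2."""
    distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme_outcomes / 2 ** flips

p_value = binomial_two_sided_p(61, 100)
print(f"p = {p_value:.4f}")
```

The result (between 0.03 and 0.04) says only this: if the coin were fair, a split at least as lopsided as 61/39 would turn up a few percent of the time. It does not say there is a 96% chance the coin is biased.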

Statistical vs practical significance

This is where huge datasets quietly mislead. With a million users, almost any difference becomes "statistically significant" — because a giant sample shrinks the standard error to nearly nothing.

Example: A weight-loss pill is tested on a million people and produces an average loss of 0.1 kg, with p < 0.001. The result is rock-solid statistically and completely useless practically. 0.1 kg is the weight of a few mouthfuls of water (about 100 mL). Significant ≠ meaningful.

That is why you must always pair significance with effect size — a number for how big the effect is (the 0.1 kg, the 3% lift, the $40 difference), separate from whether it cleared a p-threshold.

Best practice: Lead with effect size, then significance. Ask "How big is it, and does it matter to a real decision?" before "Is it significant?" A confidence interval often beats a p-value because it shows the effect size and the uncertainty in one shot.
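
A quick calculation shows how sample size alone turns a trivial effect "significant." The 0.1 kg average loss is from the pill example; the 5 kg standard deviation is an assumed value for illustration, and `z_test_p` is a hypothetical helper using the normal approximation.

```python
import math

def z_test_p(mean_diff, sd, n):
    """Return (z, two-sided p) for a sample mean tested against zero."""
    se = sd / math.sqrt(n)
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p

# Same 0.1 kg effect, same assumed 5 kg spread, two different sample sizes.
z_small, p_small = z_test_p(0.1, 5.0, 1_000)
z_big, p_big = z_test_p(0.1, 5.0, 1_000_000)

print(f"n = 1,000:     z = {z_small:.2f}, p = {p_small:.2f}")
print(f"n = 1,000,000: z = {z_big:.1f}, p = {p_big:.1e}")
```

The effect size never changes. Only the standard error does, which is why the p-value collapses from unremarkable to astronomically small while the pill stays just as useless.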

Two ways to be wrong

                    | Null is actually true                     | Null is actually false
You reject the null | Type I error (false positive: cried wolf) | Correct! (you found a real effect)
You keep the null   | Correct! (no effect, you said none)       | Type II error (false negative: missed the wolf)

Power is the ability to catch a real effect when it exists (avoiding the false negative). Bigger samples mean more power. The α=0.05 threshold is like a smoke alarm's sensitivity dial: turn it too sensitive and you get false alarms (Type I); turn it too dull and you miss real fires (Type II).
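
Both error rates can be estimated by simulating many studies. This is a sketch, not a power-analysis recipe: the effect of 0.5 SD, the sample sizes, and the simple z-test with a known SD of 1 are all assumptions made to keep it short.

```python
import math
import random
import statistics

random.seed(3)
ALPHA = 0.05

def rejects_null(true_effect, n):
    """Simulate one study and z-test its sample mean against zero (SD known to be 1)."""
    sample_mean = statistics.fmean(random.gauss(true_effect, 1.0) for _ in range(n))
    z = sample_mean * math.sqrt(n)
    p = math.erfc(abs(z) / math.sqrt(2))
    return p < ALPHA

def rejection_rate(true_effect, n, studies=2_000):
    """Share of simulated studies that reject the null."""
    return sum(rejects_null(true_effect, n) for _ in range(studies)) / studies

type_i_rate = rejection_rate(0.0, 50)    # null is true: every rejection is a false alarm
power_small_n = rejection_rate(0.5, 10)  # real effect, small sample
power_large_n = rejection_rate(0.5, 50)  # same real effect, bigger sample

print(f"false-alarm rate when nothing is happening: {type_i_rate:.1%}")
print(f"power with n=10: {power_small_n:.1%}, with n=50: {power_large_n:.1%}")
```

The false-alarm rate sits near α by construction, while the chance of catching the real effect climbs sharply with the larger sample.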

12.4 The big one: correlation is not causation

If you remember one sentence from this entire guide, make it this. Correlation measures how two things move together; it says nothing about why.

Correlation (r)
A number from −1 to +1. +1 = perfect lockstep upward; 0 = no straight-line link; −1 = perfect opposite. Height and weight sit around +0.7. It only captures straight-line association.
Causation
One thing actually makes the other happen. Far harder to prove.
Example — the canonical one: In summer, ice-cream sales rise and drownings rise. They are strongly correlated. Does ice cream cause drowning? Of course not. A hidden third factor — hot weather — drives both: heat sells ice cream and sends people swimming. Ban ice cream and drownings won't budge.
Confounding variable (lurking variable)
A hidden third factor that influences both X and Y, creating a link between them that isn't causal. Heat is the confounder above.
  What it LOOKS like:   ice cream  -->  drownings

  What's REALLY going on:
                         hot weather
                        /            \
                       v              v
                 ice cream        drownings
   (the two are linked only THROUGH the hidden cause)
Spurious correlation
A correlation that is pure coincidence or driven entirely by a confounder, with no real connection.
Example: Tyler Vigen famously collected joke correlations — like per-person cheese consumption tracking the number of people who died tangled in their bedsheets. They line up beautifully on a chart and mean absolutely nothing. With thousands of variables, some will match by sheer luck. Finding a high r is easy; earning the word "cause" is hard.
Key takeaway: When two things move together there are always several explanations: X causes Y, Y causes X, a confounder Z causes both, or it's coincidence. "X causes Y" is just one of these. Before believing it, name a confounder out loud.
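
The ice-cream example can be simulated to watch a confounder manufacture a correlation. Everything numeric here is invented: temperature drives both series, and neither series touches the other. `pearson` and `residuals` are hypothetical helpers; the "adjusted" figure is the correlation left over after the straight-line effect of temperature is removed from both series.

```python
import math
import random
import statistics

random.seed(4)

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def residuals(ys, xs):
    """What is left of ys after removing its straight-line dependence on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Made-up daily data: heat raises both ice-cream sales and drownings.
temperature = [random.uniform(10, 35) for _ in range(2_000)]
ice_cream = [3.0 * t + random.gauss(0, 10) for t in temperature]
drownings = [0.2 * t + random.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream, drownings)
adjusted_r = pearson(residuals(ice_cream, temperature), residuals(drownings, temperature))

print(f"raw correlation:            {raw_r:.2f}")
print(f"after removing temperature: {adjusted_r:.2f}")
```

The raw correlation is strong; once temperature is accounted for it falls to roughly zero, because the link ran entirely through the hidden cause.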

Regression: a line, still not a cause

Regression fits the straight line that best predicts Y from X. Its slope tells you how much Y changes, on average, for each one-unit increase in X.

Example: A regression might say "each extra year of education is associated with about $3,000 more annual salary." The slope is that $3,000. But "associated with" is doing heavy lifting — maybe driven, ambitious families both push education and earn more (a confounder). Regression describes the line; it does not appoint education as the cause.

R² ("R-squared") is the fraction of the variation in Y the line explains, from 0 to 1. R² = 0.4 means the model accounts for 40% of the ups and downs in Y; the other 60% is unexplained. A high R² still does not mean causation.
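
For readers who want to see the slope and R-squared computed, here is a least-squares fit written out by hand. The eight data points are invented to resemble the education-and-salary example (salary in thousands of dollars), and `fit_line` is a hypothetical helper name.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept, r_squared)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: years of education and annual salary in $1,000s.
years = [10, 12, 12, 14, 16, 16, 18, 20]
salary = [31, 38, 35, 44, 47, 52, 55, 60]

slope, intercept, r_squared = fit_line(years, salary)
print(f"slope = {slope:.2f} ($1,000s per extra year), R-squared = {r_squared:.2f}")
```

The fitted slope is close to 3, i.e. about $3,000 per extra year in this made-up data. Nothing in the arithmetic knows or cares why the two columns move together.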

How you actually earn the word "cause"

The gold standard is the randomized controlled trial (RCT): you randomly assign who gets the treatment and who doesn't.

Analogy: A coin flip decides who gets the real drug and who gets a sugar pill. Because the flip is random, the two groups are alike on average in every way (age, diet, wealth, hidden confounders) except the drug, and any leftover differences are chance that shrinks as the groups grow. So if outcomes differ by more than chance allows, the drug is the only systematic cause left standing. Randomization is what breaks confounding.
Best practice: For any observed correlation, brainstorm confounders before believing causation. Only randomization (or careful causal-inference methods) earns "cause." Observational data alone — surveys, historical records, dashboards — can suggest, never prove.
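
A simulation makes the contrast between observation and randomization tangible. In this sketch every number is invented: a hidden "health" score improves outcomes, the treatment truly adds 1.0, and in the observational world healthier people are more likely to take the treatment. The helper names are hypothetical.

```python
import random
import statistics

random.seed(5)
TRUE_EFFECT = 1.0  # what the treatment really adds (assumed)

def outcome(health, treated):
    """Outcome depends on hidden health, on the treatment, and on noise."""
    return health + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

def difference_in_means(rows):
    """Average outcome of the treated minus average outcome of the untreated."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return statistics.fmean(treated) - statistics.fmean(control)

health_scores = [random.gauss(0, 2) for _ in range(20_000)]  # the hidden confounder

# Observational world: healthier people tend to choose the treatment.
observational = []
for h in health_scores:
    chose_treatment = h + random.gauss(0, 1) > 0
    observational.append((chose_treatment, outcome(h, chose_treatment)))

# Randomized world: a coin flip decides, ignoring health entirely.
randomized = []
for h in health_scores:
    coin_flip = random.random() < 0.5
    randomized.append((coin_flip, outcome(h, coin_flip)))

observational_estimate = difference_in_means(observational)
randomized_estimate = difference_in_means(randomized)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational_estimate:.2f}")
print(f"randomized estimate:    {randomized_estimate:.2f}")
```

The observational comparison badly overstates the effect because it mixes in the health gap between the groups; the coin-flip comparison lands close to the true value.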

12.5 The defense layer: paradoxes that fool experts

These named traps recur everywhere. Learn to spot them by name.

Base-rate neglect — and Bayes to the rescue

The base rate is how common something is before any test or evidence. Ignoring it produces wildly wrong conclusions, and even doctors fall for it.

Example (the rare-disease test): A disease affects 1 in 1,000 people. A test has a 5% false-positive rate and, to keep the arithmetic simple, never misses a real case. Your test comes back positive. What's the chance you're actually sick? Most people, and many doctors, say about 95%. The true answer is roughly 2%.

Why? Imagine 1,000 people. About 1 is truly sick (the base rate) and tests positive. But of the 999 healthy people, 5% — about 50 — also test positive (false alarms). So ~51 people test positive and only 1 is sick: 1 ÷ 51 ≈ 2%. The huge healthy group produces far more false positives than the tiny sick group produces true ones.
1,000 people
  |--- 1 sick      -> ~1 positive  (true)
  |--- 999 healthy -> ~50 positive (false, 5% of 999)
                      ------------------------
                      ~51 positives, only 1 real
   chance you're sick given a + test = 1/51 ≈ 2%

This is Bayes' theorem (Chapter 11) in action: your posterior belief combines the evidence with the base rate. Skip the base rate and you'll badly overestimate.
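
The same arithmetic as a reusable function. The sensitivity of 100% is an assumption that matches the counting argument (the one sick person tests positive); `posterior_sick` is a hypothetical name.

```python
def posterior_sick(prevalence, sensitivity, false_positive_rate):
    """P(sick | positive test) by Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

rare = posterior_sick(prevalence=0.001, sensitivity=1.0, false_positive_rate=0.05)
common = posterior_sick(prevalence=0.20, sensitivity=1.0, false_positive_rate=0.05)

print(f"1 in 1,000 base rate: {rare:.1%} chance of being sick after a positive")
print(f"1 in 5 base rate:     {common:.1%}")
```

Changing only the base rate, from 1 in 1,000 to a hypothetical 1 in 5, moves the answer from about 2% to over 80%. The test is identical; the population is not.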

Simpson's paradox — when subgroups all say the opposite of the whole

A trend that holds in every subgroup can reverse when you lump them together.

Example (UC Berkeley, 1973): Overall, men were admitted to graduate school at a higher rate than women (about 44% vs 35%), which looks like bias against women. But department by department, most departments admitted women at rates similar to or higher than men's. How? Women applied more often to the most competitive departments (low admit rates for everyone), while men applied more often to easier ones. The aggregate reversed the picture in the subgroups.
Example — kidney stones: Open surgery beats a less-invasive procedure for both small stones and large stones — yet looks worse overall. Why? Surgeons used the tougher surgery on the harder (large-stone) cases. The mix, not the treatment, drove the misleading total.
Common mistake: Trusting an aggregate number without asking how the groups were mixed. Always check subgroups. An overall rate can point in exactly the wrong direction when the groups being averaged are very different in size or difficulty.
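
The kidney-stone reversal is easy to verify by hand. The counts below are the figures commonly quoted from the 1986 study behind this example; treat them as an assumption if you need the exact source table. The helper names are hypothetical.

```python
# (successes, patients) per treatment and stone size
results = {
    "open surgery":  {"small": (81, 87),   "large": (192, 263)},
    "less invasive": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, patients):
    return successes / patients

def overall_rate(groups):
    """Success rate after lumping all stone sizes together."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients

for treatment, groups in results.items():
    small, large = rate(*groups["small"]), rate(*groups["large"])
    print(f"{treatment:<14} small: {small:.0%}  large: {large:.0%}  overall: {overall_rate(groups):.0%}")
```

Open surgery wins within each stone size but loses overall, because most of its patients were the hard, large-stone cases while most of the other procedure's patients had small stones.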

Regression to the mean — the invisible force behind false cause

Regression to the mean says: an extreme measurement tends to be followed by a more ordinary one — for no reason except that extremes are partly luck, and luck doesn't repeat.

Example: A rookie athlete has a blazing first season, lands on a magazine cover, then cools off — the "cover jinx." No curse: the hot streak was partly luck, and luck regresses. The same illusion makes us think punishing bad performance "works" (a terrible day was going to improve anyway) and praising great performance "backfires" (a peak day was going to dip anyway). We credit the intervention for what was just reversion.
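
Regression to the mean appears in any simulation where results mix skill and luck. In this sketch, with invented numbers, each player's season score is a fixed skill plus fresh luck, and nothing at all happens to the stars between seasons.

```python
import random
import statistics

random.seed(6)

PLAYERS = 10_000
skill = [random.gauss(0, 1) for _ in range(PLAYERS)]
season1 = [s + random.gauss(0, 1) for s in skill]  # skill plus this year's luck
season2 = [s + random.gauss(0, 1) for s in skill]  # same skill, new luck

# The "magazine cover" group: the top 10% of season 1.
cutoff = sorted(season1)[-PLAYERS // 10]
stars = [i for i in range(PLAYERS) if season1[i] >= cutoff]

stars_season1 = statistics.fmean(season1[i] for i in stars)
stars_season2 = statistics.fmean(season2[i] for i in stars)

print(f"stars in season 1: {stars_season1:.2f}")
print(f"same players in season 2: {stars_season2:.2f}")
```

The stars' second-season average drops markedly yet stays above the league average of zero: the skill part persists, the luck part does not. Any intervention applied between the seasons would have looked like it "caused" the decline.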

Survivorship bias — studying only the winners

You analyze the survivors you can see and forget the ones who vanished.

Example — Abraham Wald and the WWII planes: The military wanted to armor returning bombers where they had the most bullet holes. Statistician Abraham Wald said the opposite: armor the spots with no holes. The planes hit there never made it back to be counted. The data was all survivors — the gaps marked the fatal hits.

The same trap hides in business advice: studying only successful founders ("they all dropped out of college!") while ignoring the identical-looking people who did the same thing and failed silently.

12.6 How charts lie

Our brains read a slope or shape faster than we read the labels — so a chart can mislead before you've processed a single number. The classics:

Trick                     | What it does
Truncated y-axis          | The vertical axis doesn't start at zero, so a tiny 1% difference looks like a cliff.
Cherry-picked time window | Shows only the months that suit the story, hiding the longer trend.
Dual axes                 | Two unrelated lines on two different scales, slid until they appear to track, manufacturing a fake correlation.
Area / 3-D effects        | Doubling a circle's radius quadruples its area, so "twice as big" looks four times as big.
Color emphasis            | One bar in alarm-red to draw the eye and steer the conclusion.
Example: A company posts a bar chart where the y-axis runs from 98.8% to 100%. A move from 99.0% to 99.2%, basically nothing, becomes a bar that's visibly twice as tall (0.4 units above the baseline instead of 0.2). Redraw it starting at zero and the "huge improvement" vanishes.
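
How much a truncated axis exaggerates depends entirely on where the axis starts. A small sketch for a move from 99.0% to 99.2%; the baselines tried and the function name are chosen here for illustration.

```python
def drawn_height_ratio(before, after, axis_start):
    """How many times taller the second bar is drawn for a given axis baseline."""
    return (after - axis_start) / (before - axis_start)

for axis_start in (0.0, 98.0, 98.8):
    ratio = drawn_height_ratio(99.0, 99.2, axis_start)
    print(f"axis starts at {axis_start:>4}%: second bar drawn {ratio:.2f}x as tall")
```

Starting at zero the bars are visually indistinguishable; the closer the baseline creeps to the data, the more dramatic the same 0.2-point change looks.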

This goes back to Darrell Huff's 1954 classic How to Lie with Statistics, still the best short tour of statistical and visual deception.

Best practice: Before you trust a chart, check four things: the axes (do they start at zero?), the time window (why these dates?), the denominator ("out of how many?"), and the source (who benefits from this conclusion?).

12.7 A field guide to the other named traps

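Gambler's fallacy
"Red is due after five blacks." Independent events have no memory: the odds on the next spin are exactly what they were on the first (a little under 50/50 for red on a real wheel, because of the green zero). (Chapter 11.)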
Texas sharpshooter / p-hacking
Firing at a barn, then painting the target around the tightest cluster. In data: testing dozens of things and reporting only the ones that came out significant. Test enough things and a false positive is all but guaranteed.
HARKing
"Hypothesizing After the Results are Known" — pretending the thing you found by accident was what you set out to test. Cheating, dressed as discovery.
Prosecutor's fallacy
Confusing "the chance of this evidence if the person were innocent" with "the chance the person is innocent given the evidence." In a city of a million, a 1-in-a-million DNA match is expected to fit about one innocent person besides the culprit, so the match alone is far from million-to-one proof of guilt. (Bayes again.)
Relative vs absolute risk
"Doubles your risk!" can mean going from 1-in-a-million to 2-in-a-million — a true statement that's practically meaningless. Always ask for the absolute numbers.
Anecdote vs data
"My uncle smoked till 90" is one vivid story, not evidence against a trend measured across millions.
Example — p-hacking in the wild: A team tests whether a new feature affects 40 different metrics. Two come back "significant" at p < 0.05. But at the 0.05 threshold you'd expect about 2 of 40 to look significant by pure chance even if the feature did nothing. Reporting only those two, and burying the other 38, is the sharpshooter fallacy in a slide deck.
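
The arithmetic behind the 40-metric example, assuming the tests are independent and the feature truly does nothing:

```python
tests = 40
alpha = 0.05

# Average number of metrics that clear the threshold by luck alone.
expected_false_positives = tests * alpha

# Chance that at least one of the 40 looks "significant" by luck alone.
p_at_least_one = 1 - (1 - alpha) ** tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one:   {p_at_least_one:.0%}")
```

Two spurious hits is the average outcome, and the chance of getting at least one is roughly 87%, so a couple of "significant" metrics out of 40 is what a do-nothing feature looks like.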
Best practice: State your hypothesis and your sample size before you look at the data. Pre-committing is the single strongest defense against fooling yourself (and others) with p-hacking and HARKing.

12.8 Putting it together: a checklist for reading any statistic

When a number lands in front of you — a study, a headline, a dashboard, an ad — run it through this:

  1. Plot it / picture it first. Anscombe's quartet (four datasets with nearly identical means, SDs, and correlation that look completely different when graphed) proves summary numbers can hide skew, outliers, and curves. Always visualize.
  2. Mean or median? For skewed things (income, prices, wait times), demand the median. Remember Bill Gates walking into a bar: the mean wealth rockets up, the median barely moves. The mean lies about the typical person.
  3. Where's the uncertainty? A point estimate with no CI or error bar is half an answer.
  4. Significant or meaningful? Get the effect size. Big samples make trivial effects "significant."
  5. Correlation or causation? Name a confounder before believing cause. Was it an RCT, or just observation?
  6. Check the subgroups. Could Simpson's paradox be flipping the aggregate?
  7. Who's missing? Survivorship and sampling bias — whose data never made it into the dataset?
  8. Interrogate the chart. Axes, window, denominator, source.
  9. Out of how many? Always find the base rate and the denominator. Relative risk without absolute numbers is theater.
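
Item 2 of the checklist, the billionaire-in-a-bar effect, takes only a few lines to reproduce. All the net-worth figures below are invented for illustration.

```python
import statistics

# Hypothetical net worths (in dollars) of nine bar regulars.
regulars = [40_000, 55_000, 62_000, 75_000, 90_000, 110_000, 130_000, 150_000, 200_000]
billionaire = 100_000_000_000  # one very wealthy newcomer (made-up figure)

before_mean = statistics.mean(regulars)
before_median = statistics.median(regulars)
after_mean = statistics.mean(regulars + [billionaire])
after_median = statistics.median(regulars + [billionaire])

print(f"mean:   {before_mean:>18,.0f} -> {after_mean:>18,.0f}")
print(f"median: {before_median:>18,.0f} -> {after_median:>18,.0f}")
```

One arrival pushes the mean from about a hundred thousand dollars to about ten billion, while the median moves from 90,000 to 100,000. Nobody in the bar actually got richer.
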
Key takeaway: Statistics is not a truth machine; it's a toolkit for making decisions under uncertainty. As the statistician George Box put it, "All models are wrong, but some are useful." The goal isn't certainty — it's calibrated, honest, well-defended belief. A bigger sample fixes noise, never bias: a huge biased survey is just confidently wrong.

Master this chapter and you've gained something rare: the ability to be appropriately skeptical without being cynical. You can hold "the evidence leans this way" and "I might be wrong, here's how much" in your head at the same time. That balance — updating your beliefs with new evidence instead of flipping between blind trust and blanket dismissal — is the whole point of data literacy.
