Probability, Uncertainty, Distributions, and Bayesian Thinking

By Pritesh Yadav

The world rarely hands us certainty. A weather forecast says "70% chance of rain." A doctor says a test is "95% accurate." A friend says their cousin made a fortune on a stock, so it must be a good bet. To live and decide well, you need a language for uncertainty — and that language is probability. This chapter builds that language from the ground up, then shows you the single most useful skill in all of statistics: how to update what you believe when new evidence shows up.

In the foundations chapter you met the idea of summarizing data — averages, spread, the shape of a pile of numbers. Here we go one step further. Instead of describing what already happened, we ask: how likely is something, and how should I change my mind as I learn more?

11.1 Distributions: the shape of a pile of data

Before probability, we need a picture. When you collect many measurements — heights, salaries, test scores, daily sales — they form a distribution.

Distribution
The pattern of how often each value (or range of values) shows up in your data.
Histogram
A bar chart of a distribution. You sort values into buckets and the bar height shows how many landed in each bucket.
Analogy: Pour a big bag of M&Ms onto a table and sort them into colored piles. The piles that are tallest tell you which colors are most common. A histogram is exactly that — piles, but for numbers.

The histogram is the foundational picture of statistics. Always look at it first, because the same average can hide wildly different shapes.
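To make the bucket idea concrete, here is a minimal sketch of a text histogram. The scores are hypothetical, made up for illustration, and `text_histogram` is a helper name invented for this example rather than a library function.

```python
from collections import Counter

def text_histogram(values, bucket_width):
    """Group values into equal-width buckets and return {bucket_start: count}."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical test scores, made up for illustration.
scores = [52, 58, 61, 64, 67, 68, 71, 72, 73, 75, 76, 78, 81, 84, 93]

buckets = text_histogram(scores, bucket_width=10)
for start, count in buckets.items():
    # One '#' per value that landed in the bucket.
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

Each row is one bucket and each `#` is one value, so the tallest row (the 70s here) is the most common range.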

The normal distribution (the bell curve)

The most famous shape is the normal distribution — symmetric, with one peak in the middle. Most values cluster near the center; very few sit far out at the edges.

Analogy: The heights of adult men. Most men are near the average height; a few are unusually tall and a few unusually short, and the further from average you go, the rarer it gets — equally on both sides.
        Normal distribution (bell curve)

 count
   |              .-"""-.
   |            .'       '.
   |           /           \
   |          /             \
   |        .'               '.
   |   __.-'                   '-.__
   +------------------------------------ value
        -3SD -2SD -SD  mean +SD +2SD +3SD

The bell curve has a beautiful, reliable rule called the 68–95–99.7 rule. "SD" means standard deviation — the typical distance of a value from the average (covered in the foundations chapter).

  • About 68% of values fall within 1 SD of the mean.
  • About 95% fall within 2 SD.
  • About 99.7% fall within 3 SD.
Example: Suppose a standardized test has an average score of 500 and an SD of 100. Then about 68% of test-takers score between 400 and 600, about 95% score between 300 and 700, and almost everyone (99.7%) scores between 200 and 800. A score of 750 is rare — it sits 2.5 SD above average.
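The rule can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist` with the mean and SD from the example. It assumes the scores are exactly normal, which real test scores only approximate.

```python
from statistics import NormalDist

# Test scores from the example: mean 500, SD 100 (assumed exactly normal).
scores = NormalDist(mu=500, sigma=100)

def share_within(k):
    """Fraction of the distribution within k SDs of the mean."""
    return scores.cdf(500 + k * 100) - scores.cdf(500 - k * 100)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")

# Share of test-takers scoring above 750, which is 2.5 SD above the mean.
above_750 = 1 - scores.cdf(750)
print(f"above 750: {above_750:.2%}")
```

The three shares come out close to 68%, 95%, and 99.7%, and well under 1% of test-takers score above 750.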

Skew: when the bell tips over

Skew is lopsidedness. A distribution is right-skewed when it has a long tail stretching to the right, and left-skewed when the tail stretches left.

Shape               What it looks like                      Real-world example
------------------  --------------------------------------  -----------------------------------------
Right-skewed        Long tail to the right; mean > median   Income, wealth, wait times, house prices
Left-skewed         Long tail to the left; mean < median    An easy exam where most people score high
Symmetric (normal)  Balanced both sides; mean ≈ median      Adult heights, measurement errors
Bimodal             Two separate peaks                      Two hidden groups mixed together
Analogy (the one to remember): Bill Gates walks into a small bar. The instant he steps in, the average (mean) net worth of everyone in the bar shoots into the billions — but the median (the middle person) barely moves. That single billionaire is the long right tail. This is why news about income or wealth should report the median, not the mean: the mean gets dragged around by extremes.
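The bar story is easy to reproduce. In this sketch every number is hypothetical: ten invented net worths, plus a round 100 billion standing in for the billionaire.

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of ten bar patrons, made up for illustration.
patrons = [20_000, 35_000, 40_000, 55_000, 60_000,
           75_000, 90_000, 110_000, 150_000, 200_000]

# Hypothetical round figure standing in for a billionaire's net worth.
billionaire = 100_000_000_000

before = (mean(patrons), median(patrons))
after = (mean(patrons + [billionaire]), median(patrons + [billionaire]))

print(f"before: mean = {before[0]:,.0f}, median = {before[1]:,.0f}")
print(f"after:  mean = {after[0]:,.0f}, median = {after[1]:,.0f}")
```

With these made-up figures the median moves from 67,500 to 75,000, while the mean jumps from 83,500 to roughly 9 billion.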

A bimodal distribution — two peaks — is a useful warning sign. It often means two different groups got mixed together (for example, the heights of men and women combined). When you see two humps, your instinct should be: "I should probably split this data into two groups."

Key takeaway: Always plot your data before trusting a summary number. The shape (normal, skewed, bimodal) tells you which average to use and whether a single number is even honest.

11.2 Probability: a number for "how likely"

Probability is a number between 0 and 1 that measures how likely an event is. 0 means impossible, 1 means certain, 0.5 means "as likely as not."

Analogy: A fair coin flip is 0.5 — there's no reason to favor heads over tails. A fair six-sided die landing on a 4 is 1/6 ≈ 0.17.

Three rules cover most everyday probability:

  1. All outcomes sum to 1. If there's a 30% chance of rain, there's a 70% chance of no rain. The possibilities must add up to 100%.
  2. "Either/or" events: add. The chance of rolling a 1 or a 2 on a die is 1/6 + 1/6 = 2/6 (when the outcomes can't both happen at once).
  3. "Both" independent events: multiply. The chance of flipping a coin twice and getting heads both times is 0.5 × 0.5 = 0.25.
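The three rules can be written out with exact fractions. This is a small sketch using Python's standard `fractions` module and the same numbers as the list above.

```python
from fractions import Fraction

# Rule 1: outcomes sum to 1, so P(no rain) = 1 - P(rain).
p_rain = Fraction(3, 10)
p_no_rain = 1 - p_rain

# Rule 2: add mutually exclusive outcomes. P(roll a 1 or a 2) on a fair die.
p_one_or_two = Fraction(1, 6) + Fraction(1, 6)

# Rule 3: multiply independent events. P(heads, then heads) with a fair coin.
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

print(p_no_rain, p_one_or_two, p_two_heads)  # 7/10 1/3 1/4
```

`Fraction` reduces 2/6 to 1/3 automatically, which is the same probability.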

Independence and the Gambler's Fallacy

Independent events
Two events are independent if one happening doesn't change the odds of the other.
Analogy: A coin has no memory. If you flip 5 heads in a row, the next flip is still exactly 50/50. The coin doesn't know it "owes" you a tails.
Common mistake — the Gambler's Fallacy: Believing you're "due" for a win after a streak. After five reds at the roulette wheel, black is not more likely on the next spin. Each spin is independent. Casinos profit handsomely from people who think otherwise.
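A quick simulation makes the point about streaks. This sketch simulates a fair coin with Python's pseudo-random generator (an idealized coin, not a real one) and looks at the flip that follows every run of five heads.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

flips = [random.random() < 0.5 for _ in range(500_000)]  # True = heads

# Collect the flip that follows every run of five heads in a row.
next_after_streak = [
    flips[i] for i in range(5, len(flips)) if all(flips[i - 5:i])
]

share_heads = sum(next_after_streak) / len(next_after_streak)
print(f"{len(next_after_streak)} streaks; heads on the next flip: {share_heads:.3f}")
```

The share of heads after a streak stays close to 0.5, so the streak carries no information about the next flip.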

Conditional probability: when knowing one thing changes the odds

Not all events are independent. Often, knowing that one thing happened changes how likely something else is. This is conditional probability, written P(A|B) — read aloud as "the probability of A given B."

Analogy: On any random day, the chance it's raining might be low. But the chance it's raining given that you just saw a dozen people walking by with open umbrellas is much higher. The condition — seeing umbrellas — changes the odds.
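In counts, a conditional probability is just "how often both happened, out of the times the condition happened." The sketch below uses hypothetical day counts, made up for illustration.

```python
# Hypothetical counts over 1,000 days, made up for illustration.
days = 1000
umbrella_days = 120            # days you saw lots of open umbrellas
rainy_days = 150               # days it rained
rainy_and_umbrella_days = 108  # days with both

p_rain = rainy_days / days

# P(rain | umbrellas) = P(rain and umbrellas) / P(umbrellas)
p_rain_given_umbrellas = (rainy_and_umbrella_days / days) / (umbrella_days / days)

print(f"P(rain) = {p_rain:.2f}")
print(f"P(rain | umbrellas) = {p_rain_given_umbrellas:.2f}")
```

With these invented counts, rain goes from a 15% chance on a random day to a 90% chance on a day when umbrellas are out.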

Conditional probability is the doorway to the most important idea in this chapter.

11.3 Bayes' theorem: updating beliefs with evidence

This is the conceptual heart of the whole subject, so we'll go slowly. Bayes' theorem is simply a rule for updating a belief when new evidence arrives. Named after Thomas Bayes and developed by Pierre-Simon Laplace, it answers: "Given what I knew before, and given this new clue, what should I believe now?"

Plain framing: Your new belief = your old belief, adjusted by how strong the new evidence is. You don't throw away what you knew; you nudge it.
Prior
Your belief before seeing the new evidence — often just the base rate, "how common is this thing in general?"
Posterior
Your updated belief after folding in the evidence.

The rare-disease test (work through this carefully)

This is the single example that makes Bayesian thinking click. It catches almost everyone — including most doctors.

Example: A disease affects 1 in 1,000 people. There's a test for it. The test has a 5% false-positive rate, meaning that if you're healthy there's a 5% chance it wrongly says "positive." Assume, to keep the arithmetic simple, that it never misses a real case. You take the test. It comes back positive. What's the chance you actually have the disease?

Most people answer "about 95%." The real answer is around 2%. Here's why — and the trick is to think in terms of actual people, not percentages.

Imagine 100,000 people get tested.

  100,000 people tested
  |
  |-- 100 actually SICK  (1 in 1,000)
  |     `-- ~100 test positive  (true positives)
  |
  `-- 99,900 actually HEALTHY
        `-- 5% test positive
            = ~4,995 false positives

  Total positives = 100 + 4,995 = 5,095
  Truly sick among positives = 100 / 5,095 ≈ 2%

The disease is so rare that the tiny sick group produces only about 100 true positives, while the enormous healthy group — even with just a 5% error rate — produces nearly 5,000 false alarms. So a positive result is far more likely to be a false alarm than a real case. Your posterior belief (~2%) is much lower than the test's "95% accuracy" suggested, because your prior (1 in 1,000) was very low.
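The head count above is Bayes' theorem in disguise. Here is a minimal sketch of the same calculation as a function. The name `posterior` and its parameters are invented for this example, and the sensitivity of 1.0 mirrors the tree's simplifying assumption that every sick person tests positive.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(sick | positive test) by Bayes' theorem."""
    true_pos = prior * sensitivity                  # sick and flagged
    false_pos = (1 - prior) * false_positive_rate   # healthy but flagged
    return true_pos / (true_pos + false_pos)

# Numbers from the example; sensitivity = 1.0 is a simplifying assumption.
p = posterior(prior=1 / 1000, sensitivity=1.0, false_positive_rate=0.05)
print(f"P(sick | positive) = {p:.1%}")

# The same test where the condition is common (a hypothetical 1-in-5 base rate).
p_common = posterior(prior=0.2, sensitivity=1.0, false_positive_rate=0.05)
print(f"with a 20% base rate: {p_common:.1%}")
```

The first call reproduces the roughly 2% from the tree. The second shows the same test becoming far more convincing (about 83%) once the base rate is high, which is the whole point: the prior matters as much as the test.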

Key takeaway: A positive test does not mean "you probably have it." It depends massively on how common the disease is to begin with — the base rate. Bayesian thinking means always asking "how common was this before the evidence arrived?"

This isn't just about medicine. It's how a detective weighs a clue, how a spam filter judges an email, and how you should react to a single dramatic headline. Even strong evidence, when it points to something that was very unlikely to begin with, still leaves you fairly uncertain.

Best practice: When you get a surprising result, update your belief in proportion to the evidence — don't lurch from "no way" to "definitely." Start from the base rate, then move. This habit alone prevents a huge number of bad decisions.

11.4 From sample back to population

We almost never measure everyone. We measure a sample — a subset — and use it to talk about the whole population — everyone we care about.

Analogy: To judge a whole pot of soup, you taste one spoonful. But this only works if the soup is well stirred. A badly stirred pot — all the salt sunk to the bottom — will fool your spoonful completely.

"Well stirred" means random sampling: every member of the population has a known chance of being picked, so the sample mirrors the whole.

When the spoonful lies: sampling bias

Sampling bias
When your method of choosing the sample systematically over- or under-represents some group.
Example (the 1936 Literary Digest poll): The magazine mailed millions of surveys and confidently predicted Alf Landon would beat Franklin Roosevelt. Roosevelt won in a landslide. The problem: they drew names largely from telephone directories and car-registration lists, and in 1936 owning a phone or a car tended to mean you were better off than average. Their giant sample systematically under-represented poorer voters, who favored Roosevelt. A huge sample didn't save them, because it was biased.
Common mistake: Believing a bigger sample is automatically a better one. A bigger sample reduces random noise, but it does not fix bias. A huge biased sample is still wrong — just confidently wrong.
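A small simulation shows why size cannot rescue a biased sample. The electorate below is hypothetical (the shares are made up, not the real 1936 figures), and `draw_voter` and `poll` are helper names invented for this sketch.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical electorate, made up for illustration (not the real 1936 figures):
# 25% of voters appear on phone/car lists and back Roosevelt at 40%;
# the other 75% back him at 68%. True support = 0.25*0.40 + 0.75*0.68 = 61%.
P_LISTED, SUPPORT_LISTED, SUPPORT_UNLISTED = 0.25, 0.40, 0.68

def draw_voter(listed_only):
    """Return True if a sampled voter supports Roosevelt."""
    listed = True if listed_only else random.random() < P_LISTED
    support = SUPPORT_LISTED if listed else SUPPORT_UNLISTED
    return random.random() < support

def poll(n, listed_only):
    """Share of n sampled voters who support Roosevelt."""
    return sum(draw_voter(listed_only) for _ in range(n)) / n

big_biased = poll(200_000, listed_only=True)    # huge, but only listed voters
small_random = poll(1_000, listed_only=False)   # small, but everyone eligible

print("true support:             61.0%")
print(f"biased poll (n=200,000):  {big_biased:.1%}")
print(f"random poll (n=1,000):    {small_random:.1%}")
```

The huge poll lands very precisely on the wrong answer (about 40%), while the small random poll lands within a few points of the truth.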

Survivorship bias: counting only the survivors

Example — Abraham Wald and the WWII planes: During World War II, the military studied bombers returning from missions and saw bullet holes concentrated on the wings and tail. The instinct was to add armor there. The statistician Abraham Wald pointed out the opposite: armor the engines and cockpit — the places with few holes. Why? Because the planes hit there never made it back to be counted. The data only included survivors.

Survivorship bias is everywhere: studying only successful founders to learn "the habits of success" ignores the identical-habit founders who failed. The losers aren't in the room.

The sampling distribution and the Central Limit Theorem

Here's a subtle but powerful idea. If you took many different samples and computed the average of each, those averages would themselves form a distribution — the sampling distribution.

Analogy: Imagine 1,000 different pollsters each survey 1,000 people. You'd get 1,000 slightly different averages. Lined up in a histogram, those 1,000 averages form their own bell-shaped pile.

This leads to one of the most remarkable results in statistics, the Central Limit Theorem (CLT):

Key takeaway (CLT): The average of a large enough sample of independent observations is approximately normally distributed (bell-shaped), even if the underlying data is not. Skewed, lumpy, weird data still produces bell-shaped averages, as long as its spread is finite.
Analogy: Roll one die and the result is flat — each face equally likely (a uniform distribution, not a bell). But roll ten dice and take the average, over and over, and a bell curve emerges. The averaging smooths out the lumpiness.
Common mistake: "n ≥ 30 always makes things normal." This is a rough rule of thumb, not a law. Mildly skewed data may need only a small sample, but heavily skewed data (like income) can need a much larger n before its averages look normal. Don't treat 30 as a magic number.
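The effect of averaging can be seen in a short simulation. This sketch draws from a right-skewed exponential distribution (chosen here purely for illustration) and uses "mean minus median" as a crude symmetry check. That gap is a rough indicator, not a formal test of normality.

```python
import random
from statistics import fmean, median

random.seed(2)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Average of n draws from a right-skewed (exponential) distribution with mean 1."""
    return fmean(random.expovariate(1.0) for _ in range(n))

def skew_gap(values):
    """Mean minus median: near zero for symmetric data, positive for right skew."""
    return fmean(values) - median(values)

raw = [random.expovariate(1.0) for _ in range(5_000)]
means_of_5 = [sample_mean(5) for _ in range(5_000)]
means_of_100 = [sample_mean(100) for _ in range(5_000)]

gaps = {
    "raw data": skew_gap(raw),
    "means, n=5": skew_gap(means_of_5),
    "means, n=100": skew_gap(means_of_100),
}
for label, gap in gaps.items():
    print(f"{label:<13} mean - median = {gap:.3f}")
```

The gap is large for the raw data, smaller for averages of 5, and close to zero for averages of 100: the averages become more symmetric as n grows, even though the underlying data stays skewed.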

Standard deviation vs standard error — don't confuse them

These two are mixed up constantly, even by professionals.

Standard deviation (SD)
The spread of the individual data points — how far typical people sit from the average.
Standard error (SE)
The spread of the sample averages — how much your estimate of the average wobbles from sample to sample.

SD describes the spread of people; SE describes the spread of averages-of-people. Crucially, SE shrinks as your sample grows (roughly SD divided by the square root of n). Bigger samples give steadier, more trustworthy estimates of the average — even though the underlying people are just as varied as ever.
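The difference shows up clearly when the sample size changes. The sketch below simulates heights from a hypothetical population (mean 170 cm, SD 10 cm, made up for illustration) and computes both numbers at three sample sizes.

```python
import math
import random
from statistics import stdev

random.seed(3)  # fixed seed so the run is repeatable

def sample_heights(n):
    """Draw n heights (cm) from a hypothetical population: mean 170, SD 10."""
    return [random.gauss(170, 10) for _ in range(n)]

results = {}
for n in (25, 400, 10_000):
    heights = sample_heights(n)
    sd = stdev(heights)        # spread of individuals: stays near 10
    se = sd / math.sqrt(n)     # wobble of the sample mean: shrinks with n
    results[n] = (sd, se)
    print(f"n={n:>6}: SD = {sd:5.2f}, SE = {se:5.2f}")
```

SD hovers around 10 at every sample size, while SE falls from about 2 to about 0.1 as n goes from 25 to 10,000.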

11.5 Inference: estimates, confidence, and significance

Once you accept that a sample is a noisy peek at a population, you face two jobs: estimate a quantity (with a range around it), or decide whether an effect is real.

Confidence intervals: report a range, not a point

Confidence interval (CI)
A range built by a method that captures the true value a stated percentage of the time (e.g., 95%).
Analogy: Instead of saying "the true average is exactly 502," you say "the true average is somewhere in this lane — and our lane-drawing method is right 95% of the time." The width of the lane is your honesty about uncertainty.
Common mistake: Reading a 95% CI as "there's a 95% probability the true value is inside this specific interval." That's not what it means. The 95% is a property of the method across many samples: if you repeated the whole process many times, about 95% of the intervals you'd draw would contain the truth. Any single interval either contains it or doesn't.
Best practice: Always report a range, not just a point estimate. "Average satisfaction is 7.2 (95% CI: 6.8–7.6)" is far more honest than a bare "7.2." Modern statistical guidance favors confidence intervals over bare p-values.
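The "property of the method" reading can be checked by simulation. In this sketch the true mean is known only because we generate the data ourselves. The population values (mean 7.2, SD 2.0) are hypothetical, and the interval uses the common normal approximation of estimate ± 1.96 standard errors.

```python
import math
import random
from statistics import fmean, stdev

random.seed(4)  # fixed seed so the run is repeatable

TRUE_MEAN = 7.2   # hypothetical true average, known only because we simulate
Z_95 = 1.96       # normal-approximation multiplier for a 95% interval

def confidence_interval(sample):
    """Approximate 95% CI for the mean: estimate +/- 1.96 standard errors."""
    centre = fmean(sample)
    se = stdev(sample) / math.sqrt(len(sample))
    return centre - Z_95 * se, centre + Z_95 * se

trials = 2_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(100)]
    low, high = confidence_interval(sample)
    hits += low <= TRUE_MEAN <= high   # True counts as 1

coverage = hits / trials
print(f"{coverage:.1%} of the {trials} intervals contained the true mean")
```

Roughly 95% of the intervals capture the true mean. Each individual interval either contains it or does not; the 95% describes the long-run success rate of the procedure.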

Hypothesis testing and the p-value

The other job is deciding whether an effect is real. The standard machinery starts with a null hypothesis.

Null hypothesis
The default assumption that there is "no effect" or "no difference."
Analogy: A courtroom presumes the defendant innocent (the null). You don't convict unless the evidence is strong enough to overturn that presumption. In statistics, you don't claim an effect is real until the data is surprising enough under "no effect."

How surprising? That's the p-value.

p-value
The probability of seeing data at least this extreme if the null hypothesis were true.
Analogy: If a coin were truly fair, how shocking is the streak I just saw? If you flip 9 heads in a row, a fair coin would do that only about 0.2% of the time — a tiny p-value. That makes you doubt the coin is fair.
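For coin flips the p-value can be computed exactly from the binomial distribution. This is a minimal sketch of a one-sided calculation; `p_at_least` is a helper name invented for this example.

```python
from math import comb

def p_at_least(heads, flips, p_heads=0.5):
    """One-sided p-value: chance of `heads` or more in `flips` tosses,
    assuming the null hypothesis that P(heads) = p_heads."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

print(f"9 heads in 9 flips:   p = {p_at_least(9, 9):.4f}")    # 1/512
print(f"8+ heads in 10 flips: p = {p_at_least(8, 10):.4f}")   # 56/1024
```

Nine heads in a row gives 1/512, about 0.2%. Eight or more heads in ten flips gives 56/1024, about 5.5%, which is much less surprising for a fair coin.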

The p-value is one of the most misunderstood numbers in all of science. The American Statistical Association issued a formal statement in 2016 to correct the abuse. Here is what a p-value is not:

Common mistake (ASA-verified): A p-value is NOT the probability that the hypothesis is true. It is NOT the probability that your result was "due to chance." It is NOT a measure of how big or important the effect is. And p > 0.05 does NOT prove there's no effect — absence of evidence is not evidence of absence.

Statistical significance — and its limits

Statistical significance means the p-value fell below a pre-chosen threshold, called alpha (α), commonly 0.05. Below it, we "reject the null."

Analogy: Think of α as the sensitivity dial on a smoke alarm. Set it too sensitive and you get false alarms from burnt toast (a false positive, or Type I error). Set it too dull and you miss a real fire (a false negative, or Type II error). The 0.05 threshold is an arbitrary convention popularized by Ronald Fisher — not a law of nature.
                 Effect is really there          No real effect
---------------  ------------------------------  -----------------------------
Test says "yes"  Correct (good)                  Type I error (false positive)
Test says "no"   Type II error (false negative)  Correct (good)

Power is a test's ability to detect a real effect when there is one. Larger samples generally give more power — they make a real effect easier to spot above the noise.
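Error rates and power can be estimated by running many simulated studies. The sketch below assumes a simple two-sided z-test with a known SD of 1, and the effect of 0.3 and the sample sizes are hypothetical, chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(5)  # fixed seed so the run is repeatable
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)   # about 1.96 for a two-sided test

def rejects_null(true_effect, n):
    """Simulate one study (known SD = 1) and report whether it rejects 'effect = 0'."""
    sample = [random.gauss(true_effect, 1.0) for _ in range(n)]
    z = fmean(sample) * math.sqrt(n)   # sample mean divided by its standard error
    return abs(z) > Z_CRIT

def rejection_rate(true_effect, n, trials=2_000):
    """Share of simulated studies that reject the null."""
    return sum(rejects_null(true_effect, n) for _ in range(trials)) / trials

type_i_rate = rejection_rate(true_effect=0.0, n=50)    # false alarms when nothing is there
power_small = rejection_rate(true_effect=0.3, n=20)    # hypothetical effect, small study
power_large = rejection_rate(true_effect=0.3, n=200)   # same effect, bigger study

print(f"Type I error rate (no effect): {type_i_rate:.1%}")
print(f"power with n=20:  {power_small:.1%}")
print(f"power with n=200: {power_large:.1%}")
```

With no real effect, the test raises a false alarm about 5% of the time, matching α. With a real effect, the small study misses it most of the time while the larger study almost always detects it.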

Common mistake (confusing statistical and practical significance): A result can be statistically significant yet trivially small. Imagine a diet pill tested on a million people: it makes you lose 0.1 kg with p < 0.001. The effect is "real" and "significant," and utterly useless. Significance tells you the data would be surprising if the effect were exactly zero; it does not tell you the effect is big enough to care about.
Effect size
How big the effect is — separate from whether it's statistically significant.
Best practice: Lead with the effect size and a confidence interval — "how big is it, and does it matter?" — then mention significance. State your hypothesis and sample size before you look at the data, so you can't fish around for a flattering result.
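The diet-pill example can be worked through in a few lines. One number is not in the text and is assumed here: the SD of individual weight change, taken to be 5 kg. The sketch also assumes a simple one-sample z-test, which is a simplification of how a real trial would be analyzed.

```python
import math

# Figures from the diet-pill example, plus one assumption: the SD of individual
# weight change is taken to be 5 kg (hypothetical; the text does not give it).
n = 1_000_000
mean_loss_kg = 0.1
sd_kg = 5.0

se = sd_kg / math.sqrt(n)                   # standard error of the mean
z = mean_loss_kg / se                       # how many SEs the effect is from zero
p_value = math.erfc(z / math.sqrt(2))       # two-sided p-value for a z-test
effect_size = mean_loss_kg / sd_kg          # standardized effect (loss in SD units)
ci = (mean_loss_kg - 1.96 * se, mean_loss_kg + 1.96 * se)

print(f"z = {z:.1f}, p < 0.001: {p_value < 0.001}")
print(f"effect size = {effect_size:.2f} SD")
print(f"95% CI for the loss: {ci[0]:.3f} to {ci[1]:.3f} kg")
```

Under these assumptions the p-value is astronomically small, yet the standardized effect is only 0.02 SD and the interval pins the loss at about a tenth of a kilogram. The huge sample makes the estimate precise; it does not make the effect large.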

11.6 Quick reference: the uncertainty toolkit

Concept              One-line meaning                             Watch out for
-------------------  -------------------------------------------  ---------------------------------------------
Normal distribution  Symmetric bell curve; 68–95–99.7 rule        Not all data is normal
Skew                 Lopsided data; right-skew → mean > median    Use median for skewed data
Probability          Likelihood from 0 to 1                       Add for "or," multiply for independent "and"
Independence         One event doesn't affect the other           Gambler's fallacy
Conditional P(A|B)   Odds of A given B happened                   The condition changes the odds
Bayes / prior        Update belief = old belief + evidence        Base-rate neglect
Sampling bias        Sample misrepresents the population          Size doesn't fix bias
CLT                  Sample averages tend toward a bell curve     "n≥30" is a heuristic, not a law
SD vs SE             Spread of people vs wobble of the average    Constantly confused
Confidence interval  Plausible range for the true value           It's a property of the method
p-value              How surprising the data is if null is true   Not the chance the hypothesis is true
Effect size          How big the effect is                        Significance ≠ importance

11.7 Putting it together

Uncertainty isn't a flaw in your knowledge to be ashamed of — it's a quantity to measure and reason with. The skilled thinker doesn't flip between "I'm certain" and "I have no idea." Instead they hold a belief with a confidence level, and they move that confidence up or down as evidence arrives.

Three habits carry most of the value of this chapter:

  1. Plot the data and look at its shape before trusting any single number. Skew and bimodality hide inside innocent-looking averages.
  2. Start from the base rate when judging any test, alarm, or surprising result. A scary positive against a rare condition is usually a false alarm — that's just Bayes' theorem at work.
  3. Report ranges, not points; effect sizes, not just significance. Treat statistics as a tool for deciding under uncertainty, never as a machine that prints "truth."
Key takeaway: Probability is the grammar of uncertainty, and Bayesian updating is its most useful sentence: your new belief is your old belief, adjusted by how strong the evidence is. Master that one move and most statistical confusion — false alarms, gambler's fallacies, overreactions to single results — quietly dissolves.

In the next chapter we turn from single variables and beliefs to relationships between variables: correlation, the famous trap of mistaking it for causation, and the fallacies and chart tricks that fool even careful readers.
