Reading the World in Numbers: Data Types, Averages, Spread, and Charts
Every day, the world hands you numbers. A weather app says "70% chance of rain." A news site reports the "average salary" in your city. A product page boasts "4.6 stars from 2,000 reviews." A chart on the news shows a line shooting up. These numbers shape what you believe and what you decide. But raw numbers, by themselves, do not speak. You have to learn to read them — and to spot when someone is using them to mislead you.
This chapter assumes you know nothing about statistics. We start from the very bottom: what data even is. Then we learn how to summarize a pile of numbers with a single value (the average), how to describe how "spread out" those numbers are, and finally how to read the most common kind of chart. By the end, you will look at a number in the news and instinctively ask the right questions.
10.1 What is data, anyway?
Let's start with the most basic word.
- Data
- Recorded observations about the world. Anything you write down, measure, or count is data — the ages of people in a room, the price of milk each week, the colors of cars in a parking lot.
- Variable
- A thing that varies from one observation to the next. Age is a variable because different people have different ages. Eye color is a variable. Yesterday's high temperature is a variable.
The easiest way to picture data is as a spreadsheet. Each row is one thing you observed (one person, one sale, one day). Each column is one variable (their age, their height, how they voted).
| | age | height_cm | favorite_color |
|---|---|---|---|
| Asha | 31 | 165 | blue |
| Ben | 47 | 180 | green |
| Carmen | 22 | 158 | blue |
| Dev | 39 | 172 | red |

Each column = one variable. Each row = one observation (one person).
The two big kinds of variables
Before you can do anything with a variable, you must know what type it is. This decides which tools you are allowed to use.
- Categorical variable
- A variable that puts each observation into a category, not a number. Examples: eye color (blue/brown/green), whether someone is a customer (yes/no), country, blood type. You cannot meaningfully add or average these — there is no "average eye color."
- Numeric variable
- A variable measured as a number, where the size of the number means something. Examples: age, weight, salary, temperature. You can add and average these.
Numeric variables split further into two kinds:
- Discrete
- Whole counts that can't be split. The number of children in a family (you can have 2 or 3, never 2.4). The number of cars sold.
- Continuous
- Any value on a scale, including fractions. Height (165.3 cm), weight, time, temperature. Between any two values there's always another possible value.
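To make the link between variable type and allowed tools concrete, here is a minimal Python sketch. It assumes the small table from this section is stored as a list of rows; the name `people` and the dictionary keys are illustrative choices, not part of the chapter.

```python
from statistics import mean

# Each dictionary is one row (one observation); each key is one variable.
people = [
    {"name": "Asha", "age": 31, "height_cm": 165, "favorite_color": "blue"},
    {"name": "Ben", "age": 47, "height_cm": 180, "favorite_color": "green"},
    {"name": "Carmen", "age": 22, "height_cm": 158, "favorite_color": "blue"},
    {"name": "Dev", "age": 39, "height_cm": 172, "favorite_color": "red"},
]

# Pull out one column (one variable) at a time.
ages = [row["age"] for row in people]
colors = [row["favorite_color"] for row in people]

# Numeric variable: adding and averaging is meaningful.
mean_age = mean(ages)

# Categorical variable: we can list or count categories, but not average them.
distinct_colors = sorted(set(colors))

print(f"mean age: {mean_age}")              # mean age: 34.75
print(f"colors seen: {distinct_colors}")    # colors seen: ['blue', 'green', 'red']
```

The numeric column supports arithmetic, while the categorical column only supports listing and counting.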
10.2 The whole pot vs. the spoonful: population and sample
Here is the deepest idea in all of statistics, and it's surprisingly simple. You almost never get to measure everything. So you measure a piece, and use the piece to talk about the whole.
- Population
- Everyone (or everything) you actually care about. "All adults in Canada." "Every product our factory makes this year."
- Sample
- The subset you actually measure. "The 1,000 Canadians we phoned." "The 50 products we pulled off the line to inspect."
Think of tasting soup: you do not drink the whole pot to judge it, you taste one spoonful. The pot is the population and the spoonful is the sample. This soup image will come back later. For now, hold onto one lesson hidden inside it: the spoonful only works if the soup is well stirred. If all the salt sank to the bottom and you taste from the top, your spoonful lies. A sample is only trustworthy if it fairly represents the population, a theme we'll return to.
We won't go deep into sampling in this foundations chapter. Just remember the two words and the soup. They are the frame for everything else.
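The "well stirred" point can be sketched in a few lines of Python. Everything here is a made-up illustration: the population is just the numbers 1 to 10,000 in sorted order, standing in for a pot where everything has settled, and the seed value is arbitrary.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 values, sorted from small to large,
# like a pot of soup that has not been stirred.
population = list(range(1, 10_001))

rng = random.Random(42)  # fixed seed so the run is repeatable

stirred_sample = rng.sample(population, 1000)   # a random spoonful
unstirred_sample = population[:1000]            # a spoonful from the top only

population_mean = mean(population)
stirred_error = abs(mean(stirred_sample) - population_mean)
unstirred_error = abs(mean(unstirred_sample) - population_mean)

print(f"population mean:           {population_mean}")
print(f"random sample is off by:   {stirred_error:.1f}")
print(f"top-only sample is off by: {unstirred_error:.1f}")
```

Both samples have the same size. Only the random one lands near the true mean, because size cannot rescue a sample drawn from one corner of the pot.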
10.3 Summarizing the middle: mean, median, and mode
Suppose you have a column of 1,000 salaries. Nobody can hold 1,000 numbers in their head. We want one number that captures "the typical value." This is called a measure of central tendency — a fancy phrase for "where's the middle?" There are three, and choosing the right one matters enormously.
The mean (the everyday "average")
- Mean
- Add up all the values, then divide by how many there are. This is what most people mean when they say "average."
If five friends earn $30k, $35k, $40k, $45k, and $50k, the mean is (30+35+40+45+50) ÷ 5 = 200 ÷ 5 = $40k.
The mean has one serious weakness: it gets dragged by extremes. One very large or very small value can yank it far from where most of the data actually sits.
The median (the true middle)
- Median
- Sort all the values from smallest to largest, then take the one in the exact middle. If there is an even number of values, take the mean of the two middle ones. Half the values are below it, half above.
For our five salaries sorted (30, 35, 40, 45, 50), the middle one is $40k. Here mean and median agree. But watch what happens next.
Why mean ≠ median is the most important lesson here
Keep our five friends earning $30k–$50k. Now Bill Gates walks into the room. Say his income that year is $5 billion.
| | The 5 friends | + Bill Gates (6 people) |
|---|---|---|
| Mean | $40k | ~$833 million |
| Median | $40k | ~$42.5k |
The mean now says the "average" person in the room is a multi-millionaire — which is absurd. Five of the six people are nowhere near it. The median barely budged, to about $42.5k, and still honestly describes a typical person in the room.
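The numbers in the table can be checked with Python's built-in `statistics` module. This is only a sketch of the arithmetic; the variable names are illustrative, and the $5 billion figure is the chapter's own assumption.

```python
from statistics import mean, median

friends = [30_000, 35_000, 40_000, 45_000, 50_000]
with_gates = friends + [5_000_000_000]  # the assumed $5 billion income

print(f"5 friends:  mean = {mean(friends):,.0f}, median = {median(friends):,.0f}")
print(f"with Gates: mean = {mean(with_gates):,.0f}, median = {median(with_gates):,.0f}")
# 5 friends:  mean = 40,000, median = 40,000
# with Gates: mean = 833,366,667, median = 42,500
```

With six values there is no single middle one, so the median is the mean of the two middle values, $40k and $45k.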
The mode (the most common)
- Mode
- The value that appears most often.
The mode is the only average that works for categorical data. There's no mean or median eye color, but there can be a most common one. If a shop's customers are mostly paying with cards, "card" is the mode of the payment-method variable.
| Measure | What it is | Best for | Weakness |
|---|---|---|---|
| Mean | Add up, divide | Roughly symmetric numeric data | Dragged by outliers |
| Median | Middle when sorted | Skewed data (income, prices) | Ignores exact size of extremes |
| Mode | Most frequent | Categories; spotting the common case | Useless for "the middle" of numbers |
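For categorical data, finding the mode is just counting. The sketch below uses a made-up list of payment methods for one shop; the data and names are hypothetical.

```python
from collections import Counter
from statistics import mode

# Hypothetical payment methods recorded for seven customers.
payments = ["card", "cash", "card", "card", "phone", "cash", "card"]

counts = Counter(payments)        # how often each category appears
most_common = mode(payments)      # the category with the highest count

print(counts.most_common())  # [('card', 4), ('cash', 2), ('phone', 1)]
print(most_common)           # card
```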
10.4 How spread out is the data? Range, variance, and standard deviation
Two groups can have the exact same average yet be totally different. Consider two classrooms that both average 70% on a test:
- Class A: everyone scored between 68% and 72%. Calm and uniform.
- Class B: half scored 95%, half scored 45%. Wildly split.
Same average, completely different reality. The average alone hides this. To see it, we need a measure of spread — how much the values differ from each other.
Range — the crude first attempt
- Range
- The largest value minus the smallest. The simplest possible spread.
If test scores run from 45% to 95%, the range is 50 points. Easy. But the range is fragile: it depends entirely on two values. One freak outlier and the range explodes, even if every other point is tightly clustered.
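A short sketch shows how fragile the range is. The scores are invented for illustration, and `value_range` is a helper name chosen here, not a standard function.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

scores = [68, 69, 70, 70, 71, 72]     # tightly clustered, made-up scores
scores_with_outlier = scores + [5]    # one freak low score added

print(value_range(scores))               # 4
print(value_range(scores_with_outlier))  # 67
```

Six of the seven scores did not change, yet the range grew from 4 to 67.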
Variance — the average squared distance
We want a measure that uses all the data, not just the two ends. The idea: for each value, measure how far it sits from the mean. Then average those distances. There's one twist — we square each distance first (multiply it by itself).
- Variance
- The average of the squared distances of each value from the mean.
Why square? Two reasons. First, squaring makes every distance positive (a value 5 below the mean and a value 5 above both contribute 25), so they don't cancel out. Second, squaring punishes far misses extra hard — being 10 away counts as 100, while being 2 away counts as only 4.
Variance has one annoying flaw: its units are squared. If your data is in dollars, the variance is in "dollars squared," which means nothing to a human. That awkwardness leads us to the fix.
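The recipe above can be followed step by step in Python. This sketch uses a small made-up list and divides by the number of values, as in the definition here; Python's `statistics` module calls that version `pvariance` (population variance). A closely related version for samples, `statistics.variance`, divides by one less than the count.

```python
from statistics import pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

# Step 1: find the mean.
mean_value = sum(values) / len(values)                        # 5.0

# Step 2: distance of each value from the mean, squared.
squared_distances = [(v - mean_value) ** 2 for v in values]   # 9, 1, 1, 1, 0, 0, 4, 16

# Step 3: average the squared distances.
variance = sum(squared_distances) / len(values)               # 4.0

print(variance)            # 4.0
print(pvariance(values))   # same result from the standard library
```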
Standard deviation — spread in normal units
- Standard deviation (SD)
- The square root of the variance. Taking the square root undoes the squaring, so the result is back in the original units. It answers: "roughly how far from the mean does a typical value land?"
The SD is the workhorse measure of spread. A small SD means values huddle close to the mean (Class A above). A large SD means they're scattered far and wide (Class B). Same mean, very different SD.
Class A (small SD)            Class B (large SD)
    mean = 70                     mean = 70
       ###                    #             #
      #####                   #             #
     #######                  #             #
66  68  70  72  74            45    ...    95
  tight cluster               two far-apart clumps
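Here is a rough numerical version of the two classrooms. The individual scores are invented to match the descriptions of Class A and Class B, and `pstdev` is the standard-library function for the square root of the population variance.

```python
from statistics import mean, pstdev

class_a = [68, 69, 70, 70, 71, 72]   # everyone between 68% and 72%
class_b = [95, 95, 95, 45, 45, 45]   # half at 95%, half at 45%

print(f"Class A: mean = {mean(class_a)}, SD = {pstdev(class_a):.2f}")
print(f"Class B: mean = {mean(class_b)}, SD = {pstdev(class_b):.2f}")
# Class A: mean = 70, SD = 1.29
# Class B: mean = 70, SD = 25.00
```

The means are identical, but the SD of Class B is roughly twenty times larger, which is the difference the average alone was hiding.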
10.5 Slicing the data into hundredths: percentiles, quartiles, and IQR
Another way to describe data without being fooled by extremes is to talk about position.
- Percentile
- The k-th percentile is the value below which k% of the data falls.
The median you already met is simply the 50th percentile — half below, half above. Three percentiles are used so often they get their own names, the quartiles (because they cut the sorted data into four equal quarters):
- Q1 (25th percentile): a quarter of the data is below this.
- Q2 (50th percentile): the median.
- Q3 (75th percentile): three-quarters of the data is below this.
- Interquartile range (IQR)
- Q3 minus Q1 — the spread of the middle 50% of the data.
The IQR is a robust spread measure: because it throws away the top 25% and bottom 25%, a single billionaire or typo can't blow it up the way it blows up the range. The IQR is the backbone of a chart called the box plot, which draws a box from Q1 to Q3 with a line at the median.
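The robustness claim can be tested with a small sketch. It uses `statistics.quantiles` (available in Python 3.8 and later) with `method="inclusive"`; other quartile conventions exist and can give slightly different cut points, so treat the exact values as one reasonable choice. The data is made up.

```python
from statistics import quantiles

def spread_summary(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_typo = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # the 9 was mistyped as 900

print(spread_summary(clean))      # (8, 4.0)
print(spread_summary(with_typo))  # (899, 4.0)
```

The typo multiplies the range by more than a hundred, while the IQR does not move at all.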
10.6 Outliers: the value that doesn't belong
- Outlier
- A data point that sits far away from all the others.
Beginners often want to just delete outliers to "clean up" the data. Resist. An outlier is a question, not a nuisance.
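One way to turn "far away from the others" into something checkable is a convention this chapter does not itself state: flag any value more than 1.5 × IQR below Q1 or above Q3. The sketch below assumes that convention, and the function name, threshold, and data are all illustrative. It returns the suspicious values for investigation and deletes nothing.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    The k = 1.5 threshold is a common convention, assumed here for illustration.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

daily_sales = [12, 14, 15, 15, 16, 17, 18, 19, 95]   # made-up data
suspects = flag_outliers(daily_sales)

print(suspects)  # [95]
```

Whether that 95 is a typo, a holiday rush, or a real change in the business is the question the outlier is asking.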
10.7 The shape of data: distributions and the bell curve
So far we've crushed data into single summary numbers. But the richest picture comes from seeing the whole shape. That's what a distribution is.
- Distribution
- The pattern of how often each value (or range of values) occurs in your data.
- Histogram
- The basic picture of a distribution: a bar chart where each bar's height shows how many observations fall into that value-range.
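A histogram is simple enough to build by hand. The sketch below groups made-up ages into bins ten years wide and prints one row of `#` marks per bin; the function name and the bin width are choices made for this example.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each bin; keys are the bin start values."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 44, 47, 58]   # made-up data
bins = bin_counts(ages, bin_width=10)

for start, count in bins.items():
    print(f"{start}-{start + 9}: {'#' * count}")
# 20-29: ###
# 30-39: #####
# 40-49: ###
# 50-59: #
```

Changing `bin_width` changes the picture, so it is worth trying a couple of widths before trusting the shape.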
The normal distribution (the bell curve)
One shape appears so often in nature that it has a special name.
- Normal distribution
- A symmetric, single-peaked distribution where most values cluster near the middle and fewer and fewer appear as you move toward the extremes on either side. Its shape looks like a bell, so it's nicknamed the "bell curve."
         .-=#####=-.
      .-############-.               The bell curve:
     .################.              most values near the
   .####################.            middle, few at the edges.
 .########################.
-----|--------|--------|-----
   -2 SD    mean     +2 SD
The bell curve has a beautiful, reliable property called the 68–95–99.7 rule. For data that's roughly normal:
- About 68% of values fall within 1 standard deviation of the mean.
- About 95% fall within 2 standard deviations.
- About 99.7% fall within 3 standard deviations.
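The three percentages can be recomputed from an ideal normal distribution using `statistics.NormalDist` (Python 3.8 and later). This is a check on the theoretical curve, not on any real dataset; `share_within` is a helper name invented here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # mean 0, standard deviation 1

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```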
Skew: when the bell tips over
Not all data is symmetric. Often one side has a long tail.
- Skew
- The lopsidedness of a distribution. A long tail to the right is right-skewed (a.k.a. positive skew); a long tail to the left is left-skewed.
Right-skewed data is everywhere: income, wealth, house prices, wait times, page views. Most values are modest, but a few enormous ones stretch a long tail to the right.
Here's the rule that ties skew back to averages: in right-skewed data, the long tail drags the mean up past the median. So mean > median is a tell-tale sign of right skew. (Left-skewed flips it: mean < median. A left-skew example is an easy exam where almost everyone scores high and a few low scores trail off to the left.)
Right-skewed (income):
  ####
  ######
  ########_
  ##########_____
  ###################_________
       |        |
    median     mean   <- long tail pulls mean rightward
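The mean-versus-median rule for skew can be checked on two small made-up datasets: incomes with a long right tail, and scores on an easy exam with a long left tail. The numbers are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical incomes in thousands: mostly modest, a few very large.
incomes = [28, 30, 32, 35, 38, 40, 45, 60, 120, 400]

# Hypothetical easy-exam scores: mostly high, a few low stragglers.
exam_scores = [35, 55, 80, 85, 88, 90, 92, 94, 95, 96]

print(f"incomes:     mean = {mean(incomes)}, median = {median(incomes)}")
print(f"exam scores: mean = {mean(exam_scores)}, median = {median(exam_scores)}")
# incomes:     mean = 82.8, median = 39.0
# exam scores: mean = 81, median = 89.0
```

In the income data the tail pulls the mean above the median; in the exam data it pulls the mean below.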
One more shape worth a name: a bimodal distribution has two peaks. That's usually a clue that two different groups got mixed together. For example, plotting "time to finish a task" might show two humps because beginners and experts are blended in one dataset. Two peaks mean: split the data and look at each group separately.
10.8 Reading charts without getting fooled
A chart turns numbers into a picture, and our brains read pictures fast — often before we read the labels. That speed is exactly what makes charts powerful and also makes them easy to abuse. Let's learn the honest charts first, then the dishonest tricks.
The charts you'll meet most
| Chart | What it shows | Good for |
|---|---|---|
| Bar chart | A value for each category | Comparing categories (sales by region) |
| Histogram | How often numeric values occur | Seeing the distribution / shape |
| Line chart | A value changing over time | Trends (revenue per month) |
| Box plot | Median, quartiles, outliers (IQR) | Comparing spread between groups |
A note on the difference between a bar chart and a histogram, since beginners mix them up: a bar chart compares separate categories (apples vs. oranges) and the bars have gaps. A histogram shows the distribution of one numeric variable, with bars touching to show a continuous range.
How charts lie
The same true data can be drawn to tell opposite stories. Here are the classic deceptions to watch for.
1. The truncated y-axis (not starting at zero). This is the most common trick of all. If the vertical axis starts at, say, 95 instead of 0, a tiny difference can look like a cliff.
Y-axis starts at 0:           Y-axis starts at 98:
100|        _                 100|         ___
   |  _    | |                 99|  ___   |   |
   | | |   | |                 98| |   |  |   |
 50| | |   | |                     +---+--+---+
   | | |   | |                     A: 99  B: 100
   | | |   | |
  0+-+-+---+-+                "B is HUGE!" (it's a 1% gap)
    A:99  B:100
  "basically equal"
2. The cherry-picked time window. Show only the slice of time that supports the story. A stock that fell for ten years but rose last month can be drawn to look like a rocket — if you only plot last month. Always ask: "What does the longer trend look like?"
3. Dual axes that manufacture a fake link. Putting two unrelated lines on the same chart with two different vertical scales lets someone slide the scales until the lines appear to move together. Two lines tracking each other is meaningless if the axes were tuned to make it happen.
4. Misleading area and 3-D effects. If a designer doubles a circle's radius to show "twice as much," the circle's area actually grows fourfold — so the eye sees a quadrupling. 3-D pie charts tilt slices so the front ones look bigger. Our eyes judge area, and area is easy to distort.
5. Color and emphasis. Coloring one bar bright red while the rest are grey makes it pop regardless of its actual value. Emphasis steers your eye before you've read a single number.
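Two of these tricks, the truncated axis (1) and the area effect (4), can be put into numbers. The sketch below is a simplified model: it assumes a bar's drawn height is its value minus the point where the axis starts, and that a circle is drawn with the stated radius. The function names are made up for this example.

```python
import math

def drawn_height_ratio(a, b, axis_start):
    """How many times taller bar b is drawn than bar a for a given axis start."""
    return (b - axis_start) / (a - axis_start)

# Trick 1: the same two values, A = 99 and B = 100, on two different axes.
honest = drawn_height_ratio(99, 100, axis_start=0)
truncated = drawn_height_ratio(99, 100, axis_start=98)
print(f"axis from 0:  B looks {honest:.2f}x as tall as A")     # 1.01x
print(f"axis from 98: B looks {truncated:.2f}x as tall as A")  # 2.00x

def circle_area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

# Trick 4: doubling the radius to show "twice as much".
area_ratio = circle_area(2) / circle_area(1)
print(f"radius doubled: area grows {area_ratio:.0f}x")         # 4x
```

A 1% difference drawn as a doubling and a doubling drawn as a quadrupling both use true data, which is why these tricks are easy to miss.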
This deception-spotting skill has a long pedigree. Back in 1954, Darrell Huff wrote a famous little book called How to Lie with Statistics, full of exactly these tricks. The fact that the same tricks still work 70 years later tells you how naturally our eyes are fooled — and how valuable it is to train them.
10.9 Putting it together: a worked reading
Let's apply the whole chapter to one realistic headline: a news report that the average home price in a city is $850k. Here are the questions to ask.
- Mean or median? Home prices are heavily right-skewed (a few mansions stretch the tail). The mean $850k could be pulled up by a handful of luxury sales, while the typical (median) home is far cheaper. Demand the median.
- What's the spread? A single "average" hides whether prices are tightly clustered or all over the map. What's the IQR — the middle 50%?
- Any outliers? Did one $40-million estate sale inflate the figure? Investigate before believing.
- Population or sample? Is this every sale this year, or a small sample? Was it representative — or only downtown?
- Show me the chart. If there's a line going up, does the y-axis start at zero, and is the time window long enough to be honest?
10.10 Chapter recap
- Data is recorded observations; a variable is what varies. Variables are categorical or numeric; numeric splits into discrete (counts) and continuous (any value).
- You usually measure a sample to learn about a population — the spoonful and the pot. The sample is trustworthy only if it fairly represents the whole.
- Three averages: mean (add and divide, but dragged by extremes), median (the honest middle for skewed data), mode (the most common, the only average for categories).
- Spread matters as much as the middle: range (crude), variance (average squared distance), and standard deviation (typical distance, in real units). Percentiles, quartiles, and IQR describe position robustly.
- Investigate outliers; never delete them blindly.
- See the distribution with a histogram. The normal (bell) curve follows the 68–95–99.7 rule. Skew warps things — in right-skewed data, mean > median.
- Charts are fast but easy to abuse. Watch for truncated axes, cherry-picked windows, dual axes, area tricks, and emphasis. Check the axes, the window, the numbers, and the source.