Statistics Literacy
Numbers seem objective, but they can mislead. Here's how to think critically about statistics.
Correlation vs Causation
The most important concept: correlation does not prove causation.
Correlation: two variables tend to move together.
Causation: a change in one variable directly produces a change in the other.
Mendelian randomization helps establish causation in observational studies[1].
Confounding Variables
A confounder is a third variable that influences both the supposed cause and the supposed effect, creating the appearance of a relationship that doesn't exist.
Common confounders:
- Age: Older people have more of almost everything (disease, income, grey hair)
- Socioeconomic status: Affects health, behavior, and opportunity
- Geography: Climate, culture, and local conditions
- Season: Affects activity, vitamin D, mood
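A minimal simulation (all numbers invented for illustration) of how a confounder like age can manufacture a correlation between two variables, here grey hair and disease, that have no direct link:

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 10_000
age = [random.uniform(20, 80) for _ in range(n)]

# Age drives both variables; grey hair and disease share no direct link.
grey_hair = [a / 80 + random.gauss(0, 0.1) for a in age]
disease   = [a / 80 + random.gauss(0, 0.1) for a in age]

print(f"corr(grey hair, disease) = {corr(grey_hair, disease):.2f}")

# Hold the confounder roughly fixed (a narrow age band) and the
# spurious correlation largely vanishes.
band = [(g, d) for a, g, d in zip(age, grey_hair, disease) if 40 <= a <= 45]
g_band = [g for g, _ in band]
d_band = [d for _, d in band]
print(f"corr within ages 40-45   = {corr(g_band, d_band):.2f}")
```

Stratifying by the suspected confounder, as in the last step, is one standard way to check whether a correlation survives once the third variable is held fixed.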
Understanding P-Values
Reality: A P-value is the probability of getting results at least as extreme as those observed IF the null hypothesis is true. A P-value of 0.05 means that if there were truly no effect, you would see a result this extreme about 5% of the time. It does NOT mean there's a 95% chance the finding is true.
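You can see this definition in action with a simulation: flip a fair coin 100 times per experiment (so the null hypothesis is true by construction) and count how often an exact two-sided test still comes out "significant":

```python
import math
import random

N = 100  # coin flips per experiment
# Exact binomial probabilities under the fair-coin null.
pmf = [math.comb(N, k) * 0.5**N for k in range(N + 1)]

def two_sided_p(heads):
    """Exact two-sided p-value: probability of a result at least this extreme."""
    dev = abs(heads - N / 2)
    return sum(p for k, p in enumerate(pmf) if abs(k - N / 2) >= dev)

random.seed(1)
trials = 2_000
hits = sum(two_sided_p(sum(random.random() < 0.5 for _ in range(N))) < 0.05
           for _ in range(trials))

# The null is true in every trial, yet about 5% (slightly less here,
# because the binomial is discrete) still come out "significant".
print(f"false positive rate: {hits / trials:.3f}")
```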
P-values are ubiquitous in published research, and almost always significant[2].
Statistical vs Practical Significance
A result can be statistically significant but practically meaningless:
- Drug reduces blood pressure by 0.5 mmHg (P < 0.01)
- This is too small to matter clinically, despite being "significant"
Always ask: How big is the effect?
Effect size predicts replication better than sample size[3].
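The blood-pressure example can be worked through with a simple z-test. The 0.5 mmHg reduction is from the text; the sample size and standard deviation are assumed values chosen to show how a huge trial makes a tiny effect "significant":

```python
import math

# Hypothetical numbers (assumed, not from a real trial):
n = 50_000    # patients per arm
sd = 15.0     # mmHg standard deviation of blood pressure
effect = 0.5  # mmHg mean reduction (the example in the text)

se = sd * math.sqrt(2 / n)                            # SE of the difference in means
z = effect / se
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided p-value
cohens_d = effect / sd                                # standardized effect size

print(f"z = {z:.2f}, p = {p:.1e}")     # highly "significant"
print(f"Cohen's d = {cohens_d:.3f}")   # but a negligible effect
```

A Cohen's d of about 0.03 is far below even the conventional "small effect" threshold of 0.2, which is exactly the gap between statistical and practical significance.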
The Replication Crisis
Many published findings may be false positives[4].
Why findings don't replicate:
- Publication bias: Only positive results get published
- P-hacking: Analyzing data multiple ways until something is "significant"
- Small samples: More prone to random flukes
- Flexibility: Many researcher choices affect results
P-hacking inflates false positive rates[5].
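A sketch of why trying many analyses inflates false positives: run 20 comparisons per "experiment" on data with no real effect, stopping at the first p < 0.05 (using a rough normal-approximation test, an assumption for simplicity):

```python
import math
import random
import statistics

def approx_p(a, b):
    """Rough two-sample z-test p-value (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(2)
experiments = 500
analyses_per_experiment = 20  # outcomes/subgroups tried until one "works"
fooled = 0
for _ in range(experiments):
    for _ in range(analyses_per_experiment):
        a = [random.gauss(0, 1) for _ in range(30)]
        b = [random.gauss(0, 1) for _ in range(30)]  # same distribution: no effect
        if approx_p(a, b) < 0.05:
            fooled += 1
            break

# With ~20 shots at p < 0.05, roughly 1 - 0.95**20, about 64%, of
# experiments "find" something even though every null is true.
print(f"experiments reporting a 'significant' result: {fooled / experiments:.0%}")
```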
How Language Obscures Results
Researchers use creative language to spin non-significant results[6].
Watch for spin language:
- "Trending towards significance" = Not significant
- "Marginally significant" = Not significant
- "Approached significance" = Not significant
- "Nearly significant" = Not significant
- "Suggestive of an effect" = Not significant
If it's not significant, it's not significant. There's no "almost."
Understanding Risk
Relative vs Absolute Risk
Headlines love relative risk because it sounds dramatic:
- "Drug reduces heart attack risk by 50%!"
But absolute risk tells the real story:
- Risk went from 2% to 1%
- One in 100 people benefit, 99 see no difference
How risk is presented affects understanding[7].
Always ask for absolute numbers.
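The headline example above, worked out in code, with the number needed to treat (NNT) added as a third way to express the same numbers:

```python
control_risk = 0.02  # 2 in 100 have a heart attack without the drug
treated_risk = 0.01  # 1 in 100 with the drug

rrr = (control_risk - treated_risk) / control_risk  # relative risk reduction
arr = control_risk - treated_risk                   # absolute risk reduction
nnt = 1 / arr                                       # number needed to treat

print(f"relative risk reduction: {rrr:.0%}")  # the headline: "50%!"
print(f"absolute risk reduction: {arr:.1%}")  # the real story: 1 point
print(f"number needed to treat:  {nnt:.0f}")  # treat 100 people to help 1
```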
Base Rate Neglect
People often ignore how common something is[8].
Example: A disease test is 99% accurate.
- If 1 in 1000 people have the disease
- And you test positive
- What's the chance you actually have it?
Answer: Only about 9%! Most positive tests are false positives because the disease is rare.
The base rate is how common the condition is in the population before you look at any test result.
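The disease-test example above is a direct application of Bayes' theorem:

```python
prevalence = 1 / 1000  # base rate: 1 in 1000 has the disease
sensitivity = 0.99     # P(test positive | disease); "99% accurate"
specificity = 0.99     # P(test negative | no disease)

# Total probability of a positive test: true positives + false positives.
p_positive = (sensitivity * prevalence +
              (1 - specificity) * (1 - prevalence))

# Bayes' theorem: P(disease | positive test).
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.1%}")  # about 9%
```

The false positives (1% of the 999 healthy people) outnumber the true positives (99% of the 1 sick person) roughly ten to one, which is why the answer is so far below 99%.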
Sample Size and Selection
Small Samples Are Noisy
Small studies produce extreme results in both directions. The "best" and "worst" schools/hospitals/products are often just small ones with random variation.
When sample size is small, treat all findings skeptically.
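A quick simulation of the hospital example (all parameters invented): give every hospital the identical underlying success rate and compare the observed spread at two sample sizes:

```python
import random
import statistics

random.seed(3)
TRUE_RATE = 0.5  # every "hospital" shares the same underlying success rate

def observed_rates(n_patients, n_hospitals=1000):
    """Observed success rate at each hospital; any spread is pure chance."""
    return [sum(random.random() < TRUE_RATE for _ in range(n_patients)) / n_patients
            for _ in range(n_hospitals)]

small = observed_rates(10)    # 10 patients each
large = observed_rates(1000)  # 1000 patients each

print(f"n=10:   best={max(small):.2f}  worst={min(small):.2f}  "
      f"sd={statistics.stdev(small):.3f}")
print(f"n=1000: best={max(large):.2f}  worst={min(large):.2f}  "
      f"sd={statistics.stdev(large):.3f}")
# The "best" and "worst" performers are almost always the small ones.
```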
Selection Bias
How samples are chosen matters:
- Survivors only: We only see successful startups, not the failures
- Volunteers: People who participate differ from those who don't
- Available data: Using what's easy to measure, not what matters
- Healthy user bias: People who take supplements may be healthier overall
Critical Thinking About Statistics
Training in logical fallacies improves ability to detect misinformation[9].
Questions to Ask
1. What's the sample? Who was included? Who was excluded?
2. What's the comparison? Compared to what? No comparison = no conclusion
3. Could there be confounders? What else might explain this?
4. How big is the effect? Statistical significance ≠ practical importance
5. Has it replicated? One study proves nothing
6. Who benefits? Industry-funded studies often favor funders
Red Flags
- No comparison group
- Percentages without absolute numbers
- "Studies show" without citations
- Single study presented as definitive
- Correlation presented as causation
- Cherry-picked time periods
- Dramatic claims from small samples
Interaction Effects
Interactions between factors are often misunderstood[10].
An interaction occurs when the effect of one factor depends on the level of another.
When two factors interact, you can't simply add their individual effects.
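A numeric sketch (risks invented for illustration, in the spirit of the classic smoking-and-asbestos example) of why you can't just add the individual effects:

```python
# Hypothetical risks (assumed values):
baseline      = 0.01  # risk with neither factor
factor_a_only = 0.02  # factor A alone doubles the baseline
factor_b_only = 0.02  # factor B alone doubles the baseline
both          = 0.10  # both factors together

# If the effects were purely additive, we'd expect:
additive_prediction = (baseline +
                       (factor_a_only - baseline) +
                       (factor_b_only - baseline))

interaction = both - additive_prediction  # excess risk beyond additivity
print(f"additive prediction: {additive_prediction:.2f}, observed: {both:.2f}")
print(f"interaction (excess risk): {interaction:.2f}")
```

Here the additive model predicts 3% risk, but the observed 10% leaves a 7-point excess that only shows up when both factors are present together.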
Visualizing Data
Good visualizations help understanding. Bad ones mislead.
Common tricks:
- Y-axis not starting at zero (exaggerates differences)
- Inconsistent scales between graphs
- Cherry-picked time ranges
- 3D effects that distort proportions
- Unlabeled axes
Always check the axes and scale of any graph before drawing conclusions.
Summary: Statistical Self-Defense
1. Correlation ≠ causation → Look for confounders
2. P-values have limits → Significant doesn't mean important
3. Effect size matters → How big is the effect?
4. Replication required → One study proves nothing
5. Watch the spin → "Trending" means "failed"
6. Absolute > relative risk → Demand real numbers
7. Check base rates → Rare events have more false positives
8. Sample size matters → Small studies are noisy
9. Selection bias everywhere → How was the sample chosen?
10. Follow the money → Who funded this research?
---
References
1. Davey Smith G, Ebrahim S (2008). Mendelian Randomisation and Causal Inference in Observational Epidemiology. PLOS Medicine.
2. Chavalarias D, Wallach JD, Li AH, Ioannidis JP (2018). P values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLOS ONE.
3. Voracek M, Tran US, Formann AK (2024). Challenging the N-Heuristic: Effect size, not sample size, predicts the replicability of psychological science. PLOS ONE.
4. van Zwet EW, Cator EA (2023). Are most published research findings false? Trends in statistical power, publication selection bias, and the false discovery rate in psychology (1975–2017). PLOS ONE.
5. Costa-Font J, Bover-Bover A (2024). Impact of redefining statistical significance on P-hacking and false positive rates: An agent-based model. PLOS ONE.
6. Barnett AG, Wren JD (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLOS Biology.
7. Okan Y, Garcia-Retamero R, Cokely ET, Maldonado A (2021). Comparing the impact of an icon array versus a bar graph on preference and understanding of risk information. PLOS ONE.
8. Binder K, Krauss S, Bruckmaier G (2018). Visualizing the Bayesian 2-test case: The effect of tree diagrams on medical decision making. PLOS ONE.
9. Halpern DF, Butler HA (2023). Learning about informal fallacies and the detection of fake news: An experimental intervention. PLOS ONE.
10. VanderWeele TJ, Knol MJ (2021). Understanding interactions between risk factors, and assessing the utility of the additive and multiplicative models through simulations. PLOS ONE.