# Statistics Literacy
Numbers seem objective, but they can mislead. Here's how to think critically about statistics.
## Correlation vs Causation
The most important concept: correlation does not prove causation.
Mendelian randomization helps establish causation in observational studies[3].
### Confounding Variables
A confounder is a third variable that influences both the exposure and the outcome, creating the appearance of a relationship that doesn't exist.
Common confounders:
- Age: Older people have more of almost everything (disease, income, grey hair)
- Socioeconomic status: Affects health, behavior, and opportunity
- Geography: Climate, culture, and local conditions
- Season: Affects activity, vitamin D, mood
## Understanding P-Values
A P-value is the probability of getting results at least as extreme as those observed IF the null hypothesis is true. A P-value of 0.05 means there was a 5% chance of seeing a result this extreme by random chance alone, assuming no real effect exists. It does NOT mean there's a 95% chance the finding is true.
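A quick simulation makes this concrete. The setup below (a fair coin, 100 flips per experiment, an exact two-sided binomial test) is a hypothetical illustration, not from the text: even when the null hypothesis is true by construction, "significant" P-values still appear at a rate of up to 5%.

```python
import random
from math import comb

random.seed(0)

def p_value_two_sided(heads: int, n: int) -> float:
    """Exact two-sided binomial test P-value against the null p = 0.5."""
    observed_dev = abs(heads - n / 2)
    # Sum the probabilities of every outcome at least as extreme as observed.
    return sum(comb(n, k) for k in range(n + 1)
               if abs(k - n / 2) >= observed_dev) / 2 ** n

# Flip a truly fair coin many times; the null hypothesis is true in every
# experiment, yet some still come out "significant" at P < 0.05.
trials, n_flips = 2000, 100
false_alarms = sum(
    p_value_two_sided(sum(random.random() < 0.5 for _ in range(n_flips)),
                      n_flips) < 0.05
    for _ in range(trials)
)
print(f"Fraction significant under the null: {false_alarms / trials:.3f}")
```

The fraction comes out at or below 0.05 (slightly below here, because the binomial test is discrete): that is all a significance threshold guarantees.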
P-values are ubiquitous in published research, and they are almost always significant[5].
### Statistical vs Practical Significance
A result can be statistically significant but practically meaningless:
- Drug reduces blood pressure by 0.5 mmHg (P < 0.01)
- This is too small to matter clinically, despite being "significant"
Always ask: How big is the effect?
Effect size predicts replication better than sample size[6].
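The blood-pressure example above can be worked through directly. This sketch assumes a hypothetical 0.5 mmHg reduction with a standard deviation of 10 mmHg, and uses a simple two-sample z-test (known sd, equal group sizes) to show that the same tiny effect flips from "not significant" to "highly significant" purely by enlarging the sample:

```python
from math import sqrt, erf

def two_sample_z_p(diff: float, sd: float, n: int) -> float:
    """Two-sided P-value for a difference in means (equal n, known sd)."""
    z = diff / (sd * sqrt(2 / n))
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); two-sided P = 2 * (1 - Phi(z))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Hypothetical numbers: a 0.5 mmHg reduction, sd = 10 mmHg per group.
diff, sd = 0.5, 10.0
for n in (50, 1_000, 100_000):
    print(f"n = {n:>7} per group: P = {two_sample_z_p(diff, sd, n):.4f}")
```

The effect never changes; only the P-value does. That is why effect size, not the P-value, answers "does this matter?"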
## The Replication Crisis
Many published findings may be false positives[7].
Why findings don't replicate:
- Publication bias: Only positive results get published
- P-hacking: Analyzing data multiple ways until something is "significant"
- Small samples: More prone to random flukes
- Flexibility: Many researcher choices affect results
P-hacking inflates false positive rates[8].
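The inflation is easy to quantify. Under the null hypothesis, a P-value is uniformly distributed on [0, 1], so running many analyses and keeping the best one multiplies the chance of a false positive. The "20 analyses" figure below is an illustrative assumption:

```python
import random

random.seed(1)

def null_p_value() -> float:
    """Under the null hypothesis, a P-value is uniform on [0, 1]."""
    return random.random()

# P-hacking sketch: run 20 analyses of pure-noise data, report the best.
trials, analyses = 5000, 20
at_least_one = sum(
    any(null_p_value() < 0.05 for _ in range(analyses))
    for _ in range(trials)
)
print(f"Chance of at least one 'significant' result: {at_least_one / trials:.2f}")
print(f"Theory: 1 - 0.95**20 = {1 - 0.95 ** 20:.2f}")
```

With 20 tries, the nominal 5% false positive rate balloons to roughly 64%.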
## How Language Obscures Results
Researchers use creative language to spin non-significant results[9].
Watch for spin language:
- "Trending towards significance" = Not significant
- "Marginally significant" = Not significant
- "Approached significance" = Not significant
- "Nearly significant" = Not significant
- "Suggestive of an effect" = Not significant
If it's not significant, it's not significant. There's no "almost."
## Understanding Risk
### Relative vs Absolute Risk
Headlines love relative risk because it sounds dramatic:
- "Drug reduces heart attack risk by 50%!"
But absolute risk tells the real story:
- Risk went from 2% to 1%
- One in 100 people benefit, 99 see no difference
How risk is presented affects understanding[10].
Always ask for absolute numbers.
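The headline example can be reduced to three numbers. The helper below is a hypothetical sketch that converts the same trial result (risk falling from 2% to 1%) into relative risk reduction, absolute risk reduction, and the number needed to treat (NNT):

```python
def risk_summary(control_risk: float, treated_risk: float) -> dict:
    """Express one result as relative reduction, absolute reduction, and NNT."""
    arr = control_risk - treated_risk          # absolute risk reduction
    return {
        "relative_risk_reduction": arr / control_risk,
        "absolute_risk_reduction": arr,
        "number_needed_to_treat": 1 / arr,     # patients treated per one helped
    }

# The headline example: risk drops from 2% to 1%.
s = risk_summary(0.02, 0.01)
print(f"Relative risk reduction: {s['relative_risk_reduction']:.0%}")
print(f"Absolute risk reduction: {s['absolute_risk_reduction']:.1%}")
print(f"Number needed to treat:  {s['number_needed_to_treat']:.0f}")
```

"50% reduction" and "1 person helped per 100 treated" describe the same result; only the second tells you how much it matters.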
### Base Rate Neglect
People often ignore how common something is[11].
Example: A disease test is 99% accurate.
- If 1 in 1000 people have the disease
- And you test positive
- What's the chance you actually have it?
Answer: Only about 9%! Most positive tests are false positives because the disease is rare.
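The 9% figure follows from Bayes' theorem. The sketch below interprets "99% accurate" as both 99% sensitivity and 99% specificity (an assumption the example leaves implicit):

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# 1-in-1000 prevalence, "99% accurate" test.
ppv = positive_predictive_value(0.001, 0.99, 0.99)
print(f"P(disease | positive) = {ppv:.1%}")
```

Out of 1000 people, about 1 true positive competes with about 10 false positives, so a positive test means disease only about 9% of the time.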
## Sample Size and Selection
### Small Samples Are Noisy
Small studies produce extreme results in both directions. The "best" and "worst" schools/hospitals/products are often just small ones with random variation.
When sample size is small, treat all findings skeptically.
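A simulation shows why the extremes are dominated by small samples. In this hypothetical setup, every "hospital" has the exact same true success rate; only the number of cases differs:

```python
import random

random.seed(2)

def observed_rate(true_rate: float, n: int) -> float:
    """Observed success rate in a random sample of size n."""
    return sum(random.random() < true_rate for _ in range(n)) / n

# Identical true performance everywhere; only sample size varies.
true_rate = 0.80
small = [observed_rate(true_rate, 20) for _ in range(200)]    # 20 cases each
large = [observed_rate(true_rate, 2000) for _ in range(200)]  # 2000 cases each

print(f"Small-sample range: {min(small):.2f} to {max(small):.2f}")
print(f"Large-sample range: {min(large):.2f} to {max(large):.2f}")
```

The small hospitals occupy both the top and the bottom of the league table, despite being identical, which is exactly the pattern the text describes.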
### Selection Bias
How samples are chosen matters:
- Survivors only: We only see successful startups, not the failures
- Volunteers: People who participate differ from those who don't
- Available data: Using what's easy to measure, not what matters
- Healthy user bias: People who take supplements may be healthier overall
## Critical Thinking About Statistics
Training in logical fallacies improves ability to detect misinformation[13].
### Questions to Ask
1. What's the sample? Who was included? Who was excluded?
2. What's the comparison? Compared to what? No comparison = no conclusion
3. Could there be confounders? What else might explain this?
4. How big is the effect? Statistical significance ≠ practical importance
5. Has it replicated? One study proves nothing
6. Who benefits? Industry-funded studies often favor funders
### Red Flags
- No comparison group
- Percentages without absolute numbers
- "Studies show" without citations
- Single study presented as definitive
- Correlation presented as causation
- Cherry-picked time periods
- Dramatic claims from small samples
## Interaction Effects
Interactions between factors are often misunderstood[14].
When two factors interact, you can't simply add their individual effects.
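A made-up 2×2 table (all risks below are hypothetical) shows what "can't simply add" means on the additive scale:

```python
# Hypothetical risks for two exposures, A and B, in a 2x2 table.
risk = {("no", "no"): 0.01, ("yes", "no"): 0.02,
        ("no", "yes"): 0.02, ("yes", "yes"): 0.08}

effect_a = risk[("yes", "no")] - risk[("no", "no")]   # A alone: +1 point
effect_b = risk[("no", "yes")] - risk[("no", "no")]   # B alone: +1 point
joint    = risk[("yes", "yes")] - risk[("no", "no")]  # both:    +7 points

additive_prediction = effect_a + effect_b             # naive sum: +2 points
interaction = joint - additive_prediction             # the extra +5 points
print(f"Additive prediction: {additive_prediction:.2f}, observed: {joint:.2f}")
print(f"Interaction on the additive scale: {interaction:+.2f}")
```

Each factor alone adds one percentage point, but together they add seven; the five-point excess is the interaction, and ignoring it badly understates the joint risk.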
## Visualizing Data
Good visualizations help understanding. Bad ones mislead.
Common tricks:
- Y-axis not starting at zero (exaggerates differences)
- Inconsistent scales between graphs
- Cherry-picked time ranges
- 3D effects that distort proportions
- Unlabeled axes
Always check the axes and scale of any graph before drawing conclusions.
## Summary: Statistical Self-Defense
1. Correlation ≠ causation → Look for confounders
2. P-values have limits → Significant doesn't mean important
3. Effect size matters → How big is the effect?
4. Replication required → One study proves nothing
5. Watch the spin → "Trending" means "failed"
6. Absolute > relative risk → Demand real numbers
7. Check base rates → Rare events have more false positives
8. Sample size matters → Small studies are noisy
9. Selection bias everywhere → How was the sample chosen?
10. Follow the money → Who funded this research?
---
## References
- [3] Davey Smith G, Ebrahim S (2008). Mendelian Randomisation and Causal Inference in Observational Epidemiology. PLOS Medicine. [DOI]
- [5] Chavalarias D, Wallach JD, Li AH, Ioannidis JP (2018). P values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLOS ONE. [DOI]
- [6] Voracek M, Tran US, Formann AK (2024). Challenging the N-Heuristic: Effect size, not sample size, predicts the replicability of psychological science. PLOS ONE. [DOI]
- [7] van Zwet EW, Cator EA (2023). Are most published research findings false? Trends in statistical power, publication selection bias, and the false discovery rate in psychology (1975–2017). PLOS ONE. [DOI]
- [8] Costa-Font J, Bover-Bover A (2024). Impact of redefining statistical significance on P-hacking and false positive rates: An agent-based model. PLOS ONE. [DOI]
- [9] Barnett AG, Wren JD (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLOS Biology. [DOI]
- [10] Okan Y, Garcia-Retamero R, Cokely ET, Maldonado A (2021). Comparing the impact of an icon array versus a bar graph on preference and understanding of risk information. PLOS ONE. [DOI]
- [11] Binder K, Krauss S, Bruckmaier G (2018). Visualizing the Bayesian 2-test case: The effect of tree diagrams on medical decision making. PLOS ONE. [DOI]
- [13] Halpern DF, Butler HA (2023). Learning about informal fallacies and the detection of fake news: An experimental intervention. PLOS ONE. [DOI]
- [14] VanderWeele TJ, Knol MJ (2021). Understanding interactions between risk factors, and assessing the utility of the additive and multiplicative models through simulations. PLOS ONE. [DOI]