Statistics Literacy
Numbers seem objective, but they can mislead. Here's how to think critically about statistics.
Correlation vs Causation
The most important concept: correlation does not prove causation.
Correlation: two variables tend to move together.
Causation: a change in one variable directly produces a change in the other.
Mendelian randomization helps establish causation in observational studies[1].
Confounding Variables
A confounder is a third variable that influences both the supposed cause and the supposed effect, creating the appearance of a relationship that doesn't exist.
Common confounders:
- Age: Older people have more of almost everything (disease, income, grey hair)
- Socioeconomic status: Affects health, behavior, and opportunity
- Geography: Climate, culture, and local conditions
- Season: Affects activity, vitamin D, mood
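A minimal simulation (all numbers invented for illustration) of how a confounder like age can manufacture a correlation between two variables, here grey hair and disease, that have no direct link:

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 10_000
age = [random.uniform(20, 80) for _ in range(n)]

# Age drives both variables; grey hair and disease share no direct link.
grey_hair = [a / 80 + random.gauss(0, 0.1) for a in age]
disease   = [a / 80 + random.gauss(0, 0.1) for a in age]

print(f"corr(grey hair, disease) = {corr(grey_hair, disease):.2f}")

# Hold the confounder roughly fixed (a narrow age band) and the
# spurious correlation largely vanishes.
band = [(g, d) for a, g, d in zip(age, grey_hair, disease) if 40 <= a <= 45]
g_band = [g for g, _ in band]
d_band = [d for _, d in band]
print(f"corr within ages 40-45   = {corr(g_band, d_band):.2f}")
```

Stratifying by the suspected confounder, as in the last step, is one standard way to check whether a correlation survives once the third variable is held fixed.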
Understanding P-Values
Reality: A P-value is the probability of getting results at least as extreme as those observed IF the null hypothesis is true. A P-value of 0.05 means that if there were truly no effect, you would see a result this extreme about 5% of the time. It does NOT mean there's a 95% chance the finding is true.
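You can see this definition in action with a simulation: flip a fair coin 100 times per experiment (so the null hypothesis is true by construction) and count how often an exact two-sided test still comes out "significant":

```python
import math
import random

N = 100  # coin flips per experiment
# Exact binomial probabilities under the fair-coin null.
pmf = [math.comb(N, k) * 0.5**N for k in range(N + 1)]

def two_sided_p(heads):
    """Exact two-sided p-value: probability of a result at least this extreme."""
    dev = abs(heads - N / 2)
    return sum(p for k, p in enumerate(pmf) if abs(k - N / 2) >= dev)

random.seed(1)
trials = 2_000
hits = sum(two_sided_p(sum(random.random() < 0.5 for _ in range(N))) < 0.05
           for _ in range(trials))

# The null is true in every trial, yet about 5% (slightly less here,
# because the binomial is discrete) still come out "significant".
print(f"false positive rate: {hits / trials:.3f}")
```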
P-values are ubiquitous in published research, and almost always significant[2].
Statistical vs Practical Significance
A result can be statistically significant but practically meaningless:
- Drug reduces blood pressure by 0.5 mmHg (P < 0.01)
- This is too small to matter clinically, despite being "significant"
Always ask: How big is the effect?
Effect size predicts replication better than sample size[3].
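The blood-pressure example can be worked through with a simple z-test. The 0.5 mmHg reduction is from the text; the sample size and standard deviation are assumed values chosen to show how a huge trial makes a tiny effect "significant":

```python
import math

# Hypothetical numbers (assumed, not from a real trial):
n = 50_000    # patients per arm
sd = 15.0     # mmHg standard deviation of blood pressure
effect = 0.5  # mmHg mean reduction (the example in the text)

se = sd * math.sqrt(2 / n)                            # SE of the difference in means
z = effect / se
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided p-value
cohens_d = effect / sd                                # standardized effect size

print(f"z = {z:.2f}, p = {p:.1e}")     # highly "significant"
print(f"Cohen's d = {cohens_d:.3f}")   # but a negligible effect
```

A Cohen's d of about 0.03 is far below even the conventional "small effect" threshold of 0.2, which is exactly the gap between statistical and practical significance.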
The Replication Crisis
Many published findings may be false positives[4].
Why findings don't replicate:
- Publication bias: Only positive results get published
- P-hacking: Analyzing data multiple ways until something is "significant"
- Small samples: More prone to random flukes
- Flexibility: Many researcher choices affect results
P-hacking inflates false positive rates[5].
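A sketch of why trying many analyses inflates false positives: run 20 comparisons per "experiment" on data with no real effect, stopping at the first p < 0.05 (using a rough normal-approximation test, an assumption for simplicity):

```python
import math
import random
import statistics

def approx_p(a, b):
    """Rough two-sample z-test p-value (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(2)
experiments = 500
analyses_per_experiment = 20  # outcomes/subgroups tried until one "works"
fooled = 0
for _ in range(experiments):
    for _ in range(analyses_per_experiment):
        a = [random.gauss(0, 1) for _ in range(30)]
        b = [random.gauss(0, 1) for _ in range(30)]  # same distribution: no effect
        if approx_p(a, b) < 0.05:
            fooled += 1
            break

# With ~20 shots at p < 0.05, roughly 1 - 0.95**20, about 64%, of
# experiments "find" something even though every null is true.
print(f"experiments reporting a 'significant' result: {fooled / experiments:.0%}")
```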
How Language Obscures Results
Researchers use creative language to spin non-significant results[6].
Watch for spin language:
- "Trending towards significance" = Not significant
- "Marginally significant" = Not significant
- "Approached significance" = Not significant
- "Nearly significant" = Not significant
- "Suggestive of an effect" = Not significant
If it's not significant, it's not significant. There's no "almost."
Understanding Risk
Relative vs Absolute Risk
Headlines love relative risk because it sounds dramatic:
- "Drug reduces heart attack risk by 50%!"
But absolute risk tells the real story:
- Risk went from 2% to 1%
- One in 100 people benefit, 99 see no difference
How risk is presented affects understanding[7].
Always ask for absolute numbers.
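The headline example above, worked out in code, with the number needed to treat (NNT) added as a third way to express the same numbers:

```python
control_risk = 0.02  # 2 in 100 have a heart attack without the drug
treated_risk = 0.01  # 1 in 100 with the drug

rrr = (control_risk - treated_risk) / control_risk  # relative risk reduction
arr = control_risk - treated_risk                   # absolute risk reduction
nnt = 1 / arr                                       # number needed to treat

print(f"relative risk reduction: {rrr:.0%}")  # the headline: "50%!"
print(f"absolute risk reduction: {arr:.1%}")  # the real story: 1 point
print(f"number needed to treat:  {nnt:.0f}")  # treat 100 people to help 1
```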
Base Rate Neglect
People often ignore how common something is[8].
Example: A disease test is 99% accurate.
- If 1 in 1000 people have the disease
- And you test positive
- What's the chance you actually have it?
Answer: Only about 9%! Most positive tests are false positives because the disease is rare.
The base rate is how common the condition is in the population before you look at any test result.
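The disease-test example above is a direct application of Bayes' theorem:

```python
prevalence = 1 / 1000  # base rate: 1 in 1000 has the disease
sensitivity = 0.99     # P(test positive | disease); "99% accurate"
specificity = 0.99     # P(test negative | no disease)

# Total probability of a positive test: true positives + false positives.
p_positive = (sensitivity * prevalence +
              (1 - specificity) * (1 - prevalence))

# Bayes' theorem: P(disease | positive test).
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.1%}")  # about 9%
```

The false positives (1% of the 999 healthy people) outnumber the true positives (99% of the 1 sick person) roughly ten to one, which is why the answer is so far below 99%.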
Sample Size and Selection
Small Samples Are Noisy
Small studies produce extreme results in both directions. The "best" and "worst" schools/hospitals/products are often just small ones with random variation.
When sample size is small, treat all findings skeptically.
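A quick simulation of the hospital example (all parameters invented): give every hospital the identical underlying success rate and compare the observed spread at two sample sizes:

```python
import random
import statistics

random.seed(3)
TRUE_RATE = 0.5  # every "hospital" shares the same underlying success rate

def observed_rates(n_patients, n_hospitals=1000):
    """Observed success rate at each hospital; any spread is pure chance."""
    return [sum(random.random() < TRUE_RATE for _ in range(n_patients)) / n_patients
            for _ in range(n_hospitals)]

small = observed_rates(10)    # 10 patients each
large = observed_rates(1000)  # 1000 patients each

print(f"n=10:   best={max(small):.2f}  worst={min(small):.2f}  "
      f"sd={statistics.stdev(small):.3f}")
print(f"n=1000: best={max(large):.2f}  worst={min(large):.2f}  "
      f"sd={statistics.stdev(large):.3f}")
# The "best" and "worst" performers are almost always the small ones.
```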
Selection Bias
How samples are chosen matters:
- Survivors only: We only see successful startups, not the failures
- Volunteers: People who participate differ from those who don't
- Available data: Using what's easy to measure, not what matters
- Healthy user bias: People who take supplements may be healthier overall
Critical Thinking About Statistics
Training in logical fallacies improves ability to detect misinformation[9].
Questions to Ask
1. What's the sample? Who was included? Who was excluded?
2. What's the comparison? Compared to what? No comparison = no conclusion
3. Could there be confounders? What else might explain this?
4. How big is the effect? Statistical significance ≠ practical importance
5. Has it replicated? One study proves nothing
6. Who benefits? Industry-funded studies often favor funders
Red Flags
- No comparison group
- Percentages without absolute numbers
- "Studies show" without citations
- Single study presented as definitive
- Correlation presented as causation
- Cherry-picked time periods
- Dramatic claims from small samples
Interaction Effects
Interactions between factors are often misunderstood[10].
An interaction occurs when the effect of one factor depends on the level of another.
When two factors interact, you can't simply add their individual effects.
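A numeric sketch (risks invented for illustration, in the spirit of the classic smoking-and-asbestos example) of why you can't just add the individual effects:

```python
# Hypothetical risks (assumed values):
baseline      = 0.01  # risk with neither factor
factor_a_only = 0.02  # factor A alone doubles the baseline
factor_b_only = 0.02  # factor B alone doubles the baseline
both          = 0.10  # both factors together

# If the effects were purely additive, we'd expect:
additive_prediction = (baseline +
                       (factor_a_only - baseline) +
                       (factor_b_only - baseline))

interaction = both - additive_prediction  # excess risk beyond additivity
print(f"additive prediction: {additive_prediction:.2f}, observed: {both:.2f}")
print(f"interaction (excess risk): {interaction:.2f}")
```

Here the additive model predicts 3% risk, but the observed 10% leaves a 7-point excess that only shows up when both factors are present together.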
Visualizing Data
Good visualizations help understanding. Bad ones mislead.
Common tricks:
- Y-axis not starting at zero (exaggerates differences)
- Inconsistent scales between graphs
- Cherry-picked time ranges
- 3D effects that distort proportions
- Unlabeled axes
Always check the axes and scale of any graph before drawing conclusions.
Summary: Statistical Self-Defense
1. Correlation ≠ causation → Look for confounders
2. P-values have limits → Significant doesn't mean important
3. Effect size matters → How big is the effect?
4. Replication required → One study proves nothing
5. Watch the spin → "Trending" means "failed"
6. Absolute > relative risk → Demand real numbers
7. Check base rates → Rare events have more false positives
8. Sample size matters → Small studies are noisy
9. Selection bias everywhere → How was the sample chosen?
10. Follow the money → Who funded this research?
---
References
1. Davey Smith G, Ebrahim S (2008). Mendelian Randomisation and Causal Inference in Observational Epidemiology. PLOS Medicine.
2. Chavalarias D, Wallach JD, Li AH, Ioannidis JP (2018). P values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLOS ONE.
3. Voracek M, Tran US, Formann AK (2024). Challenging the N-Heuristic: Effect size, not sample size, predicts the replicability of psychological science. PLOS ONE.
4. van Zwet EW, Cator EA (2023). Are most published research findings false? Trends in statistical power, publication selection bias, and the false discovery rate in psychology (1975–2017). PLOS ONE.
5. Costa-Font J, Bover-Bover A (2024). Impact of redefining statistical significance on P-hacking and false positive rates: An agent-based model. PLOS ONE.
6. Barnett AG, Wren JD (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLOS Biology.
7. Okan Y, Garcia-Retamero R, Cokely ET, Maldonado A (2021). Comparing the impact of an icon array versus a bar graph on preference and understanding of risk information. PLOS ONE.
8. Binder K, Krauss S, Bruckmaier G (2018). Visualizing the Bayesian 2-test case: The effect of tree diagrams on medical decision making. PLOS ONE.
9. Halpern DF, Butler HA (2023). Learning about informal fallacies and the detection of fake news: An experimental intervention. PLOS ONE.
10. VanderWeele TJ, Knol MJ (2021). Understanding interactions between risk factors, and assessing the utility of the additive and multiplicative models through simulations. PLOS ONE.