Statistical Significance Demystified: What Growth Managers Need to Know
A/B test statistics don't have to be a nightmare. This guide explains key concepts without complex math — just what you actually need for sound decisions.
Why Statistics Matter
Imagine your A/B test shows +5% conversion rate for variant B. Great, right? But what if it's just random chance? What if next week the result is reversed?
Statistical significance tells you: How confident can we be that the result isn't random?
Without proper statistics:
- ❌ You implement changes that don't work
- ❌ You reject changes that do work
- ❌ You waste time and resources
- ❌ You lose trust in data
Core Concepts (No Math Required)
P-value: Probability of Chance
What it is: The p-value is the probability of seeing a difference this large (or larger) purely by chance, assuming no real difference exists.
Intuitive explanation:
- p = 0.05 means: "If the variants were truly identical, a result at least this large would show up only 5% of the time"
- p = 0.01 means: "If the variants were truly identical, a result at least this large would show up only 1% of the time"
Note the direction: the p-value does not say "there's a 5% chance the result is random." It assumes no real effect and asks how surprising your data would be.
Thresholds:
| P-value | Interpretation | Recommendation |
|---|---|---|
| p < 0.01 | Highly significant | Very confident result |
| p < 0.05 | Significant | Standard threshold |
| p < 0.10 | Marginally significant | Be cautious |
| p ≥ 0.10 | Not significant | Cannot conclude |
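For conversion-rate A/B tests, the p-value is usually computed with a two-proportion z-test. Here is a minimal sketch using only the Python standard library (the function name and example numbers are illustrative; the normal approximation is fine at typical A/B sample sizes):

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Example: 500/10,000 conversions (5.0%) vs 560/10,000 (5.6%)
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p = {p:.3f}")
```

A +12% relative lift on 10,000 users per variant lands just above the 0.05 threshold here, which is exactly why thresholds and sample sizes matter.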
Confidence Interval: Range of Possible Values
What it is: The range where the true value likely falls.
Example:
- Result: +5% conversion, 95% CI: [2%, 8%]
- Means: "We're 95% confident the true effect is between +2% and +8%"
Why it matters:
- If CI contains 0, result is not significant
- Narrow CI = more precise estimate
- Wide CI = you need more data
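The confidence interval for a difference in conversion rates can be computed the same way. A minimal sketch (illustrative names and numbers; 1.96 is the z-multiplier for 95% confidence):

```python
from math import sqrt

def diff_ci_95(conv_a, n_a, conv_b, n_b):
    """95% confidence interval for the difference in conversion
    rates (normal approximation, unpooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    margin = 1.96 * se          # z-multiplier for 95% confidence
    return diff - margin, diff + margin

lo, hi = diff_ci_95(500, 10_000, 560, 10_000)
print(f"95% CI for the lift: [{lo:+.4f}, {hi:+.4f}]")
```

In this example the interval dips just below zero, so the result is not significant at the 95% level, even though the point estimate looks like a solid lift.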
Statistical Power: Ability to Detect Effect
What it is: Probability that the test detects a real effect if it exists.
Why 80% power is standard:
- 80% power = 80% chance of detecting a real effect
- 20% chance you'll miss the effect (false negative)
Trade-off:
| Power | Sample size | Risk |
|---|---|---|
| 70% | Smaller | 30% false negatives |
| 80% | Medium | 20% false negatives (standard) |
| 90% | Larger | 10% false negatives |
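Power can be approximated directly for a two-proportion test. A stdlib-only sketch (function names are illustrative; `alpha_z=1.96` corresponds to a two-sided p < 0.05):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def approximate_power(p_base, p_variant, n_per_variant, alpha_z=1.96):
    """Approximate power of a two-proportion z-test: the chance of
    reaching significance if the true rates really are p_base / p_variant."""
    se = sqrt(p_base * (1 - p_base) / n_per_variant
              + p_variant * (1 - p_variant) / n_per_variant)
    z_effect = abs(p_variant - p_base) / se
    return norm_cdf(z_effect - alpha_z)

# 5% baseline, +10% relative lift, 31,000 users per variant
print(f"power ≈ {approximate_power(0.05, 0.055, 31_000):.0%}")
```

With those inputs the result comes out at roughly the standard 80%, which is where the conventional sample-size recommendations come from.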
Sample Size: How Many Users You Need
Main factors:
- Baseline conversion rate — lower = need more data
- Minimum detectable effect (MDE) — smaller effect = need more data
- Desired power — higher power = need more data
- Significance level — stricter threshold (e.g. p < 0.01 instead of p < 0.05) = need more data
Practical table (80% power, p<0.05):
| Baseline CR | MDE 10% relative | MDE 5% relative |
|---|---|---|
| 1% | ~160,000/variant | ~640,000/variant |
| 5% | ~31,000/variant | ~122,000/variant |
| 10% | ~15,000/variant | ~58,000/variant |
| 20% | ~6,500/variant | ~26,000/variant |
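These figures follow from the standard two-proportion formula n = (z_α/2 + z_β)² · (p₁(1−p₁) + p₂(1−p₂)) / (p₁−p₂)². A minimal calculator sketch (z-values hard-coded for a two-sided p < 0.05 and 80% power):

```python
from math import ceil

def sample_size_per_variant(p_base, mde_relative,
                            z_alpha=1.96, z_beta=0.84):
    """Users needed per variant for a two-proportion test.
    z_alpha = 1.96 (two-sided alpha 0.05), z_beta = 0.84 (80% power)."""
    p_var = p_base * (1 + mde_relative)      # variant rate at the MDE
    delta = p_var - p_base                   # absolute difference to detect
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

for base in (0.01, 0.05, 0.10, 0.20):
    print(f"{base:.0%} baseline: "
          f"{sample_size_per_variant(base, 0.10):>9,} (10% MDE)  "
          f"{sample_size_per_variant(base, 0.05):>9,} (5% MDE)")
```

Use a dedicated calculator for real planning, but this shows the mechanics: halving the MDE roughly quadruples the required sample.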
Practical Applications
When to End a Test?
Never end a test early just because you see a significant result!
Decision framework:
1. Has test reached planned sample size?
→ NO: Wait (even if result is significant)
→ YES: Continue to step 2
2. Is result statistically significant (p < 0.05)?
→ YES: Implement winner
→ NO: Continue to step 3
3. What does the confidence interval tell you?
→ CI is narrow and close to 0: Probably no meaningful effect
→ CI is wide: Need more data or a larger MDE
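The three-step framework above can be expressed as a small helper. This is an illustrative translation, not a library API; `meaningful_lift` is a hypothetical practical-significance bar you would set per business case:

```python
def decide_next_step(reached_planned_n, p_value, ci_low, ci_high,
                     meaningful_lift=0.05):
    """Illustrative translation of the three-step decision framework.
    meaningful_lift: smallest lift worth acting on (hypothetical value)."""
    if not reached_planned_n:
        return "keep running"                 # step 1: never stop early
    if p_value < 0.05:
        return "implement winner"             # step 2: significant result
    if ci_high < meaningful_lift:             # step 3: narrow CI near zero
        return "reject: no meaningful effect"
    return "inconclusive: need more data or a larger MDE"

print(decide_next_step(True, 0.45, -0.02, 0.04))   # narrow CI
print(decide_next_step(True, 0.15, -0.02, 0.12))   # wide CI
```

The two calls mirror the non-significant scenarios in the next section: a narrow interval hugging zero lets you reject, a wide one leaves you without a conclusion.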
How to Interpret Results
Scenario 1: Significant positive result
- Result: +8%, p=0.02, CI [3%, 13%]
- Interpretation: ✅ Implement the change
Scenario 2: Non-significant result, narrow CI
- Result: +1%, p=0.45, CI [-2%, 4%]
- Interpretation: Probably no meaningful effect, can reject
Scenario 3: Non-significant result, wide CI
- Result: +5%, p=0.15, CI [-2%, 12%]
- Interpretation: Inconclusive — need more data
5 Most Common Mistakes
❌ Mistake 1: Peeking Problem
Problem: Check results daily and stop when you see significance. Consequence: Up to 30% false positives! Solution: Define sample size upfront and don't change it.
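You can see the peeking problem in a quick A/A simulation: both variants share the same true rate, so every "significant" result is a false positive by construction. A sketch with illustrative parameters (stdlib only):

```python
import random
from math import sqrt

def simulate_peeking(n_per_variant=10_000, checks=20,
                     trials=200, base_rate=0.05, seed=42):
    """A/A test with repeated peeking: stop the first time any
    interim z-test crosses |z| > 1.96. Returns false positive rate."""
    random.seed(seed)
    step = n_per_variant // checks
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(checks):
            conv_a += sum(random.random() < base_rate for _ in range(step))
            conv_b += sum(random.random() < base_rate for _ in range(step))
            n += step
            pool = (conv_a + conv_b) / (2 * n)
            if pool in (0.0, 1.0):
                continue
            se = sqrt(pool * (1 - pool) * 2 / n)
            if abs(conv_b / n - conv_a / n) / se > 1.96:
                false_positives += 1          # declared a winner on pure noise
                break
    return false_positives / trials

rate = simulate_peeking()
print(f"False positive rate with 20 peeks: {rate:.0%}")
```

With no peeking the false positive rate would sit at the nominal 5%; checking 20 times and stopping at the first significant reading inflates it several-fold.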
❌ Mistake 2: Multiple Comparisons
Problem: Test 10 variants and celebrate the one significant one. Consequence: With 10 variants, ~40% chance of false positive. Solution: Bonferroni correction or single primary metric.
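The ~40% figure is straightforward to verify: with k independent tests at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^k. A two-line check, plus the Bonferroni-adjusted threshold:

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(f"10 variants at p<0.05: {family_wise_error(0.05, 10):.0%} "
      f"chance of a false positive")
print(f"Bonferroni-corrected threshold: p < {0.05 / 10}")
```

Bonferroni simply divides the threshold by the number of comparisons, so each of the 10 variants must clear p < 0.005 for the family-wise error to stay near 5%.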
❌ Mistake 3: Underpowered Tests
Problem: Test with too small sample size. Consequence: Most real effects remain undetected. Solution: Sample size calculation upfront.
❌ Mistake 4: Ignoring Effect Size
Problem: Focus only on p-value, not effect magnitude. Consequence: +0.1% can be "significant" with enough data. Solution: Always look at CI and practical significance.
❌ Mistake 5: P-hacking
Problem: Try different segments and metrics until you find significance. Consequence: False discoveries. Solution: Pre-registration of hypotheses, transparent reporting.
Tools and Calculators
| Tool | Purpose | Link |
|---|---|---|
| Evan Miller Calculator | Sample size | evanmiller.org |
| AB Test Guide | Duration | abtestguide.com |
| VWO Calculator | Significance | vwo.com |
| Optimizely Stats Engine | Sequential testing | Optimizely docs |
Conclusion
Statistical significance isn't about perfect math — it's about reducing the risk of bad decisions. Remember:
- p < 0.05 is standard, not absolute truth
- Sample size — calculate upfront
- Never cheat — peeking and p-hacking invalidate your tests
- Effect size matters — even statistically significant results can be practically meaningless
Action steps:
- Set up pre-registration process for experiments
- Use sample size calculator before every test
- Define stopping rules upfront
- Always report CI, not just p-value