Statistical Significance Demystified: What Growth Managers Need to Know

A/B test statistics don't have to be a nightmare. This guide explains key concepts without complex math — just what you actually need for sound decisions.

Why Statistics Matter

Imagine your A/B test shows +5% conversion rate for variant B. Great, right? But what if it's just random chance? What if next week the result is reversed?

Statistical significance tells you: How confident can we be that the result isn't random?

Without proper statistics:

  • ❌ You implement changes that don't work
  • ❌ You reject changes that do work
  • ❌ You waste time and resources
  • ❌ You lose trust in data

Core Concepts (No Math Required)

P-value: Probability of Chance

What it is: P-value tells you the probability of seeing such a (or larger) difference purely by chance, if in reality no difference existed.

Intuitive explanation:

  • p = 0.05 means: "If there were truly no difference, we'd see a result this large only 5% of the time"
  • p = 0.01 means: "If there were truly no difference, we'd see a result this large only 1% of the time"

Note: this is not quite the same as "a 5% chance the result is random" — the p-value assumes no real difference exists and asks how surprising your data would be.

Thresholds:

| P-value | Interpretation | Recommendation |
| --- | --- | --- |
| p < 0.01 | Highly significant | Very confident result |
| p < 0.05 | Significant | Standard threshold |
| p < 0.10 | Marginally significant | Be cautious |
| p ≥ 0.10 | Not significant | Cannot conclude |
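For a conversion A/B test, the p-value typically comes from a two-proportion z-test. Here is a minimal Python sketch using only the standard library (the function name `two_proportion_pvalue` and the example numbers are illustrative):

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under "no difference"
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Example: 5.0% vs 5.5% conversion with 20,000 users per variant
p = two_proportion_pvalue(1000, 20000, 1100, 20000)
```

For small samples or very low conversion rates, dedicated tools (or an exact test) are more reliable than this normal approximation.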

Confidence Interval: Range of Possible Values

What it is: The range where the true value likely falls.

Example:

  • Result: +5% conversion, 95% CI: [2%, 8%]
  • Means: "We're 95% confident the true effect is between +2% and +8%"

Why it matters:

  • If CI contains 0, result is not significant
  • Narrow CI = more precise estimate
  • Wide CI = you need more data
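A 95% confidence interval for the difference in conversion rates can be sketched the same way (`diff_ci_95` is a hypothetical helper name; 1.96 is the z-value for 95% confidence):

```python
import math

def diff_ci_95(conv_a, n_a, conv_b, n_b):
    """95% confidence interval for the difference in conversion rates (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    margin = 1.96 * se  # z-value for 95% confidence
    return diff - margin, diff + margin

# 5.0% vs 5.5% conversion with 20,000 users per variant
lo, hi = diff_ci_95(1000, 20000, 1100, 20000)
# If the interval excludes 0, the result is significant at the 5% level
```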

Statistical Power: Ability to Detect Effect

What it is: Probability that the test detects a real effect if it exists.

Why 80% power is standard:

  • 80% power = 80% chance of detecting a real effect
  • 20% chance you'll miss the effect (false negative)

Trade-off:

| Power | Sample size | Risk |
| --- | --- | --- |
| 70% | Smaller | 30% false negatives |
| 80% | Medium | 20% false negatives (standard) |
| 90% | Larger | 10% false negatives |
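The same normal approximation gives a rough power estimate for a planned test. This is a sketch, not a substitute for a proper calculator; `power` and its argument names are illustrative:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(p_base, mde_rel, n_per_variant, z_alpha=1.96):
    """Approximate power of a two-proportion test (two-sided alpha = 0.05)."""
    p_b = p_base * (1 + mde_rel)  # variant rate if the effect is real
    delta = p_b - p_base
    se = math.sqrt(p_base * (1 - p_base) / n_per_variant
                   + p_b * (1 - p_b) / n_per_variant)
    return normal_cdf(abs(delta) / se - z_alpha)
```

Doubling the sample size always raises power, but with diminishing returns near 100%.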

Sample Size: How Many Users You Need

Main factors:

  1. Baseline conversion rate — lower = need more data
  2. Minimum detectable effect (MDE) — smaller effect = need more data
  3. Desired power — higher power = need more data
  4. Significance level — lower p-value = need more data

Practical table (80% power, p<0.05):

| Baseline CR | MDE 10% relative | MDE 5% relative |
| --- | --- | --- |
| 1% | ~160,000/variant | ~640,000/variant |
| 5% | ~31,000/variant | ~122,000/variant |
| 10% | ~15,000/variant | ~58,000/variant |
| 20% | ~6,500/variant | ~26,000/variant |
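The standard closed-form approximation behind such calculators can be sketched in a few lines of Python (`sample_size` is a hypothetical helper; its results should roughly match tools like Evan Miller's calculator):

```python
import math

def sample_size(p_base, mde_rel, z_alpha=1.96, z_beta=0.8416):
    """Users per variant for a two-proportion test.

    Defaults: two-sided alpha = 0.05 (z = 1.96), power = 80% (z = 0.8416).
    """
    p_b = p_base * (1 + mde_rel)
    var_sum = p_base * (1 - p_base) + p_b * (1 - p_b)
    delta = p_b - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / delta ** 2)

# 5% baseline conversion, 10% relative MDE
n = sample_size(0.05, 0.10)
```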

Practical Applications

When to End a Test?

Never end a test early just because you see a significant result!

Decision framework:

1. Has test reached planned sample size?
   → NO: Wait (even if result is significant)
   → YES: Continue to step 2

2. Is result statistically significant (p < 0.05)?
   → YES: Implement winner
   → NO: Continue to step 3

3. What does the confidence interval tell you?
   → CI is narrow and close to 0: Probably no effect
   → CI is wide: Need more data or larger MDE
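The framework above can be sketched as a small helper. The CI thresholds for "narrow and close to 0" are purely illustrative; tune them to your own minimum practical effect:

```python
def decide(reached_sample_size, p_value, ci_low, ci_high, alpha=0.05):
    """Sketch of the decision framework above (hypothetical helper)."""
    if not reached_sample_size:
        return "keep running"  # even if the result looks significant today
    if p_value < alpha:
        return "implement winner"
    if -0.01 < ci_low and ci_high < 0.01:  # narrow CI around 0 (illustrative bounds)
        return "probably no effect"
    return "inconclusive: need more data or larger MDE"
```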

How to Interpret Results

Scenario 1: Significant positive result

  • Result: +8%, p=0.02, CI [3%, 13%]
  • Interpretation: ✅ Implement the change

Scenario 2: Non-significant result, narrow CI

  • Result: +1%, p=0.45, CI [-2%, 4%]
  • Interpretation: Probably no meaningful effect, can reject

Scenario 3: Non-significant result, wide CI

  • Result: +5%, p=0.15, CI [-2%, 12%]
  • Interpretation: Inconclusive — need more data

5 Most Common Mistakes

❌ Mistake 1: Peeking Problem

  • Problem: Check results daily and stop when you see significance.
  • Consequence: Up to 30% false positives!
  • Solution: Define sample size upfront and don't change it.

❌ Mistake 2: Multiple Comparisons

  • Problem: Test 10 variants and celebrate the one significant one.
  • Consequence: With 10 variants, ~40% chance of a false positive.
  • Solution: Bonferroni correction or a single primary metric.
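A Bonferroni correction is one line of code: divide the significance threshold by the number of comparisons (`bonferroni` is an illustrative helper):

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which comparisons stay significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # e.g. 0.005 for 10 variants
    return [p < threshold for p in p_values]

# 10 variants, one with p = 0.03: significant before correction, not after
flags = bonferroni([0.03, 0.40, 0.55, 0.12, 0.71, 0.25, 0.88, 0.09, 0.33, 0.61])
```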

❌ Mistake 3: Underpowered Tests

  • Problem: Test with too small a sample size.
  • Consequence: Most real effects remain undetected.
  • Solution: Calculate the sample size upfront.

❌ Mistake 4: Ignoring Effect Size

  • Problem: Focus only on the p-value, not the effect magnitude.
  • Consequence: +0.1% can be "significant" with enough data.
  • Solution: Always look at the CI and practical significance.

❌ Mistake 5: P-hacking

  • Problem: Try different segments and metrics until you find significance.
  • Consequence: False discoveries.
  • Solution: Pre-registration of hypotheses, transparent reporting.

Tools and Calculators

| Tool | Purpose | Link |
| --- | --- | --- |
| Evan Miller Calculator | Sample size | evanmiller.org |
| AB Test Guide | Duration | abtestguide.com |
| VWO Calculator | Significance | vwo.com |
| Optimizely Stats Engine | Sequential testing | Optimizely docs |

Conclusion

Statistical significance isn't about perfect math — it's about reducing the risk of bad decisions. Remember:

  1. p < 0.05 is standard, not absolute truth
  2. Sample size — calculate upfront
  3. Never cheat — peeking and p-hacking invalidate your tests
  4. Effect size matters — even statistically significant results can be practically meaningless

Action steps:

  1. Set up pre-registration process for experiments
  2. Use sample size calculator before every test
  3. Define stopping rules upfront
  4. Always report CI, not just p-value
