Statistical Significance Demystified: What Growth Managers Need to Know
A/B test statistics don't have to be a nightmare. This guide explains key concepts without complex math — just what you actually need for sound decisions.
Why Statistics Matter
Imagine your A/B test shows +5% conversion rate for variant B. Great, right? But what if it's just random chance? What if next week the result is reversed?
Statistical significance tells you: How confident can we be that the result isn't random?
Without proper statistics:
- ❌ You implement changes that don't work
- ❌ You reject changes that do work
- ❌ You waste time and resources
- ❌ You lose trust in data
Core Concepts (No Math Required)
P-value: Probability of Chance
What it is: The p-value is the probability of seeing a difference this large (or larger) purely by chance, assuming no real difference exists.
Intuitive explanation:
- p = 0.05 means: "If the variants were truly identical, a result at least this large would show up only 5% of the time"
- p = 0.01 means: "If the variants were truly identical, a result at least this large would show up only 1% of the time"
Note the direction: the p-value does not say "there's a 5% chance the result is random." It assumes no real effect and asks how surprising your data would be.
Thresholds:
| P-value | Interpretation | Recommendation |
|---|---|---|
| p < 0.01 | Highly significant | Very confident result |
| p < 0.05 | Significant | Standard threshold |
| p < 0.10 | Marginally significant | Be cautious |
| p ≥ 0.10 | Not significant | Cannot conclude |
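For conversion-rate A/B tests, the p-value is usually computed with a two-proportion z-test. Here is a minimal sketch using only the Python standard library (the function name and example numbers are illustrative; the normal approximation is fine at typical A/B sample sizes):

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Example: 500/10,000 conversions (5.0%) vs 560/10,000 (5.6%)
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p = {p:.3f}")
```

A +12% relative lift on 10,000 users per variant lands just above the 0.05 threshold here, which is exactly why thresholds and sample sizes matter.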
Confidence Interval: Range of Possible Values
What it is: The range where the true value likely falls.
Example:
- Result: +5% conversion, 95% CI: [2%, 8%]
- Means: "We're 95% confident the true effect is between +2% and +8%"
Why it matters:
- If CI contains 0, result is not significant
- Narrow CI = more precise estimate
- Wide CI = you need more data
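The confidence interval for a difference in conversion rates can be computed the same way. A minimal sketch (illustrative names and numbers; 1.96 is the z-multiplier for 95% confidence):

```python
from math import sqrt

def diff_ci_95(conv_a, n_a, conv_b, n_b):
    """95% confidence interval for the difference in conversion
    rates (normal approximation, unpooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    margin = 1.96 * se          # z-multiplier for 95% confidence
    return diff - margin, diff + margin

lo, hi = diff_ci_95(500, 10_000, 560, 10_000)
print(f"95% CI for the lift: [{lo:+.4f}, {hi:+.4f}]")
```

In this example the interval dips just below zero, so the result is not significant at the 95% level, even though the point estimate looks like a solid lift.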
Statistical Power: Ability to Detect Effect
What it is: Probability that the test detects a real effect if it exists.
Why 80% power is standard:
- 80% power = 80% chance of detecting a real effect
- 20% chance you'll miss the effect (false negative)
Trade-off:
| Power | Sample size | Risk |
|---|---|---|
| 70% | Smaller | 30% false negatives |
| 80% | Medium | 20% false negatives (standard) |
| 90% | Larger | 10% false negatives |
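Power can be approximated directly for a two-proportion test. A stdlib-only sketch (function names are illustrative; `alpha_z=1.96` corresponds to a two-sided p < 0.05):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def approximate_power(p_base, p_variant, n_per_variant, alpha_z=1.96):
    """Approximate power of a two-proportion z-test: the chance of
    reaching significance if the true rates really are p_base / p_variant."""
    se = sqrt(p_base * (1 - p_base) / n_per_variant
              + p_variant * (1 - p_variant) / n_per_variant)
    z_effect = abs(p_variant - p_base) / se
    return norm_cdf(z_effect - alpha_z)

# 5% baseline, +10% relative lift, 31,000 users per variant
print(f"power ≈ {approximate_power(0.05, 0.055, 31_000):.0%}")
```

With those inputs the result comes out at roughly the standard 80%, which is where the conventional sample-size recommendations come from.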
Sample Size: How Many Users You Need
Main factors:
- Baseline conversion rate — lower = need more data
- Minimum detectable effect (MDE) — smaller effect = need more data
- Desired power — higher power = need more data
- Significance level — stricter threshold (e.g. p < 0.01 instead of p < 0.05) = need more data
Practical table (80% power, p<0.05):
| Baseline CR | MDE 10% relative | MDE 5% relative |
|---|---|---|
| 1% | ~160,000/variant | ~640,000/variant |
| 5% | ~31,000/variant | ~122,000/variant |
| 10% | ~15,000/variant | ~58,000/variant |
| 20% | ~6,500/variant | ~26,000/variant |
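These figures follow from the standard two-proportion formula n = (z_α/2 + z_β)² · (p₁(1−p₁) + p₂(1−p₂)) / (p₁−p₂)². A minimal calculator sketch (z-values hard-coded for a two-sided p < 0.05 and 80% power):

```python
from math import ceil

def sample_size_per_variant(p_base, mde_relative,
                            z_alpha=1.96, z_beta=0.84):
    """Users needed per variant for a two-proportion test.
    z_alpha = 1.96 (two-sided alpha 0.05), z_beta = 0.84 (80% power)."""
    p_var = p_base * (1 + mde_relative)      # variant rate at the MDE
    delta = p_var - p_base                   # absolute difference to detect
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

for base in (0.01, 0.05, 0.10, 0.20):
    print(f"{base:.0%} baseline: "
          f"{sample_size_per_variant(base, 0.10):>9,} (10% MDE)  "
          f"{sample_size_per_variant(base, 0.05):>9,} (5% MDE)")
```

Use a dedicated calculator for real planning, but this shows the mechanics: halving the MDE roughly quadruples the required sample.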
Practical Applications
When to End a Test?
Never end a test early just because you see a significant result!
Decision framework:
1. Has test reached planned sample size?
→ NO: Wait (even if result is significant)
→ YES: Continue to step 2
2. Is result statistically significant (p < 0.05)?
→ YES: Implement winner
→ NO: Continue to step 3
3. What does the confidence interval tell you?
→ CI is narrow and close to 0: Probably no meaningful effect
→ CI is wide: Need more data or a larger MDE
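The three-step framework above can be expressed as a small helper. This is an illustrative translation, not a library API; `meaningful_lift` is a hypothetical practical-significance bar you would set per business case:

```python
def decide_next_step(reached_planned_n, p_value, ci_low, ci_high,
                     meaningful_lift=0.05):
    """Illustrative translation of the three-step decision framework.
    meaningful_lift: smallest lift worth acting on (hypothetical value)."""
    if not reached_planned_n:
        return "keep running"                 # step 1: never stop early
    if p_value < 0.05:
        return "implement winner"             # step 2: significant result
    if ci_high < meaningful_lift:             # step 3: narrow CI near zero
        return "reject: no meaningful effect"
    return "inconclusive: need more data or a larger MDE"

print(decide_next_step(True, 0.45, -0.02, 0.04))   # narrow CI
print(decide_next_step(True, 0.15, -0.02, 0.12))   # wide CI
```

The two calls mirror the non-significant scenarios in the next section: a narrow interval hugging zero lets you reject, a wide one leaves you without a conclusion.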
How to Interpret Results
Scenario 1: Significant positive result
- Result: +8%, p=0.02, CI [3%, 13%]
- Interpretation: ✅ Implement the change
Scenario 2: Non-significant result, narrow CI
- Result: +1%, p=0.45, CI [-2%, 4%]
- Interpretation: Probably no meaningful effect, can reject
Scenario 3: Non-significant result, wide CI
- Result: +5%, p=0.15, CI [-2%, 12%]
- Interpretation: Inconclusive — need more data
5 Most Common Mistakes
❌ Mistake 1: Peeking Problem
Problem: Check results daily and stop when you see significance. Consequence: Up to 30% false positives! Solution: Define sample size upfront and don't change it.
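You can see the peeking problem in a quick A/A simulation: both variants share the same true rate, so every "significant" result is a false positive by construction. A sketch with illustrative parameters (stdlib only):

```python
import random
from math import sqrt

def simulate_peeking(n_per_variant=10_000, checks=20,
                     trials=200, base_rate=0.05, seed=42):
    """A/A test with repeated peeking: stop the first time any
    interim z-test crosses |z| > 1.96. Returns false positive rate."""
    random.seed(seed)
    step = n_per_variant // checks
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(checks):
            conv_a += sum(random.random() < base_rate for _ in range(step))
            conv_b += sum(random.random() < base_rate for _ in range(step))
            n += step
            pool = (conv_a + conv_b) / (2 * n)
            if pool in (0.0, 1.0):
                continue
            se = sqrt(pool * (1 - pool) * 2 / n)
            if abs(conv_b / n - conv_a / n) / se > 1.96:
                false_positives += 1          # declared a winner on pure noise
                break
    return false_positives / trials

rate = simulate_peeking()
print(f"False positive rate with 20 peeks: {rate:.0%}")
```

With no peeking the false positive rate would sit at the nominal 5%; checking 20 times and stopping at the first significant reading inflates it several-fold.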
❌ Mistake 2: Multiple Comparisons
Problem: Test 10 variants and celebrate the one significant one. Consequence: With 10 variants, ~40% chance of false positive. Solution: Bonferroni correction or single primary metric.
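The ~40% figure is straightforward to verify: with k independent tests at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^k. A two-line check, plus the Bonferroni-adjusted threshold:

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(f"10 variants at p<0.05: {family_wise_error(0.05, 10):.0%} "
      f"chance of a false positive")
print(f"Bonferroni-corrected threshold: p < {0.05 / 10}")
```

Bonferroni simply divides the threshold by the number of comparisons, so each of the 10 variants must clear p < 0.005 for the family-wise error to stay near 5%.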
❌ Mistake 3: Underpowered Tests
Problem: Test with too small sample size. Consequence: Most real effects remain undetected. Solution: Sample size calculation upfront.
❌ Mistake 4: Ignoring Effect Size
Problem: Focus only on p-value, not effect magnitude. Consequence: +0.1% can be "significant" with enough data. Solution: Always look at CI and practical significance.
❌ Mistake 5: P-hacking
Problem: Try different segments and metrics until you find significance. Consequence: False discoveries. Solution: Pre-registration of hypotheses, transparent reporting.
Tools and Calculators
| Tool | Purpose | Link |
|---|---|---|
| Evan Miller Calculator | Sample size | evanmiller.org |
| AB Test Guide | Duration | abtestguide.com |
| VWO Calculator | Significance | vwo.com |
| Optimizely Stats Engine | Sequential testing | Optimizely docs |
Conclusion
Statistical significance isn't about perfect math — it's about reducing the risk of bad decisions. Remember:
- p < 0.05 is standard, not absolute truth
- Sample size — calculate upfront
- Never cheat — peeking and p-hacking invalidate your tests
- Effect size matters — even statistically significant results can be practically meaningless
Action steps:
- Set up pre-registration process for experiments
- Use sample size calculator before every test
- Define stopping rules upfront
- Always report CI, not just p-value