A/B Testing in Practice: Complete Technical Guide for Product Teams
A/B testing is a fundamental tool for every product team. But there is a huge gap between "let's run an A/B test" and a truly valid experiment. Most teams make fundamental mistakes, from poorly formulated hypotheses and premature evaluation to ignoring statistical significance.
In this guide, I'll walk through the entire process from A to Z. Technically, practically, and without unnecessary simplification.
Why Most A/B Tests Fail
According to Optimizely data, only 10–20% of A/B tests bring a statistically significant positive result. That's not a problem — that's the reality of experimentation. The problem occurs when teams:
- Test without a clear hypothesis — "let's try a blue button" isn't a hypothesis
- Evaluate too early — the famous "peeking" problem
- Ignore sample size — a test on 200 users has no statistical power
- Test too many variants — diluting traffic among 5 variants
- Don't account for external factors — seasonality, marketing campaigns, outages
Let's go through how to do it right.
Step 1: Hypothesis Formulation
Every experiment starts with a clear, testable hypothesis. I use this format:
"If [change], then [expected result], because [justification based on data/insights]."
Examples of well-formulated hypotheses:
- "If we add a progress bar to the onboarding flow, the completion rate will increase by 15%, because session recordings show users leave when they don't know how many steps remain."
- "If we move the CTA button above the fold on the pricing page, CTR will rise by 10%, because heatmaps show 60% of users don't scroll below the fold."
Bad hypothesis: "We'll change the button color and see what happens." It lacks both a justification and an expected result.
Pro-tip: Before formulating a hypothesis, always conduct qualitative and quantitative research. The hypothesis should be based on data (analytics, heatmaps, user interviews), not intuition.
Step 2: Sample Size Calculation
This is the step most teams skip — and it's a fundamental mistake. You need to know how many users must go through the test to achieve statistically valid results.
Key parameters for calculation:
- Baseline conversion rate — current conversion (e.g., 5%)
- Minimum Detectable Effect (MDE) — smallest change you want to detect (e.g., 10% relative improvement = from 5% to 5.5%)
- Statistical significance level (α) — typically 5% (α = 0.05), i.e., a 95% confidence level
- Statistical power (1-β) — typically 80% (β = 0.20)
For baseline conversion rate of 5% and MDE of 10%, you need approximately 31,000 users per variant (so 62,000 total for an A/B test).
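This calculation can be reproduced with the standard two-proportion sample-size formula. A minimal sketch in Python (stdlib only); the function name and defaults are illustrative:

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(n) + 1

# 5% baseline, 10% relative MDE (5.0% -> 5.5%)
print(sample_size_per_variant(0.05, 0.10))  # ≈ 31,000 per variant
```

Note how sensitive the result is to the MDE: halving the detectable effect roughly quadruples the required sample size, which is why low-traffic teams should test bigger changes.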
How to Calculate
Use one of these tools:
- Evan Miller's Calculator — simple and reliable online calculator
- Statsig's Power Calculator — advanced calculator with visualization
- Optimizely Stats Engine — automatic calculation directly in the platform
Pro-tip: If you don't have sufficient traffic to detect small changes, increase MDE. It's better to reliably detect only large changes than to try catching small changes with insufficient statistical power.
Step 3: Experiment Design
Randomization and Segmentation
Proper randomization is the foundation of a valid experiment:
- User-level randomization — each user is assigned to one variant and stays in it throughout the test
- Sticky bucketing — ensure user always sees the same variant, even on repeated visits
- Exclude internal users — exclude employees and test accounts
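Deterministic hash-based bucketing is one common way to get user-level randomization with sticky assignment. A sketch, where the experiment name is hypothetical:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")):
    """Deterministically map a user to a variant: same input, same bucket."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    # even split: buckets 0-49 -> control, 50-99 -> treatment
    return variants[bucket * len(variants) // 100]

# Sticky: repeated calls always return the same variant for a user
assert assign_variant("user-42", "pricing-cta-test") == assign_variant("user-42", "pricing-cta-test")

# Roughly even split across many users
counts = {}
for i in range(10000):
    v = assign_variant(f"user-{i}", "pricing-cta-test")
    counts[v] = counts.get(v, 0) + 1
print(counts)
```

Keying the hash on both user ID and experiment name ensures the same user can land in different buckets across different experiments, avoiding correlated assignments.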
Guardrail Metrics
Besides the primary metric, always track guardrail metrics — metrics that must not deteriorate:
- Page load time — deterioration of more than 100ms is unacceptable
- Error rate — experiment must not generate technical errors
- Revenue per user — even when testing engagement, you must not sacrifice revenue
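Guardrail checks are easy to automate. A minimal sketch, assuming aggregated per-variant metrics; all names, thresholds, and figures below are illustrative:

```python
GUARDRAILS = [
    # (metric name, max allowed degradation, higher_is_worse)
    ("p95_load_time_ms", 100, True),
    ("error_rate", 0.0, True),
    ("revenue_per_user", 0.0, False),
]

def guardrail_violations(control: dict, variant: dict) -> list:
    """Return (metric, delta) pairs where the variant regressed past tolerance."""
    violations = []
    for name, tolerance, higher_is_worse in GUARDRAILS:
        delta = variant[name] - control[name]
        worse = delta > tolerance if higher_is_worse else -delta > tolerance
        if worse:
            violations.append((name, delta))
    return violations

control = {"p95_load_time_ms": 820, "error_rate": 0.002, "revenue_per_user": 4.10}
variant = {"p95_load_time_ms": 990, "error_rate": 0.002, "revenue_per_user": 4.15}
print(guardrail_violations(control, variant))
# → [('p95_load_time_ms', 170)]  (load time regressed past the 100 ms budget)
```

A check like this belongs in the experiment's monitoring so a regression halts the test automatically instead of surfacing in the final analysis.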
Step 4: Launch and Monitoring
Rule Number One: Don't Peek
The peeking problem is the most common mistake in A/B testing: you check results every day and stop the test as soon as you see a "significant" result. The problem is that with repeated hypothesis testing, the false positive probability climbs well above the nominal 5%.
Solutions:
- Set test duration in advance based on sample size calculation
- Use sequential testing — methods such as always-valid p-values (e.g., mSPRT), which are designed for ongoing evaluation. (CUPED, by contrast, is a variance-reduction technique: it shrinks required sample sizes but is not a fix for peeking.)
- Minimum test duration — always at least 1 full business cycle (typically 1–2 weeks)
Common Pitfalls During Experiment Runtime
- Novelty effect — users react positively to novelty, but effect fades over time. Solution: Run test long enough (minimum 2–4 weeks)
- Day-of-week effect — user behavior differs weekdays vs. weekends. Solution: Test should run full weeks
- Simpson's paradox — aggregated data shows opposite trend than data in individual segments. Solution: Analyze results by segments too (device, country, user type)
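Simpson's paradox is easiest to see with numbers. In the made-up data below, variant B beats A on both mobile and desktop, yet loses overall, because B's traffic skews toward the low-converting mobile segment:

```python
# (conversions, users) per segment; all figures are illustrative
data = {
    "A": {"mobile": (50, 1000), "desktop": (400, 4000)},
    "B": {"mobile": (240, 4000), "desktop": (110, 1000)},
}

def rate(conv, users):
    return conv / users

for segment in ("mobile", "desktop"):
    ra = rate(*data["A"][segment])
    rb = rate(*data["B"][segment])
    print(f"{segment}: A {ra:.1%} vs B {rb:.1%}")  # B wins in each segment

overall = {
    v: rate(sum(c for c, _ in segs.values()), sum(n for _, n in segs.values()))
    for v, segs in data.items()
}
print(f"overall: A {overall['A']:.1%} vs B {overall['B']:.1%}")  # yet A wins overall
```

Proper randomization should prevent such skewed traffic mixes within a single experiment, but the paradox routinely appears when comparing across experiments or time periods, which is why segment-level analysis matters.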
Step 5: Results Analysis
Statistical Significance
To evaluate a test you need:
- P-value < 0.05 — the probability of seeing a difference at least this large, assuming there is no real effect, is under 5% (not "the probability the result is due to chance")
- 95% Confidence Interval — a range constructed so that, across repeated experiments, 95% of such intervals contain the true effect
- Effect size — magnitude of observed effect (not just whether significant, but how large)
Example interpretation: "Variant B increased conversion rate by 8.3% (95% CI: 3.1%–13.5%, p = 0.002). Result is statistically significant and practically meaningful."
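A readout like this can be produced with a standard two-proportion z-test. A minimal sketch (stdlib only, with made-up counts), using a pooled standard error for the test and an unpooled one for the interval:

```python
import math
from statistics import NormalDist

def analyze(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test with a Wald CI on the absolute difference."""
    pa, pb = conv_a / n_a, conv_b / n_b
    diff = pb - pa
    # pooled SE for the hypothesis test, unpooled SE for the CI
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    se_ci = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)
    return diff, ci, p_value

# illustrative counts: 500/10,000 vs 565/10,000 conversions
diff, (lo, hi), p = analyze(500, 10000, 565, 10000)
print(f"lift: {diff:.2%} (95% CI {lo:.2%} to {hi:.2%}), p = {p:.3f}")
```

Report all three numbers together: the p-value alone says nothing about whether the effect is large enough to matter.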
Segmentation Analysis
After overall analysis, always do breakdown by key segments:
- Device type — mobile vs. desktop results often differ dramatically
- New vs. returning users — new users react differently than existing ones
- Geography — cultural differences affect behavior
- Traffic source — organic vs. paid traffic has different patterns
A/B Testing Tool Comparison
The right tool choice depends on team size, technical capabilities, and budget.
LaunchDarkly
- Strength: Feature flags first, experiments second. Ideal for engineering-driven teams
- Weakness: Less advanced statistical methods
- Suitable for: Teams that need feature flags and experiments in one tool
Statsig
- Strength: Automatic CUPED adjustments, warehouse-native integration, excellent statistics
- Weakness: Steeper learning curve
- Suitable for: Data-driven product teams with their own data warehouse
GrowthBook
- Strength: Open-source, Bayesian and frequentist statistics, connection to own data
- Weakness: Requires more technical setup
- Suitable for: Teams with limited budget or need for full data control
Optimizely
- Strength: Most robust platform, Stats Engine eliminates peeking problem
- Weakness: Highest price, can be overengineered for small teams
- Suitable for: Enterprise teams with high traffic and budget
Multi-armed Bandits vs. Classic A/B Test
A classic A/B test splits traffic evenly between variants. Multi-armed bandit algorithms dynamically shift traffic toward better-performing variants.
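The bandit idea can be sketched in a few lines with Beta-Bernoulli Thompson sampling, one common bandit algorithm; the conversion rates below are made up:

```python
import random

random.seed(7)

def thompson_sampling(true_rates, n_users):
    """Beta-Bernoulli Thompson sampling over len(true_rates) variants."""
    k = len(true_rates)
    alpha = [1] * k   # successes + 1 (uniform prior)
    beta = [1] * k    # failures + 1
    shown = [0] * k
    for _ in range(n_users):
        # sample a plausible conversion rate per arm, show the best-looking one
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = samples.index(max(samples))
        shown[arm] += 1
        if random.random() < true_rates[arm]:  # simulated conversion
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return shown

traffic = thompson_sampling([0.04, 0.05, 0.07], 20000)
print(traffic)  # most traffic ends up on the 7% arm
```

The trade-off is visible in the output: the winning arm absorbs most traffic (less regret), but the losing arms accumulate few observations, so you get weak estimates of exactly how much worse they are.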
When to Use Classic A/B Test
- You need precise statistical conclusions
- You want to understand effect size with confidence interval
- Test runs long enough to achieve statistical power
When to Use Multi-armed Bandits
- You're optimizing for revenue, and every day spent on a worse variant costs money
- You have many variants (more than 3–4)
- You don't need precise statistics; you just want to find the "best" variant
- Typical use case: optimizing headlines, images, price points
Conclusion: A/B Testing Process Checklist
Before each experiment, go through this checklist:
- Hypothesis — is it clearly formulated with expected outcome and justification?
- Primary metric — is one key metric defined?
- Guardrail metrics — do you know what must not deteriorate?
- Sample size — do you have sufficient traffic for statistical power?
- Test duration — how many days/weeks will test run?
- Segments — what segments will you analyze?
- Success criteria — what must happen for you to implement the variant?
A/B testing is a science, not an art. Follow the process, respect statistics, and document learnings — even from unsuccessful experiments.