Who is this guide for?

This guide is designed for beginner-level users and takes about 1 minute to read.


Statistics for A/B Testing: Confidence Intervals and Sample Sizes

Proper statistical methodology is essential for valid A/B test results. This guide covers confidence intervals, sample size calculation, and common statistical mistakes that lead to false conclusions.

Key Takeaways

Without proper statistical rigor, A/B tests can lead to false conclusions.
The p-value is the probability of seeing results at least as extreme as the observed data, assuming there's no real difference.
Before starting a test, calculate the required sample size.

Why Statistics Matter for A/B Tests

Without proper statistical rigor, A/B tests can lead to false conclusions. An observed increase in conversion rate might be random variation rather than a real improvement. Statistical tests quantify how likely a difference that large would be if chance alone were at work.

Key Concepts

P-Value

The p-value is the probability of seeing results at least as extreme as the observed data, assuming there's no real difference. A p-value below 0.05 is conventionally considered statistically significant.
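As an illustration of the definition above, the following is a minimal sketch of one common way to get a p-value for conversion data: a two-sided two-proportion z-test. The function name and the visitor counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if there is no real difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a z-score at least this far from zero, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 300 of 10,000 visitors convert on A, 345 of 10,000 on B.
p = two_proportion_p_value(300, 10_000, 345, 10_000)
print(f"p-value: {p:.3f}")
```

With these made-up counts the p-value lands above 0.05, so a 15% relative lift on 10,000 visitors per variant would not count as significant by the usual convention.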

Confidence Interval

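A 95% confidence interval means that if you repeated the experiment many times, 95% of the intervals would contain the true value (for example, the true difference in conversion rates). Wider intervals indicate a less precise estimate, typically because the sample is small.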
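A confidence interval for the difference between two conversion rates can be sketched as follows. This uses the simple normal-approximation (Wald) interval, which is one of several possible choices. The function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation interval for (rate of B) minus (rate of A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical counts: 300 of 10,000 visitors convert on A, 345 of 10,000 on B.
low, high = diff_confidence_interval(300, 10_000, 345, 10_000)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

For these counts the interval spans zero, so the data are compatible with no real difference. Running the same rates on a tenth of the traffic produces a wider interval.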

Statistical Power

Power is the probability of detecting a real effect when one exists. An underpowered test (typically below 80% power) may miss real improvements.
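To make the idea concrete, here is a rough sketch of approximate power for a two-sided two-proportion z-test with equal traffic per variant. The function name and input rates are assumptions for illustration. The formula is the standard normal approximation and ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_a, p_b, n_per_variant, alpha=0.05):
    """Approximate power to detect a true difference between rates p_a and p_b."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p_a + p_b) / 2
    # Standard error if there were no difference (used for the significance cutoff).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)
    # Standard error under the assumed true rates.
    se_alt = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_variant)
    return nd.cdf((abs(p_b - p_a) - z_alpha * se_null) / se_alt)


# Hypothetical scenario: true rates of 3.0% and 3.3%, 10,000 visitors per variant.
print(f"power: {power_two_proportions(0.03, 0.033, 10_000):.2f}")
```

Under these assumptions the test is well below the 80% convention, which means a real 10% relative lift would usually go undetected at that traffic level.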

Sample Size Calculation

Before starting a test, calculate the required sample size. Key inputs:

Baseline conversion rate: Your current rate (e.g., 3%).
Minimum detectable effect: The smallest improvement worth detecting (e.g., 10% relative, meaning a lift from 3% to 3.3%).
Significance level: Usually alpha = 0.05, often described as 95% confidence.
Statistical power: Usually 80% (beta = 0.20, the probability of missing a real effect).
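The four inputs above can be combined into a required sample size per variant. The sketch below uses the usual normal-approximation formula for comparing two proportions with equal traffic split. The function name is hypothetical, and dedicated calculators may differ slightly depending on the approximation they use.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in each variant for a two-sided two-proportion z-test."""
    p_a = baseline
    p_b = baseline * (1 + relative_mde)  # rate after the minimum detectable lift
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)


# The example inputs from the list: 3% baseline, 10% relative lift.
n = sample_size_per_variant(0.03, 0.10)
print(f"visitors per variant: {n:,}")
```

With the example inputs this comes out at roughly 53,000 visitors per variant. Small baseline rates combined with small relative effects are what drive the number that high. Doubling the minimum detectable effect cuts the requirement to roughly a quarter.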

Common Mistakes

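Peeking: Repeatedly checking results before reaching the required sample size, and acting on what you see, inflates the false positive rate.
Multiple comparisons: Testing many metrics simultaneously without correction raises the chance that at least one looks significant by chance alone.
Stopping early: Ending a test as soon as results look significant, rather than at the planned sample size, inflates false positives in the same way as peeking.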
Ignoring segments: An overall neutral result may hide positive and negative effects in different user segments.
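The multiple-comparisons problem from the list above can be shown with a little arithmetic. The sketch below assumes the metrics are independent, which is a simplification, and uses the Bonferroni correction as one common remedy. The guide itself does not name a specific correction, and both function names are hypothetical.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false positive rate near alpha."""
    return alpha / num_tests


for k in (1, 5, 10):
    print(
        f"{k:>2} metrics: "
        f"uncorrected risk {family_wise_error_rate(k):.1%}, "
        f"Bonferroni threshold {bonferroni_alpha(k):.4f}"
    )
```

With 10 independent metrics each tested at 0.05, the chance of at least one spurious "win" is about 40%. Bonferroni is deliberately conservative, so it trades some power for that protection.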
