Statistics for A/B Testing: Confidence Intervals and Sample Sizes

Proper statistical methodology is essential for valid A/B test results. This guide covers confidence intervals, sample size calculation, and common statistical mistakes that lead to false conclusions.

Key Takeaways

  • Without proper statistical rigor, A/B tests can lead to false conclusions.
  • The p-value is the probability of seeing results at least as extreme as the observed data, assuming there's no real difference.
  • Before starting a test, calculate the required sample size.

Why Statistics Matter for A/B Tests

Without proper statistical rigor, A/B tests can lead to false conclusions. A conversion rate increase might be random variation, not a real improvement. Statistical tests quantify how likely an observed difference would be if chance alone were at work, so you can limit how often you act on noise.

Key Concepts

P-Value

The p-value is the probability of seeing results at least as extreme as the observed data, assuming there's no real difference. A p-value below 0.05 is conventionally considered statistically significant.
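As an illustration, the p-value for a difference in conversion rates can be approximated with a two-proportion z-test. This is a minimal sketch using only the Python standard library. The function name and the example counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if there is no real difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 300/10,000 conversions for control, 345/10,000 for the variant.
p = two_proportion_p_value(300, 10_000, 345, 10_000)
print(f"p-value: {p:.3f}")
```

With these made-up counts the variant looks 15% better in relative terms, yet the p-value stays above 0.05, so the difference would not be called significant by the usual convention.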

Confidence Interval

A 95% confidence interval means that if you repeated the experiment many times, 95% of the intervals would contain the true value. Wider intervals indicate less certainty.
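One common way to compute such an interval for the difference between two conversion rates is the Wald (normal approximation) interval. The sketch below assumes large samples, and the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate_b - rate_a)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each arm contributes its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical data: 300/10,000 conversions for control, 345/10,000 for the variant.
low, high = diff_confidence_interval(300, 10_000, 345, 10_000)
print(f"95% CI for the absolute difference: [{low:.4f}, {high:.4f}]")
```

An interval that spans zero, as it does for these made-up counts, means the data are compatible with no difference at all. Running the same rates on a tenth of the traffic produces a wider interval.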

Statistical Power

Power is the probability of detecting a real effect when one exists. An underpowered test (typically below 80% power) may miss real improvements.
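To get a feel for what "underpowered" means, power can be approximated for a planned test from the two assumed rates and the sample size per variant. This is a rough sketch under the normal approximation. It ignores the negligible chance of a significant result in the wrong direction, and the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p_control, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference when the assumed rates are the true rates.
    se = sqrt(
        (p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_variant
    )
    effect = abs(p_variant - p_control)
    # Chance that the observed z-score clears the significance threshold.
    return norm.cdf(effect / se - z_alpha)


# Hypothetical plan: 3% baseline, 3.3% variant, 10,000 users per variant.
power = approximate_power(0.03, 0.033, 10_000)
print(f"Approximate power: {power:.0%}")
```

Under these assumptions, 10,000 users per variant gives well under 80% power, so a real lift of this size would usually go undetected.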

Sample Size Calculation

Before starting a test, calculate the required sample size. Key inputs:

  • Baseline conversion rate: Your current rate (e.g., 3%).
  • Minimum detectable effect: The smallest improvement worth detecting (e.g., 10% relative, which moves a 3% baseline to 3.3%).
  • Significance level: Usually alpha = 0.05, which corresponds to 95% confidence.
  • Statistical power: Usually 80% (beta = 0.20).
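The four inputs above can be combined with the standard normal-approximation formula for comparing two proportions. The sketch below is one common variant of that formula (online calculators may differ slightly), and the function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    norm = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the variant must reach
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = norm.inv_cdf(power)  # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)


# Inputs from the list above: 3% baseline, 10% relative lift, alpha 0.05, power 0.80.
n = sample_size_per_variant(baseline=0.03, relative_mde=0.10)
print(f"Required sample size per variant: {n:,}")
```

For the example inputs this comes to roughly 53,000 users per variant. The required size grows with the inverse square of the effect, so halving the minimum detectable effect roughly quadruples the traffic needed.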

Common Mistakes

  1. Peeking: Checking results before reaching the required sample size inflates false positive rates.
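  2. Multiple comparisons: Testing many metrics simultaneously without correction raises the chance that at least one looks significant by luck. With 20 independent metrics at alpha = 0.05, that chance is about 64%.
  3. Stopping early: Ending a test as soon as results look significant locks in whatever random swing happened to cross the threshold, which tends to overstate the effect.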
  4. Ignoring segments: An overall neutral result may hide positive and negative effects in different user segments.
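The effect of peeking and early stopping can be illustrated with a small simulation of A/A tests, where both arms share the same true rate and every "significant" result is a false positive. This is a sketch with made-up parameters (10% conversion rate, 2,000 users per arm, a look every 200 users), not a model of any particular testing tool.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_sims=500, n_per_arm=2000, look_every=200,
                     rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for i in range(1, n_per_arm + 1):
            # Both arms draw from the same true rate, so there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % look_every == 0 and p_value(conv_a, i, conv_b, i) < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        final_hits += p_value(conv_a, n_per_arm, conv_b, n_per_arm) < alpha
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = simulate_peeking()
print(f"False positives when stopping at any significant look: {peeking_rate:.1%}")
print(f"False positives when testing once at the end: {final_rate:.1%}")
```

Testing once at the planned sample size should keep false positives near the nominal 5%, while declaring a winner at the first significant look pushes the rate well above it.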
