Why is stopping an A/B test at first significance problematic?

Curated by the Tezvyn team | June 16, 2026 | Source: docs.growthbook.io | intermediate

Tests peeking and Type I error inflation. Name peeking; explain that daily looks inflate the false positive rate above the nominal alpha; note that p-values assume a single look at a fixed sample size; recommend pre-committed runtimes or sequential testing.

WHAT THIS TESTS: Whether the candidate understands that frequentist statistical inference in A/B testing relies on a strict experimental protocol. Specifically, it tests if they know that repeatedly inspecting results and stopping at the first significant p-value invalidates the false positive rate guarantee. The interviewer wants to see familiarity with peeking, optional stopping, and why p-values are only valid under a pre-specified sample size and analysis plan.

A GOOD ANSWER COVERS: Four things in order. First, name the issue: this is the peeking problem, also called optional stopping. Second, explain the mechanism: every time you look at the data, you get another chance to observe a statistically significant result from random noise alone. At a standard 5% significance level, a single analysis has a 5% false positive rate, but checking daily for 20 days and stopping at the first significant result pushes the cumulative Type I error rate to roughly 25%. The looks are correlated because each one reuses all the earlier data, which is why the figure stays below the 1 - 0.95^20, or about 64%, that 20 fully independent tests would give. Third, clarify the statistical foundation: frequentist p-values and confidence intervals are computed assuming one look at the end of the experiment with a fixed sample size. Breaking that assumption means the reported p-value no longer reflects the probability, under the null hypothesis, of seeing a result at least this extreme at some point during the test. Fourth, offer a solution: pre-commit to a sample size and runtime before launch, do not stop early for significance, and if interim looks are truly necessary, use a sequential testing framework or alpha spending function to preserve the overall error rate.
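The mechanism can be checked with a small Monte Carlo simulation. The sketch below is illustrative only: it assumes an A/A test (no true effect), a unit-variance normal metric, and a known-variance z-test, and the function name, `users_per_look`, and `n_sims` are hypothetical choices, not anything from the source.

```python
import numpy as np

def simulate_peeking(n_looks=20, users_per_look=200, n_sims=4000,
                     z_crit=1.96, seed=0):
    """Estimate false positive rates for an A/A test (no true effect).

    Returns (rate when stopping at the first significant look,
             rate when analysing once at the final look).
    """
    rng = np.random.default_rng(seed)
    # The sum of m iid N(0, 1) observations is N(0, m), so each look's new
    # data is drawn as one normal value per arm and then accumulated.
    scale = np.sqrt(users_per_look)
    sum_a = rng.normal(0.0, scale, size=(n_sims, n_looks)).cumsum(axis=1)
    sum_b = rng.normal(0.0, scale, size=(n_sims, n_looks)).cumsum(axis=1)
    n = users_per_look * np.arange(1, n_looks + 1)  # users per arm at each look
    # z statistic for the difference in means with known unit variance:
    # (mean_b - mean_a) / sqrt(2 / n) == (sum_b - sum_a) / sqrt(2 * n)
    z = (sum_b - sum_a) / np.sqrt(2 * n)
    significant = np.abs(z) > z_crit
    peeking_rate = significant.any(axis=1).mean()   # significant at any look
    fixed_rate = significant[:, -1].mean()          # significant at last look only
    return peeking_rate, fixed_rate

peeking_rate, fixed_rate = simulate_peeking()
print(f"stop at first significant look: {peeking_rate:.3f}")
print(f"single look at the end:         {fixed_rate:.3f}")
```

Under these assumptions the single-look rate should land near the nominal 5%, while the stop-at-first-significance rate should be several times higher.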

COMMON WRONG ANSWERS: Calling the issue generic p-hacking without distinguishing the specific repeated-testing mechanism of peeking. Saying the problem is simply that the sample size is too small, which misses the point that even a large sample can yield false positives if you test it repeatedly. Suggesting that switching to Bayesian methods automatically eliminates all peeking concerns, which is an oversimplification: stopping as soon as a posterior probability crosses a threshold can still inflate the rate of false wins, and how much it matters depends on the prior and the decision rule. Proposing to lower the alpha threshold arbitrarily without a formal correction for the number of looks.

LIKELY FOLLOW-UPS: How would you design an experiment that needs interim results for business reasons? What is a sequential probability ratio test or group sequential design? How does peeking differ from the multiple testing problem across many metrics? When might Bayesian updating be a reasonable alternative?
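For the group sequential follow-up, it can help to show how an alpha spending function allocates the overall error budget across interim looks. The sketch below uses the two standard Lan-DeMets spending functions (O'Brien-Fleming-type and Pocock-type); the function names and the four equally spaced looks are assumptions made for illustration. It reports only the cumulative alpha spent by each look; turning that into per-look critical values requires numerical integration over the joint distribution of the test statistics, which is not shown here.

```python
from math import e, log, sqrt
from statistics import NormalDist

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming-type: 2 - 2 * Phi(z_{alpha/2} / sqrt(t)).
    Spends very little alpha early and most of it near the end.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1).

    Lan-DeMets Pocock-type: alpha * ln(1 + (e - 1) * t).
    Spends alpha more evenly across looks.
    """
    return alpha * log(1 + (e - 1) * t)

# Hypothetical plan: four equally spaced looks at 25%, 50%, 75%, 100% of the data.
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF-type={obrien_fleming_spend(t):.5f}  "
          f"Pocock-type={pocock_spend(t):.5f}")
```

Both functions reach the full alpha at t = 1, so the overall false positive rate stays at the planned level; they differ in how strict the early looks are.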

ONE CONCRETE EXAMPLE: Imagine a 30-day A/B test with a 5% significance level. If the product manager checks the dashboard every morning and plans to ship the winning variant the moment the p-value drops below 0.05, the actual probability of declaring a winner when there is no true effect is well above 5%. Simulations show that with daily looks at the accumulating data, the false positive rate can exceed 25%. The daily looks are not independent, since each one includes all previous days' data; fully independent looks would give an even higher rate. If the manager instead pre-commits to running the full 30 days and only evaluates once at the end, the false positive rate stays at the intended 5%.
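The 30-day scenario can be reproduced with a short simulation on conversion data. This is a sketch under stated assumptions, not the source's own experiment: the traffic of 500 users per arm per day, the 10% baseline conversion rate, and all variable names are hypothetical, and the test used is a pooled two-proportion z-test at a two-sided 5% level.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_days, daily_users, base_rate = 4000, 30, 500, 0.10

# Cumulative conversions per arm; both arms share the same true rate (A/A test).
conv_a = rng.binomial(daily_users, base_rate, size=(n_sims, n_days)).cumsum(axis=1)
conv_b = rng.binomial(daily_users, base_rate, size=(n_sims, n_days)).cumsum(axis=1)
n = daily_users * np.arange(1, n_days + 1)  # users per arm after each day

# Pooled two-proportion z-test, recomputed every morning on all data so far.
p_a, p_b = conv_a / n, conv_b / n
pooled = (conv_a + conv_b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (p_b - p_a) / se
significant = np.abs(z) > 1.96

ship_on_first_significant_day = significant.any(axis=1).mean()
evaluate_once_on_day_30 = significant[:, -1].mean()
# What 30 fully independent 5% tests would give, for comparison only.
independent_looks_bound = 1 - 0.95 ** n_days

print(f"ship on first significant day: {ship_on_first_significant_day:.3f}")
print(f"evaluate once on day 30:       {evaluate_once_on_day_30:.3f}")
print(f"30 independent tests:          {independent_looks_bound:.3f}")
```

The gap between the first and third numbers reflects the correlation between daily looks: each morning's test reuses every earlier day's data, so the looks are far from independent, yet the error rate is still several times the nominal level.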

Source: GrowthBook Docs


