How do you determine sample size and duration for an A/B test?

Curated by the Tezvyn teamJune 16, 2026Source: cxl.comintermediate

This tests statistical power literacy. A strong answer names baseline rate, MDE, alpha, and beta; explains the duration versus sensitivity trade-off; and notes traffic allocation. A red flag is ignoring power or stopping early when results look significant.

WHAT THIS TESTS: This question probes whether you understand that an A/B test is a controlled experiment, not a dashboard you watch until the numbers feel good. Interviewers want to see that you treat sample size and duration as derived quantities rooted in statistical power, not arbitrary calendar events. They also care whether you recognize real world constraints like traffic volume, business cycles, and the cost of wrong decisions.

A GOOD ANSWER COVERS: First, name the four inputs to a standard power analysis. These are the baseline conversion rate or mean, the Minimum Detectable Effect you want to be able to spot, the significance level alpha usually set at five percent, and statistical power typically eighty percent. Second, explain that sample size per variant is what the formula yields, and duration equals total required visitors divided by daily eligible traffic. Third, discuss the core trade off. A smaller MDE or higher power inflates sample size and lengthens the test, while a larger MDE shrinks it but raises the risk of missing small yet valuable gains. Fourth, bring in practical guardrails. Mention running for full business cycles to capture weekday versus weekend behavior, avoiding peeking or optional stopping, and reserving a holdout if the business cannot afford a fifty-fifty split.

COMMON WRONG ANSWERS: The biggest red flag is saying you run the test for a week or two because that feels standard. Another is ignoring power entirely and claiming a large sample size alone guarantees validity. Some candidates confuse MDE with the observed lift, or suggest lowering alpha after the test starts to salvage significance. Stopping early because the p-value crossed the threshold mid experiment is a severe error that destroys the false positive rate.

LIKELY FOLLOW-UPS: An interviewer might ask how you would handle low traffic, which pushes you toward larger MDEs, sequential testing, or Bayesian methods with acceptable error bounds. They might also ask what you do if the test reaches the planned duration but the p-value is exactly point zero six, which tests whether you understand that the protocol should have been locked beforehand.

ONE CONCRETE EXAMPLE: Suppose your checkout page converts at ten percent baseline and you want to detect a relative fifteen percent uplift, which is an absolute one point five percentage point lift. Using a two tailed test at alpha equals point zero five and power of point eight, you need roughly six thousand two hundred visitors per variant, or about twelve thousand four hundred total. If your site sees one thousand eligible visitors per day and you run a fifty-fifty split, you will need roughly twenty five days. If the business demands results in two weeks, you must either accept a higher MDE, lower power, or run on a higher traffic page.

Read the original → cxl.com

#ab testing
#statistics
#experimentation
#sample size
#power analysis

Get five bites like this every day.

Tezvyn delivers a daily feed of 60-second tech bites with quizzes to lock in what you learn.

Get on Play Store Get on App Store