Architect a real-time multi-armed bandit and compare trade-offs to A/B testing

Curated by the Tezvyn team | June 17, 2026 | Source: optimizely.com | advanced

WHAT IT TESTS

Real-time ML serving and statistical trade-offs.

ANSWER OUTLINE

Sketch a fast arm router, streaming feedback, and model updates; contrast MAB regret minimization with A/B's unbiased estimates.

WHAT THIS TESTS: This question probes whether you can translate a statistical concept into a live production architecture. Interviewers want to see that you understand exploration versus exploitation not just as math but as latency budgets, data pipelines, and operational constraints. They also want to know if you grasp why A/B testing remains the gold standard for causal inference despite its inefficiency.

A GOOD ANSWER COVERS: First, a low-latency assignment service, typically at the edge or behind a CDN, that selects an arm in under 100 milliseconds using an epsilon-greedy, upper confidence bound (UCB), or Thompson Sampling policy. Second, a feedback pipeline that captures reward signals such as clicks or conversions and feeds them into a feature store or streaming aggregator. Third, a model update loop that refreshes arm probabilities: near-real-time when the update is just a counter increment, as with conjugate Beta-Bernoulli posteriors, or periodic when the model has to be refit. Fourth, the core trade-off: MAB minimizes regret by sending more traffic to leading variants early, but adaptive allocation violates the fixed-sample assumptions behind classical hypothesis testing, so naive estimates of lift are no longer unbiased. Fifth, operational guardrails such as an exploration floor, usually 5 to 10 percent, plus a fallback to static A/B if convergence is too slow or if the business needs a clean read for a quarterly review.
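The assignment path, feedback path, and exploration floor described above can be sketched in a few lines. This is a minimal single-process illustration, not a production design: the class name `ThompsonRouter`, its methods, and the Beta(1, 1) prior are assumptions made for the example, and a real service would keep the counts in a shared store fed by the streaming pipeline.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling with a uniform exploration floor.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """

    def __init__(self, arm_ids, exploration_floor=0.05, seed=None):
        # [successes, failures] per arm; the Beta(1, 1) prior is added when sampling.
        self.counts = {arm: [0, 0] for arm in arm_ids}
        self.exploration_floor = exploration_floor
        self.rng = random.Random(seed)

    def choose(self):
        """Assignment path: pick an arm for one request."""
        if self.rng.random() < self.exploration_floor:
            # Guardrail: a fixed share of traffic is assigned uniformly at random.
            return self.rng.choice(list(self.counts))
        # Draw one plausible conversion rate per arm and serve the highest draw.
        draws = {
            arm: self.rng.betavariate(1 + s, 1 + f)
            for arm, (s, f) in self.counts.items()
        }
        return max(draws, key=draws.get)

    def record(self, arm, converted):
        """Feedback path: fold one reward event into the arm's counts."""
        self.counts[arm][0 if converted else 1] += 1
```

Because the update is a counter increment, `record` can run on every reward event, which is what makes the near-real-time loop cheap for this policy. Typical use is `arm = router.choose()` on each request and `router.record(arm, converted)` when the reward arrives.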

COMMON WRONG ANSWERS: Treating MAB as a drop-in replacement with no engineering cost is a major red flag. Another is proposing daily batch retraining for a system described as real-time, which leaves arm probabilities stale and defeats the purpose. Candidates also err by claiming MAB delivers the same statistical rigor as A/B testing; adaptive allocation biases the naive per-arm estimates, so a standard post-hoc significance test is no longer valid without correction. Finally, ignoring cold start, when a new arm enters with zero data, shows shallow understanding.
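The bias from adaptive allocation can be seen in a toy case small enough to enumerate exactly. The sketch below assumes a deliberately simplified policy (keep pulling an arm only while its running mean stays at or above a threshold) rather than a real bandit algorithm; the function name and parameters are hypothetical. It computes the expected value of the naive sample mean over every possible outcome sequence.

```python
from fractions import Fraction


def expected_naive_mean(p, max_pulls, threshold=Fraction(1, 2)):
    """Exact expectation of the sample mean under a toy adaptive policy.

    The arm pays out with probability p. After each pull, the policy stops
    sampling the arm if its running mean falls below `threshold`, or when
    `max_pulls` is reached. Illustrative assumption, not a real bandit.
    """

    def walk(successes, pulls, prob):
        mean = Fraction(successes, pulls)
        if pulls == max_pulls or mean < threshold:
            return prob * mean  # sampling stops; this is the reported estimate
        return (
            walk(successes + 1, pulls + 1, prob * p)
            + walk(successes, pulls + 1, prob * (1 - p))
        )

    return walk(1, 1, p) + walk(0, 1, 1 - p)


p = Fraction(1, 2)
print(expected_naive_mean(p, max_pulls=1))  # fixed sample size: 1/2, unbiased
print(expected_naive_mean(p, max_pulls=2))  # adaptive stopping: 3/8, biased low
```

With a fixed sample size the estimate is unbiased, but as soon as the number of pulls depends on early outcomes, unlucky starts are never corrected and the average estimate drops below the true rate.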

LIKELY FOLLOW-UPS: How would you add user context to move from a standard bandit to a contextual bandit? What happens when a new copy variant is introduced mid-campaign? How do you prevent a single high-value user from skewing the reward distribution? Would you use an off-the-shelf Bayesian framework or build a custom counter service?

ONE CONCRETE EXAMPLE: Imagine a landing page with four headline variants. In an A/B test, each gets 25 percent of traffic for two weeks until a winner is declared, costing conversions on the weaker three. In a Thompson Sampling MAB, the assignment service starts with uniform Beta(1, 1) priors. After 1000 impressions, variant A shows a 5 percent click rate while the others show 2 percent. The algorithm begins serving A to roughly 70 percent of traffic while keeping a 10 percent exploration reserve for the remaining arms. Over a month, total conversions rise by 12 percent compared to the fixed split, but you cannot run a standard t-test at the end to prove it, because each arm's sample size depended on the outcomes observed along the way.
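To see how posterior counts turn into traffic shares in the headline example above, the share each arm receives under Thompson Sampling can be estimated by Monte Carlo. The inputs here are assumptions, since the example does not say how the 1000 impressions were split: 250 impressions per arm, 13 clicks for A (about 5 percent), and 5 clicks for each other arm (2 percent). The helper names are hypothetical, and the floor is spread uniformly over all arms for simplicity. The resulting shares depend on these choices and need not match the rounded figures in the text.

```python
import random


def thompson_shares(posteriors, draws=20000, seed=0):
    """Estimate how often each arm wins a Thompson draw (its traffic share)."""
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(draws):
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


def apply_floor(shares, floor):
    """Reserve `floor` of traffic, spread uniformly, and scale the rest."""
    k = len(shares)
    return [(1 - floor) * s + floor / k for s in shares]


# Assumed data (not from the source): 250 impressions per arm.
impressions = 250
clicks = {"A": 13, "B": 5, "C": 5, "D": 5}

# Beta(1, 1) prior plus observed clicks and non-clicks gives each posterior.
posteriors = [(1 + c, 1 + impressions - c) for c in clicks.values()]

shares = apply_floor(thompson_shares(posteriors), floor=0.10)
for arm, share in zip(clicks, shares):
    print(f"{arm}: {share:.1%}")
```

The same calculation also shows why a floor matters: without it, the trailing arms' shares shrink toward zero as evidence accumulates, and a late-improving variant would take a long time to recover.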

#experimentation
#machine-learning
#system-design
#ab-testing
#real-time
