Propose a multi-armed bandit system to optimize headlines faster

Curated by the Tezvyn team | June 17, 2026 | Source: Wikipedia: Multi-armed bandit | advanced

This tests online learning and the explore-exploit tradeoff. Strong answers contrast fixed A/B testing with adaptive allocation, sketch a Bayesian bandit service with a minimum exploration rate, and address delayed feedback.

WHAT THIS TESTS: This tests whether you understand the difference between fixed-allocation experimentation and online learning, and whether you can architect a production system that balances exploration and exploitation while handling delayed feedback and non-stationary rewards. Interviewers want to see that you know bandits minimize cumulative regret rather than just final inference error.
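
To make "cumulative regret" concrete, one common way to write it is shown below. The symbols are notation introduced here for illustration, not taken from the source: \mu_k is the true click-through rate of headline k, a_t is the headline served on request t, and T is the number of requests so far.

```latex
R_T = T\,\mu^{*} - \sum_{t=1}^{T} \mu_{a_t},
\qquad \mu^{*} = \max_{k} \mu_k
```

Under this definition, an even split across K headlines has expected regret T(\mu^{*} - \frac{1}{K}\sum_k \mu_k), which grows linearly in T. A well-behaved bandit aims for regret that grows more slowly than that, because it serves the weaker headlines less and less often.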

A GOOD ANSWER COVERS: First, the conceptual distinction: A/B testing uses a fixed split for a fixed period to estimate a treatment effect, while a bandit dynamically shifts traffic toward better arms as evidence accumulates, reducing the opportunity cost of showing suboptimal headlines. Second, algorithm selection: name a principled approach such as Thompson Sampling, which maintains a posterior over each headline's click-through rate and samples from it so that exploration is randomized naturally, or UCB1, which adds to each arm's observed click-through rate an optimism bonus that shrinks as the arm accumulates impressions, then serves the arm with the highest resulting upper confidence bound. Third, system architecture: a lightweight edge decision service that selects a variant per request and logs the impression; an async feedback pipeline that attributes clicks and updates posteriors; and a control plane that enforces an exploration floor, for example five to ten percent minimum traffic per arm, plus a kill switch. Fourth, practical safeguards: handle out-of-order and delayed feedback gracefully; account for non-stationarity, because headline performance decays as stories age, by using a sliding window or a decay on historical counts; and validate against a small holdout bucket running random allocation to verify that cumulative reward actually improves.
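
The non-stationarity and out-of-order safeguards described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production design: the class name `DecayedBetaArm`, the one-hour half-life, and the choice of exponential decay keyed on event time (instead of a sliding window) are all hypothetical choices made for the example.

```python
class DecayedBetaArm:
    """Beta posterior for one headline whose evidence fades with age.

    Hypothetical sketch: the names and the default half-life are illustrative.
    """

    def __init__(self, half_life_s=3600.0, prior_alpha=1.0, prior_beta=1.0):
        self.half_life_s = half_life_s
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        self.clicks = 0.0   # decayed count of attributed clicks
        self.misses = 0.0   # decayed count of impressions without a click
        self.as_of = None   # event time that the counts are decayed to

    def _weight(self, age_s):
        # An observation loses half its weight every half_life_s seconds.
        return 0.5 ** (age_s / self.half_life_s)

    def update(self, event_ts, clicked):
        """Fold in one outcome, stamped with the time of the impression."""
        if self.as_of is None:
            self.as_of = event_ts
        if event_ts > self.as_of:
            # Newer event: age the existing counts forward to its timestamp.
            factor = self._weight(event_ts - self.as_of)
            self.clicks *= factor
            self.misses *= factor
            self.as_of = event_ts
        # A late (out-of-order) event is down-weighted by its own age, so
        # the result does not depend on the order in which events arrive.
        weight = self._weight(self.as_of - event_ts)
        if clicked:
            self.clicks += weight
        else:
            self.misses += weight

    def posterior(self):
        """Return the (alpha, beta) parameters of the current posterior."""
        return (self.prior_alpha + self.clicks, self.prior_beta + self.misses)
```

Because each observation is weighted by the age of its impression rather than by its arrival time, a click that shows up hours late lands in the posterior with the same weight it would have had if it had arrived on time. Counts are only aged when an update arrives, so a fuller version would also decay them at selection time.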

COMMON WRONG ANSWERS: Claiming bandits eliminate statistical rigor or sample size thinking. Proposing purely greedy allocation after a brief burn-in, which causes premature convergence. Ignoring real-time constraints by suggesting nightly batch updates. Failing to mention exploration floors, which lets the system starve new headlines before they prove themselves. Conflating standard bandits with contextual bandits without clarifying whether user features are actually used.

LIKELY FOLLOW-UPS: How do you handle non-stationary rewards when headline click-through rates decay over time? What if click feedback is delayed by hours and traffic has already shifted? How do you validate the bandit against a traditional A/B test? Would you use a contextual bandit for personalized headlines, and how does that change the system? How do you prevent a new headline from being starved during its first impressions?

ONE CONCRETE EXAMPLE: Imagine five headlines, each starting with a Beta(1, 1) prior. For each request, sample a click-through rate from each posterior, serve the headline with the highest sample, and log the impression. If a click arrives, increment that headline's alpha; if the attribution window closes without a click, increment its beta. Enforce a hard floor of eight percent of traffic per headline regardless of the samples. After one day, the worst headline receives perhaps twelve percent of traffic instead of the twenty percent it would get in a balanced A/B test, cutting regret while still collecting enough data to detect a winner.
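
The example above can be sketched in a few lines of Python. Treat this as a rough illustration under stated assumptions: the class and function names are hypothetical, the floor is implemented as a uniform-random mixture (one of several ways to guarantee a minimum share), and the simulation delivers feedback immediately instead of waiting for an attribution window.

```python
import random


class ThompsonHeadlineBandit:
    """Beta-Bernoulli Thompson Sampling with a per-arm traffic floor.

    Hypothetical sketch; names and defaults are illustrative.
    """

    def __init__(self, n_arms, floor=0.08, seed=None):
        if n_arms * floor > 1.0:
            raise ValueError("floor * n_arms must not exceed 1")
        self.alpha = [1.0] * n_arms  # Beta(1, 1) prior for every headline
        self.beta = [1.0] * n_arms
        self.floor = floor
        self.rng = random.Random(seed)

    def select(self):
        n_arms = len(self.alpha)
        # Floor: with probability n_arms * floor, serve a uniformly random
        # headline, so every headline gets at least `floor` of the traffic.
        if self.rng.random() < n_arms * self.floor:
            return self.rng.randrange(n_arms)
        # Otherwise sample a CTR from each posterior and serve the largest.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(n_arms), key=samples.__getitem__)

    def record(self, arm, clicked):
        # Click -> alpha += 1; no click (window closed) -> beta += 1.
        if clicked:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0


def simulate(true_ctrs, n_requests, floor=0.08, seed=0):
    """Return the share of traffic each headline received.

    Assumption: feedback is immediate, which a real system would not have.
    """
    bandit = ThompsonHeadlineBandit(len(true_ctrs), floor=floor, seed=seed)
    world = random.Random(seed + 1)  # separate RNG for simulated users
    served = [0] * len(true_ctrs)
    for _ in range(n_requests):
        arm = bandit.select()
        served[arm] += 1
        bandit.record(arm, world.random() < true_ctrs[arm])
    return [count / n_requests for count in served]
```

Calling `simulate([0.02, 0.03, 0.04, 0.05, 0.10], 20000)` with these made-up click-through rates returns one traffic share per headline. The weaker headlines should sit near the floor while most of the remaining traffic flows to the strongest one.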


#multi-armed bandit
#ab testing
#experimentation
#system design
#online learning
