Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback
Online decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays prevent optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome these challenges.
Talk Title | Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback
Speakers | Shradha Agrawal (Adobe)
Conference | Strata Data Conference
Conf Tag | Big Data Expo
Location | San Francisco, California
Date | March 26-28, 2019
URL | Talk Page
Slides | Talk Slides
Video |
The exploration-exploitation trade-off is a fundamental dilemma in online decision making. Reinforcement learning (RL) approaches are often employed to achieve optimal outcomes, and multi-armed bandits (MAB) are popular RL algorithms tailored for tackling this trade-off. However, increasing the number of arms (i.e., decision criteria) leads to an exponential increase in complexity. Multi-armed bandits also need a fast feedback loop to improve their policy decisions and converge to the optimal solution, but delayed feedback is common in many applications; in advertising, for example, information about a conversion becomes available long after the advertisement was displayed. Shradha Agrawal offers an overview of MABs and explains how to efficiently scale to multiple decision criteria. Shradha focuses on the Thompson sampling technique, which uses randomization effectively to handle observational delays, and uses an example from advertising to show how the solution can deliver relevant, personalized experiences to users in real time and increase conversions.
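The abstract does not include the speaker's implementation, but a minimal sketch of the general idea, Thompson sampling with Beta-Bernoulli arms where binary conversion feedback arrives after a delay, might look like the following. The class names, the decision-id bookkeeping, and the fixed-delay simulation are illustrative assumptions, not Adobe's production system.

```python
import random


class BetaBernoulliArm:
    """Tracks a Beta posterior over an arm's unknown conversion probability."""

    def __init__(self):
        self.alpha = 1.0  # prior "successes" + 1
        self.beta = 1.0   # prior "failures" + 1

    def sample(self):
        # Draw a plausible conversion rate from the current posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Bayesian update with a binary reward (1 = conversion, 0 = none).
        self.alpha += reward
        self.beta += 1 - reward


class DelayedThompsonSampler:
    """Thompson sampling that tolerates delayed reward observations.

    Decisions are made immediately by sampling each arm's posterior;
    rewards are applied whenever they arrive, keyed by a decision id,
    so late feedback still improves the posteriors.
    """

    def __init__(self, n_arms):
        self.arms = [BetaBernoulliArm() for _ in range(n_arms)]
        self.pending = {}  # decision_id -> arm index awaiting feedback
        self.next_id = 0

    def select_arm(self):
        # Randomized exploration: play the arm with the largest posterior sample.
        samples = [arm.sample() for arm in self.arms]
        chosen = max(range(len(samples)), key=samples.__getitem__)
        decision_id = self.next_id
        self.next_id += 1
        self.pending[decision_id] = chosen
        return decision_id, chosen

    def record_feedback(self, decision_id, reward):
        # Delayed feedback: update the posterior of whichever arm was played.
        arm_index = self.pending.pop(decision_id, None)
        if arm_index is not None:
            self.arms[arm_index].update(reward)


if __name__ == "__main__":
    # Toy simulation: three ad variants with hidden conversion rates and a
    # fixed 50-round delay between the impression and the conversion report.
    true_rates = [0.05, 0.10, 0.15]
    delay = 50
    bandit = DelayedThompsonSampler(n_arms=len(true_rates))
    backlog = []  # (round_due, decision_id, reward)

    for t in range(5000):
        decision_id, arm = bandit.select_arm()
        reward = 1 if random.random() < true_rates[arm] else 0
        backlog.append((t + delay, decision_id, reward))
        # Deliver any feedback whose delay has elapsed.
        while backlog and backlog[0][0] <= t:
            _, past_id, past_reward = backlog.pop(0)
            bandit.record_feedback(past_id, past_reward)

    for i, posterior in enumerate(bandit.arms):
        est = posterior.alpha / (posterior.alpha + posterior.beta)
        print(f"arm {i}: posterior mean conversion rate ~ {est:.3f}")
```

The key design point, consistent with the abstract, is that the sampler never blocks waiting for outcomes: randomization over posteriors keeps exploration going while conversions are still in flight, and each late observation is folded in as soon as it lands.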