Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback
Online decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays prevent optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome these challenges.
Talk Title | Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback
Speakers | Shradha Agrawal (Adobe)
Conference | Strata Data Conference
Conf Tag | Big Data Expo
Location | San Francisco, California
Date | March 26-28, 2019
URL | Talk Page
Slides | Talk Slides
Video |
The exploration-exploitation trade-off is a fundamental dilemma in online decision making. Reinforcement learning (RL) approaches are often employed to achieve optimal outcomes, and multi-armed bandits (MAB) are popular RL algorithms tailored for tackling this trade-off. However, increasing the number of arms (i.e., decision criteria) leads to an exponential increase in complexity. Multi-armed bandits also need a fast feedback loop to improve their policy decisions and converge to the optimal solution, but delayed feedback is common in many applications; in advertising, for example, information about a conversion becomes available long after the advertisement was displayed. Shradha Agrawal offers an overview of MABs and explains how to efficiently scale to multiple decision criteria. Shradha focuses on the Thompson sampling technique, which uses randomization effectively to handle observational delays, and uses an example from advertising to show how the solution can deliver relevant, personalized experiences to users in real time and increase conversions.
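The abstract does not include the speaker's implementation, but a minimal sketch of the general idea, Thompson sampling with Beta-Bernoulli arms where binary conversion feedback arrives after a delay, might look like the following. The class names, the decision-id bookkeeping, and the fixed-delay simulation are illustrative assumptions, not Adobe's production system.

```python
import random


class BetaBernoulliArm:
    """Tracks a Beta posterior over an arm's unknown conversion probability."""

    def __init__(self):
        self.alpha = 1.0  # prior "successes" + 1
        self.beta = 1.0   # prior "failures" + 1

    def sample(self):
        # Draw a plausible conversion rate from the current posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Bayesian update with a binary reward (1 = conversion, 0 = none).
        self.alpha += reward
        self.beta += 1 - reward


class DelayedThompsonSampler:
    """Thompson sampling that tolerates delayed reward observations.

    Decisions are made immediately by sampling each arm's posterior;
    rewards are applied whenever they arrive, keyed by a decision id,
    so late feedback still improves the posteriors.
    """

    def __init__(self, n_arms):
        self.arms = [BetaBernoulliArm() for _ in range(n_arms)]
        self.pending = {}  # decision_id -> arm index awaiting feedback
        self.next_id = 0

    def select_arm(self):
        # Randomized exploration: play the arm with the largest posterior sample.
        samples = [arm.sample() for arm in self.arms]
        chosen = max(range(len(samples)), key=samples.__getitem__)
        decision_id = self.next_id
        self.next_id += 1
        self.pending[decision_id] = chosen
        return decision_id, chosen

    def record_feedback(self, decision_id, reward):
        # Delayed feedback: update the posterior of whichever arm was played.
        arm_index = self.pending.pop(decision_id, None)
        if arm_index is not None:
            self.arms[arm_index].update(reward)


if __name__ == "__main__":
    # Toy simulation: three ad variants with hidden conversion rates and a
    # fixed 50-round delay between the impression and the conversion report.
    true_rates = [0.05, 0.10, 0.15]
    delay = 50
    bandit = DelayedThompsonSampler(n_arms=len(true_rates))
    backlog = []  # (round_due, decision_id, reward)

    for t in range(5000):
        decision_id, arm = bandit.select_arm()
        reward = 1 if random.random() < true_rates[arm] else 0
        backlog.append((t + delay, decision_id, reward))
        # Deliver any feedback whose delay has elapsed.
        while backlog and backlog[0][0] <= t:
            _, past_id, past_reward = backlog.pop(0)
            bandit.record_feedback(past_id, past_reward)

    for i, posterior in enumerate(bandit.arms):
        est = posterior.alpha / (posterior.alpha + posterior.beta)
        print(f"arm {i}: posterior mean conversion rate ~ {est:.3f}")
```

The key design point, consistent with the abstract, is that the sampler never blocks waiting for outcomes: randomization over posteriors keeps exploration going while conversions are still in flight, and each late observation is folded in as soon as it lands.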