December 25, 2019

246 words 2 mins read

Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback


Online decision making often faces the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays prevent optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome these challenges.

Talk Title: Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback
Speakers: Shradha Agrawal (Adobe)
Conference: Strata Data Conference
Conf Tag: Big Data Expo
Location: San Francisco, California
Date: March 26-28, 2019
URL: Talk Page
Slides: Talk Slides

The exploration-exploitation trade-off is a fundamental dilemma in online decision making. Reinforcement learning (RL) approaches are often employed to achieve optimal outcomes, and multi-armed bandits (MAB) are popular RL algorithms tailored for tackling this trade-off. However, increasing the number of arms (i.e., decision criteria) leads to an exponential increase in complexity. Multi-armed bandits also need a fast feedback loop to improve their policy decisions and converge to the optimal solution, yet delayed feedback is common in many applications; in advertising, for example, information about a conversion becomes available long after the advertisement was displayed. Shradha Agrawal offers an overview of MABs and explains how to scale them efficiently to multiple decision criteria. Shradha focuses on the Thompson sampling technique, which uses randomization effectively to handle observational delays, using an example from advertising to show how the solution can provide relevant, personalized experiences to users in real time and increase conversions.
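To make the idea concrete, here is a minimal sketch of Beta-Bernoulli Thompson sampling with delayed feedback. It is not the speaker's implementation: the fixed feedback delay, the arm count, and the hypothetical conversion rates are illustrative assumptions. The bandit samples a conversion-rate estimate for each arm from its Beta posterior, plays the best-looking arm, and only applies each observed reward once its delay has elapsed.

```python
import random
from collections import deque

# Sketch: Beta-Bernoulli Thompson sampling where rewards (e.g., ad
# conversions) are only observable after a fixed delay, so posterior
# updates are applied in batches as feedback arrives.

class DelayedThompsonBandit:
    def __init__(self, n_arms, delay):
        self.alpha = [1.0] * n_arms   # Beta posterior "successes" (plus prior)
        self.beta = [1.0] * n_arms    # Beta posterior "failures" (plus prior)
        self.pending = deque()        # queued feedback: (ready_at_step, arm, reward)
        self.delay = delay

    def select_arm(self, step):
        # Apply any feedback that has become observable by this step.
        while self.pending and self.pending[0][0] <= step:
            _, arm, reward = self.pending.popleft()
            if reward:
                self.alpha[arm] += 1
            else:
                self.beta[arm] += 1
        # Sample a plausible conversion rate per arm and play the best one.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def record(self, step, arm, reward):
        # Feedback is only usable `delay` steps after the arm was played.
        self.pending.append((step + self.delay, arm, reward))


if __name__ == "__main__":
    true_rates = [0.02, 0.05, 0.03]   # hypothetical conversion rates per ad variant
    bandit = DelayedThompsonBandit(n_arms=3, delay=50)
    for t in range(10_000):
        arm = bandit.select_arm(t)
        reward = 1 if random.random() < true_rates[arm] else 0
        bandit.record(t, arm, reward)
    print("posterior means:",
          [a / (a + b) for a, b in zip(bandit.alpha, bandit.beta)])
```

Because Thompson sampling keeps drawing from the posterior rather than committing to a point estimate, it continues to explore sensibly while feedback is still in flight, which is what makes it a natural fit for the delayed-feedback setting described above.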
