Date
Thu, 03 Dec 2020
Time
16:00 - 17:00
Speaker
Tanut Treetanthiploet
Organisation
University of Oxford

Abstract: In many situations, one needs to decide between acting to reveal data about a system and acting to generate profit; this is the trade-off between exploration and exploitation. A simple situation where we face this trade-off is a multiarmed bandit problem, where one has M ‘bandits’ which generate reward from an unknown distribution, and one must choose which bandit to play at each time. The key difficulty in the multi-armed bandit problem is that the action often affects the information obtained. Due to the curse of dimensionality, solving the bandit problem directly is often computationally intractable.

In this talk, we will formulate a general class of the multi-armed bandit problem as a relaxed stochastic control problem. By introducing an entropy premium, we obtain a smooth asymptotic approximation to the value function. This yields a novel semi-index approximation of the optimal decision process, obtained numerically by solving a fixed point problem, which can be interpreted as explicitly balancing an exploration–exploitation trade-off.  Performance of the resulting Asymptotic Randomised Control (ARC) algorithm compares favourably with other approaches to correlated multi-armed bandits.

As an application of the multi-armed bandit, we also consider a multi-armed bandit problem where the observation from each bandit arrive from a Generalised Linear Model. We then use such model to consider a dynamic online pricing problem. The numerical simulation shows that the ARC algorithm also performs well compared to others.
============

Please contact us with feedback and comments about this page. Last updated on 03 Apr 2022 01:32.