In robust decision making, we are pessimistic when the probability measure is unknown: we optimise our decision under a worst-case scenario (e.g. via value at risk or expected shortfall). On the other hand, most theories in reinforcement learning (e.g. the UCB or epsilon-greedy algorithms) tell us to be optimistic in order to encourage learning. These two approaches present an apparent contradiction in decision making, which raises a natural question: how should we make decisions, given that they affect both our short-term outcomes and the information available to us in the future?
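As a point of reference for the optimistic side, here is a minimal sketch (not from the talk) of the UCB1 rule on a toy Bernoulli bandit: each arm's score is its empirical mean plus an optimism bonus, so poorly explored arms are deliberately overvalued. The arm probabilities and horizon are illustrative choices.

```python
import math
import random

def ucb1(counts, rewards, t):
    """Pick the arm maximising empirical mean plus an optimism bonus (UCB1)."""
    for a in range(len(counts)):
        if counts[a] == 0:
            return a  # play each arm once before the bonus is well defined
    def score(a):
        mean = rewards[a] / counts[a]
        bonus = math.sqrt(2 * math.log(t) / counts[a])  # shrinks as arm a is sampled more
        return mean + bonus
    return max(range(len(counts)), key=score)

random.seed(0)
probs = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arms; arm 2 is best
counts = [0] * 3
rewards = [0.0] * 3
for t in range(1, 2001):
    a = ucb1(counts, rewards, t)
    r = 1.0 if random.random() < probs[a] else 0.0
    counts[a] += 1
    rewards[a] += r
print(counts)  # the best arm should accumulate most of the plays
```

The optimism bonus is what "encourages learning": it forces the algorithm to keep sampling arms whose value is still uncertain.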
In this talk, I will discuss this phenomenon through the classical multi-armed bandit problem, which is known to be solved via Gittins' index theory in the setting of risk (i.e. when the probability measure is fixed). By extending this result to an uncertainty setting, we show that it is possible to account for uncertainty and for learning towards a future benefit at the same time. This is done by extending a consistent nonlinear expectation (i.e. a nonlinear expectation with the tower property) through multiple filtrations.
At the end of the talk, I will present numerical results illustrating how certain parameters control the level of exploration and exploitation in our decisions.
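As a simple illustration of parameter-controlled exploration (again a sketch, not the talk's construction), epsilon-greedy makes the trade-off explicit: with probability eps we explore uniformly, otherwise we exploit the empirical best arm. The bandit and horizon below are hypothetical.

```python
import random

def run_eps_greedy(eps, probs, horizon=2000, seed=1):
    """Epsilon-greedy on a Bernoulli bandit; returns the realised exploration fraction."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k
    rewards = [0.0] * k
    explore_steps = 0
    for _ in range(horizon):
        if rng.random() < eps:
            a = rng.randrange(k)  # explore: uniformly random arm
            explore_steps += 1
        else:
            # exploit: current empirical best arm (unplayed arms score 0)
            a = max(range(k), key=lambda i: rewards[i] / counts[i] if counts[i] else 0.0)
        r = 1.0 if rng.random() < probs[a] else 0.0
        counts[a] += 1
        rewards[a] += r
    return explore_steps / horizon

probs = [0.3, 0.7]  # hypothetical two-armed bandit
for eps in (0.01, 0.1, 0.5):
    print(eps, run_eps_greedy(eps, probs))
```

Raising eps buys more information about the arms at the cost of short-term reward, which is the dial between exploration and exploitation in its crudest form.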
- Junior Applied Mathematics Seminar