Date: Thu, 04 Nov 2010, 13:00
Location: DH 1st floor SR
Speaker: Nathaniel Korda

An agent is presented with N bandit (fruit) machines. It is assumed that each machine produces successes or failures according to some fixed but unknown Bernoulli distribution. If the agent plays forever, how can he/she choose a strategy that ensures the average success rate observed tends to the parameter of the "best" arm?
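As a concrete illustration of this first criterion, here is a minimal simulation sketch, assuming Bernoulli arms and a naive epsilon-greedy player (which is not one of the strategies discussed in the talk). With a fixed exploration rate the long-run average success rate stays strictly below the best arm's parameter, and that is exactly the gap a "learning" strategy must close.

```python
# Minimal sketch of the Bernoulli bandit setting described above.
# The arm parameters and the epsilon-greedy player are illustrative
# assumptions, not the strategies analysed in the talk.
import random

def pull(p):
    """One play of a machine with (unknown) success probability p."""
    return 1 if random.random() < p else 0

def epsilon_greedy(params, steps=100_000, eps=0.1):
    """Play `steps` rounds; return the average success rate observed."""
    n = len(params)
    counts = [0] * n          # times each arm has been played
    means = [0.0] * n         # empirical success rate of each arm
    total = 0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(n)                       # explore
        else:
            arm = max(range(n), key=lambda i: means[i])     # exploit
        r = pull(params[arm])
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total / steps

# A fixed exploration rate keeps the average success rate strictly below
# the best parameter (0.8 here), so this player does not "learn" in the
# sense of the first question above.
print(epsilon_greedy([0.2, 0.5, 0.8]))
```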

Alternatively, suppose that the agent receives a reward of a^n at the nth button press for a success, and 0 for a failure; how can the agent now choose a strategy to optimise his/her total expected reward over all time? These are two examples of classic bandit problems.
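In symbols, writing X_n in {0, 1} for the outcome of the nth press and assuming (as is standard in the Gittins-index setting) a discount factor 0 < a < 1, this second objective is to choose a strategy maximising

```latex
\[
  \mathbb{E}\left[\sum_{n=1}^{\infty} a^{n} X_{n}\right],
  \qquad 0 < a < 1,
\]
```

where the expectation is over both the machines' outcomes and the strategy's choices.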

We analyse the behaviour of two strategies, the Narendra Algorithm and the Gittins Index Strategy. The Narendra Algorithm is a "learning" strategy, in that it answers the first question above, and we demonstrate that this remains true when the sequences of successes and failures observed on the machines are no longer i.i.d., but merely satisfy an ergodic condition. The Gittins Index Strategy optimises the reward stream given above. We demonstrate that this strategy does not "learn", and give some new explicit bounds on the Gittins Indices themselves.
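For a flavour of the first strategy, here is a minimal sketch of a linear reward-inaction learning automaton of the kind associated with Narendra and Thathachar; the update rule and learning rate are illustrative assumptions, and the precise algorithm and the ergodic setting analysed in the talk may differ.

```python
# Minimal sketch of a linear reward-inaction (L_RI) learning automaton,
# of the kind associated with Narendra and Thathachar. The update rule
# and learning rate are illustrative assumptions only.
import random

def narendra_lri(params, steps=200_000, theta=0.01):
    """Maintain a probability vector over arms; on a success, shift mass
    towards the arm just played; on a failure, leave the vector unchanged."""
    n = len(params)
    probs = [1.0 / n] * n
    total = 0
    for _ in range(steps):
        # Sample an arm from the current probability vector.
        arm = random.choices(range(n), weights=probs)[0]
        success = random.random() < params[arm]
        total += success
        if success:
            # Reward-inaction update: p_arm <- p_arm + theta * (1 - p_arm),
            # all other probabilities scaled by (1 - theta).
            probs = [(1 - theta) * p for p in probs]
            probs[arm] += theta
    return total / steps

# As the learning rate shrinks and the horizon grows, the average success
# rate should approach the parameter of the best arm (0.8 here).
print(narendra_lri([0.2, 0.5, 0.8]))
```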
