Transcript and Presenter's Notes

Title: Tuning bandit algorithms in stochastic environments


1
Tuning bandit algorithms in stochastic
environments
  • Jean-Yves Audibert, CERTIS - Ecole des Ponts
  • Remi Munos, INRIA Futurs Lille
  • Csaba Szepesvári, University of Alberta
  • The 18th International Conference on Algorithmic
    Learning Theory
  • October 3, 2007, Sendai International Center,
    Sendai, Japan

2
Contents
  • Bandit problems
  • UCB and Motivation
  • Tuning UCB by using variance estimates
  • Concentration of the regret
  • Finite horizon finite regret (PAC-UCB)
  • Conclusions

3
Exploration vs. Exploitation
  • Two treatments
  • Unknown success probabilities
  • Goal
  • find the best treatment while losing the smallest
    number of patients
  • Explore or exploit?

4
Playing Bandits
  • Payoff is 0 or 1
  • Arm 1
  • X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, …
  • Arm 2
  • X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, …

(Slide figure: a table of the observed 0/1 payoffs, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, laid out under the two arms; a simulation sketch of this setup follows below.)
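To make the setup concrete, here is a minimal simulation sketch of such a two-armed 0/1 bandit (not from the slides); the success probabilities in p and the alternating policy are purely illustrative.

```python
import random

# Minimal two-armed Bernoulli bandit: pulling arm k returns 1 with
# probability p[k] and 0 otherwise.  The values in p are illustrative
# only and are unknown to the player.
p = [0.4, 0.6]
T = [0, 0]      # T[k]: number of times arm k has been pulled
s = [0, 0]      # s[k]: total payoff collected from arm k

def pull(k):
    """Play arm k once and record the 0/1 payoff."""
    x = 1 if random.random() < p[k] else 0
    T[k] += 1
    s[k] += x
    return x

# A naive, purely exploratory policy: alternate between the two arms.
for t in range(10):
    pull(t % 2)

print("empirical means:", [s[k] / T[k] for k in range(2)])
```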
5
Exploration vs. Exploitation: Some Applications
  • Simple processes
  • Clinical trials
  • Job shop scheduling (random jobs)
  • What ad to put on a web-page
  • More complex processes (memory)
  • Optimizing production
  • Controlling an inventory
  • Optimal investment
  • Poker

6
Bandit Problems: Optimism in the Face of
Uncertainty
  • Introduced by Lai and Robbins (1985) (?)
  • i.i.d. payoffs
  • X_{1,1}, X_{1,2}, …, X_{1,t}, …
  • X_{2,1}, X_{2,2}, …, X_{2,t}, …
  • Principle
  • Inflated value of an option: the maximum expected
    reward that still looks quite possible given the
    observations so far
  • Select the option with the best inflated value

7
Some definitions
Now t = 11, T_1(t-1) = 4, T_2(t-1) = 6, I_1 = 1, I_2 = 2, … (notation sketched below)
  • Payoff is 0 or 1
  • Arm 1
  • X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, …
  • Arm 2
  • X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, …

(Slide figure: the 0/1 payoff table from slide 4, repeated.)
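A sketch of the notation in standard bandit terms (the exact displays on the slide are not reproduced in the transcript and are assumed here): I_s is the arm played at round s, T_k(t) the number of plays of arm k up to round t, and Δ_k the gap of arm k.

```latex
% I_s \in \{1,\dots,K\}: index of the arm played at round s
% T_k(t): number of times arm k has been played in the first t rounds
T_k(t) = \sum_{s=1}^{t} \mathbf{1}\{I_s = k\},
\qquad
\Delta_k = \mu^* - \mu_k \quad (\mu_k = \mathbb{E}[X_{k,1}],\ \mu^* = \max_k \mu_k),
\qquad
R_n = \sum_{k=1}^{K} T_k(n)\,\Delta_k .
% In the example above: t = 11, T_1(10) = 4, T_2(10) = 6.
```

The pseudo-regret R_n in this form is the quantity used again on the concentration slide later in the talk.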
8
Parametric Bandits (Lai and Robbins)
  • X_{i,t} ~ p_{i,θ_i}(·), θ_i unknown, t = 1, 2, …
  • Uncertainty set: reasonable values of θ given
    the experience so far, U_{i,t} = { θ : p_{i,θ}(X_{i,1:T_i(t)})
    is "large" modulo (t, T_i(t)) }
  • Inflated values: Z_{i,t} = max { E_θ[X] : θ ∈ U_{i,t} }
  • Rule: I_t = arg max_i Z_{i,t}

9
Bounds
  • Upper bound
  • Lower bound: if an algorithm is uniformly good,
    then…
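The displayed bounds themselves are not in the transcript; for reference, the classical Lai and Robbins (1985) lower bound for uniformly good strategies has the following asymptotic form (a sketch, with KL denoting the Kullback-Leibler divergence between the payoff distribution of a suboptimal arm k and that of the optimal arm):

```latex
% Any uniformly good algorithm must pull each suboptimal arm k at least
% logarithmically often:
\liminf_{n \to \infty} \frac{\mathbb{E}[T_k(n)]}{\log n}
  \;\ge\; \frac{1}{\mathrm{KL}\!\left(p_{\theta_k},\, p_{\theta^*}\right)},
% hence the expected regret grows at least as a constant times \log n.
```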

10
UCB1 Algorithm (Auer et al., 2002)
  • Algorithm UCB1(b)
  • Try all options once
  • Use option k with the highest index
  • Regret bound
  • R_n = expected loss due to not selecting the best
    option up to time step n. Then… (a sketch of UCB1
    follows below)
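The exact index formula is not reproduced in the transcript. Below is a minimal sketch of UCB1 in the spirit of Auer et al. (2002), with the payoff range bound b folded into the exploration term as the name UCB1(b) suggests; the exact constant used in the talk may differ.

```python
import math
import random

def ucb1(pull, K, n, b=1.0):
    """UCB1(b): try every option once, then always play the arm with the
    highest index  mean_k + b * sqrt(2 * ln(t) / T_k)."""
    T = [0] * K          # number of pulls of each arm
    s = [0.0] * K        # total payoff collected from each arm
    for k in range(K):   # initialization: try all options once
        s[k] += pull(k)
        T[k] += 1
    for t in range(K + 1, n + 1):
        k = max(range(K), key=lambda i: s[i] / T[i]
                + b * math.sqrt(2.0 * math.log(t) / T[i]))
        s[k] += pull(k)
        T[k] += 1
    return T

# Usage on the two-armed 0/1 example (illustrative probabilities):
p = [0.4, 0.6]
counts = ucb1(lambda k: 1.0 if random.random() < p[k] else 0.0, K=2, n=1000)
print("pull counts:", counts)
```

With a log-sized exploration bonus of this form, a suboptimal arm is pulled only O(log n) times in expectation, which yields the logarithmic expected-regret bound of Auer et al. (2002).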

11
Problem 1
  • When b² ≫ σ², the regret should scale with σ² and
    not with b²!

12
UCB1-NORMAL
  • Algorithm UCB1-NORMAL
  • Try all options once
  • Use option k with the highest index
  • Regret bound

13
Problem 1
  • The regret of UCB1(b) scales with O(b²)
  • The regret of UCB1-NORMAL scales with O(σ²)
  • but UCB1-NORMAL assumes normally distributed
    payoffs
  • UCB-Tuned(b)
  • Good experimental results
  • No theoretical guarantees

14
UCB-V
  • Algorithm UCB-V(b)
  • Try all options once
  • Use option k with the highest index
  • Regret bound
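The UCB-V index is again not reproduced in the transcript. As a sketch of its form (following my reading of the ALT 2007 paper, so the constants should be checked against it), the index of an arm played s times by round t combines the empirical mean, the empirical variance, and the range bound b:

```latex
% \bar{X}_{k,s}: empirical mean of arm k after s plays
% V_{k,s}: empirical variance of arm k after s plays
% b: a priori bound on the payoffs; \zeta, c: exploration parameters
B_{k,s,t} \;=\; \bar{X}_{k,s}
  \;+\; \sqrt{\frac{2\, V_{k,s}\, \zeta \log t}{s}}
  \;+\; c\,\frac{3\, b\, \zeta \log t}{s},
\qquad
I_t = \arg\max_k B_{k,\,T_k(t-1),\,t} .
```

With an exploration term driven by the empirical variance V_{k,s}, the a priori bound b only enters through the lower-order (log t)/s term, which is how the regret's dependence moves from b² to σ², as the conclusions slide states.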

15
Proof
  • The missing bound (hunch.net)
  • Bounding the sampling times of suboptimal arms
    (new bound)

16
Can we decrease exploration?
  • Algorithm UCB-V(b, ζ, c)
  • Try all options once
  • Use option k with the highest index
  • Theorem
  • When ζ < 1, the regret will be polynomial for some
    bandit problems
  • When cζ < 1/6, the regret will be polynomial for
    some bandit problems

17
Concentration bounds
  • Averages concentrate
  • Does the regret of UCB concentrate?

RISK??
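For context (not on the slide): "averages concentrate" is usually made precise by a Bernstein-type inequality, which is also the kind of bound behind variance-based indices such as UCB-V; a standard form, assuming i.i.d. payoffs with |X_i - μ| ≤ b, is:

```latex
% X_1,\dots,X_s i.i.d. with mean \mu, variance \sigma^2, |X_i - \mu| \le b
\Pr\!\left( \left| \frac{1}{s}\sum_{i=1}^{s} X_i - \mu \right| \ge \varepsilon \right)
  \;\le\; 2 \exp\!\left( - \frac{s\,\varepsilon^2}{2\sigma^2 + \tfrac{2}{3}\, b\, \varepsilon} \right).
```

The question the next slide answers is whether the regret itself, and not just the per-arm averages, enjoys this kind of exponential concentration.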
18
Logarithmic regret implies high risk
  • Theorem: Consider the pseudo-regret R_n = Σ_{k=1}^K
    T_k(n) Δ_k.
  • Then for any γ > 1 and z > γ log(n), P(R_n > z) ≤
    C z^{-γ}
  • (Gaussian tail would be P(R_n > z) ≈ C exp(-z²))
  • Illustration
  • Two arms, gap Δ_2 = μ_2 - μ_1 > 0.
  • Modes of the law of R_n at O(log(n)) and O(Δ_2 n)!

Only happens when the support of the second-best
arm's distribution overlaps with that of the
optimal arm
19
Finite horizon PAC-UCB
  • Algorithm PAC-UCB(N)
  • Try all options once
  • Use option k with the highest index
  • Theorem
  • At time N, with probability at least 1 - 1/N, the
    number of suboptimal plays is bounded by O(log(K N)).
  • Good when N is known beforehand

20
Conclusions
  • Taking into account the variance lessens
    dependence on the a priori bound b
  • Low expected regret ⇒ high risk
  • PAC-UCB
  • Finite regret, known horizon, exponential
    concentration of the regret
  • Optimal balance? Other algorithms?
  • Greater generality: look up the paper!

21
Thank you!
  • Questions?

22
References
  • Optimism in the face of uncertainty: Lai, T. L.
    and Robbins, H. (1985). Asymptotically efficient
    adaptive allocation rules. Advances in Applied
    Mathematics, 6:4-22.
  • UCB1 and more: Auer, P., Cesa-Bianchi, N., and
    Fischer, P. (2002). Finite-time analysis of the
    multiarmed bandit problem. Machine Learning,
    47(2-3):235-256.
  • Audibert, J.-Y., Munos, R., and Szepesvári, Cs.
    (2007). Tuning bandit algorithms in stochastic
    environments. ALT 2007.