Transcript and Presenter's Notes

Title: Tuning bandit algorithms in stochastic environments


1
Tuning bandit algorithms in stochastic
environments
  • Jean-Yves Audibert, CERTIS - Ecole des Ponts
  • Remi Munos, INRIA Futurs Lille
  • Csaba Szepesvári, University of Alberta
  • The 18th International Conference on Algorithmic
    Learning Theory
  • October 3, 2007, Sendai International Center,
    Sendai, Japan

2
Contents
  • Bandit problems
  • UCB and Motivation
  • Tuning UCB by using variance estimates
  • Concentration of the regret
  • Finite horizon finite regret (PAC-UCB)
  • Conclusions

3
Exploration vs. Exploitation
  • Two treatments
  • Unknown success probabilities
  • Goal
  • find the best treatment while losing the smallest
    number of patients
  • Explore or exploit?

4
Playing Bandits
  • Payoff is 0 or 1
  • Arm 1
  • X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, …
  • Arm 2
  • X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, …

(Slide figure: a table of the observed 0/1 payoffs, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, laid out under the two arms; a simulation sketch of this setup follows below.)
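To make the setup concrete, here is a minimal simulation sketch of such a two-armed 0/1 bandit (not from the slides); the success probabilities in p and the alternating policy are purely illustrative.

```python
import random

# Minimal two-armed Bernoulli bandit: pulling arm k returns 1 with
# probability p[k] and 0 otherwise.  The values in p are illustrative
# only and are unknown to the player.
p = [0.4, 0.6]
T = [0, 0]      # T[k]: number of times arm k has been pulled
s = [0, 0]      # s[k]: total payoff collected from arm k

def pull(k):
    """Play arm k once and record the 0/1 payoff."""
    x = 1 if random.random() < p[k] else 0
    T[k] += 1
    s[k] += x
    return x

# A naive, purely exploratory policy: alternate between the two arms.
for t in range(10):
    pull(t % 2)

print("empirical means:", [s[k] / T[k] for k in range(2)])
```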
5
Exploration vs. Exploitation: Some Applications
  • Simple processes
  • Clinical trials
  • Job shop scheduling (random jobs)
  • What ad to put on a web-page
  • More complex processes (memory)
  • Optimizing production
  • Controlling an inventory
  • Optimal investment
  • Poker

6
Bandit Problems: Optimism in the Face of
Uncertainty
  • Introduced by Lai and Robbins (1985) (?)
  • i.i.d. payoffs
  • X_{1,1}, X_{1,2}, …, X_{1,t}, …
  • X_{2,1}, X_{2,2}, …, X_{2,t}, …
  • Principle
  • Inflated value of an option: the maximum expected
    reward that still looks quite possible given the
    observations so far
  • Select the option with the best inflated value

7
Some definitions
Now t = 11, T_1(t-1) = 4, T_2(t-1) = 6, I_1 = 1, I_2 = 2, … (notation sketched below)
  • Payoff is 0 or 1
  • Arm 1
  • X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, …
  • Arm 2
  • X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, …

(Slide figure: the 0/1 payoff table from slide 4, repeated.)
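A sketch of the notation in standard bandit terms (the exact displays on the slide are not reproduced in the transcript and are assumed here): I_s is the arm played at round s, T_k(t) the number of plays of arm k up to round t, and Δ_k the gap of arm k.

```latex
% I_s \in \{1,\dots,K\}: index of the arm played at round s
% T_k(t): number of times arm k has been played in the first t rounds
T_k(t) = \sum_{s=1}^{t} \mathbf{1}\{I_s = k\},
\qquad
\Delta_k = \mu^* - \mu_k \quad (\mu_k = \mathbb{E}[X_{k,1}],\ \mu^* = \max_k \mu_k),
\qquad
R_n = \sum_{k=1}^{K} T_k(n)\,\Delta_k .
% In the example above: t = 11, T_1(10) = 4, T_2(10) = 6.
```

The pseudo-regret R_n in this form is the quantity used again on the concentration slide later in the talk.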
8
Parametric Bandits (Lai and Robbins)
  • X_{i,t} ~ p_{i,θ_i}(·), θ_i unknown, t = 1, 2, …
  • Uncertainty set: reasonable values of θ given
    the experience so far, U_{i,t} = { θ : p_{i,θ}(X_{i,1:T_i(t)})
    is "large" modulo (t, T_i(t)) }
  • Inflated values: Z_{i,t} = max { E_θ[X] : θ ∈ U_{i,t} }
  • Rule: I_t = arg max_i Z_{i,t}

9
Bounds
  • Upper bound
  • Lower bound: if an algorithm is uniformly good,
    then…
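The displayed bounds themselves are not in the transcript; for reference, the classical Lai and Robbins (1985) lower bound for uniformly good strategies has the following asymptotic form (a sketch, with KL denoting the Kullback-Leibler divergence between the payoff distribution of a suboptimal arm k and that of the optimal arm):

```latex
% Any uniformly good algorithm must pull each suboptimal arm k at least
% logarithmically often:
\liminf_{n \to \infty} \frac{\mathbb{E}[T_k(n)]}{\log n}
  \;\ge\; \frac{1}{\mathrm{KL}\!\left(p_{\theta_k},\, p_{\theta^*}\right)},
% hence the expected regret grows at least as a constant times \log n.
```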

10
UCB1 Algorithm (Auer et al., 2002)
  • Algorithm UCB1(b)
  • Try all options once
  • Use option k with the highest index
  • Regret bound
  • R_n = expected loss due to not selecting the best
    option up to time step n. Then… (a sketch of UCB1
    follows below)
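The exact index formula is not reproduced in the transcript. Below is a minimal sketch of UCB1 in the spirit of Auer et al. (2002), with the payoff range bound b folded into the exploration term as the name UCB1(b) suggests; the exact constant used in the talk may differ.

```python
import math
import random

def ucb1(pull, K, n, b=1.0):
    """UCB1(b): try every option once, then always play the arm with the
    highest index  mean_k + b * sqrt(2 * ln(t) / T_k)."""
    T = [0] * K          # number of pulls of each arm
    s = [0.0] * K        # total payoff collected from each arm
    for k in range(K):   # initialization: try all options once
        s[k] += pull(k)
        T[k] += 1
    for t in range(K + 1, n + 1):
        k = max(range(K), key=lambda i: s[i] / T[i]
                + b * math.sqrt(2.0 * math.log(t) / T[i]))
        s[k] += pull(k)
        T[k] += 1
    return T

# Usage on the two-armed 0/1 example (illustrative probabilities):
p = [0.4, 0.6]
counts = ucb1(lambda k: 1.0 if random.random() < p[k] else 0.0, K=2, n=1000)
print("pull counts:", counts)
```

With a log-sized exploration bonus of this form, a suboptimal arm is pulled only O(log n) times in expectation, which yields the logarithmic expected-regret bound of Auer et al. (2002).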

11
Problem 1
  • When b² ≫ σ², the regret should scale with σ² and
    not with b²!

12
UCB1-NORMAL
  • Algorithm UCB1-NORMAL
  • Try all options once
  • Use option k with the highest index
  • Regret bound

13
Problem 1
  • The regret of UCB1(b) scales with O(b²)
  • The regret of UCB1-NORMAL scales with O(σ²)
  • but UCB1-NORMAL assumes normally distributed
    payoffs
  • UCB-Tuned(b)
  • Good experimental results
  • No theoretical guarantees

14
UCB-V
  • Algorithm UCB-V(b)
  • Try all options once
  • Use option k with the highest index
  • Regret bound
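The UCB-V index is again not reproduced in the transcript. As a sketch of its form (following my reading of the ALT 2007 paper, so the constants should be checked against it), the index of an arm played s times by round t combines the empirical mean, the empirical variance, and the range bound b:

```latex
% \bar{X}_{k,s}: empirical mean of arm k after s plays
% V_{k,s}: empirical variance of arm k after s plays
% b: a priori bound on the payoffs; \zeta, c: exploration parameters
B_{k,s,t} \;=\; \bar{X}_{k,s}
  \;+\; \sqrt{\frac{2\, V_{k,s}\, \zeta \log t}{s}}
  \;+\; c\,\frac{3\, b\, \zeta \log t}{s},
\qquad
I_t = \arg\max_k B_{k,\,T_k(t-1),\,t} .
```

With an exploration term driven by the empirical variance V_{k,s}, the a priori bound b only enters through the lower-order (log t)/s term, which is how the regret's dependence moves from b² to σ², as the conclusions slide states.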

15
Proof
  • The missing bound (hunch.net)
  • Bounding the sampling times of suboptimal arms
    (new bound)

16
Can we decrease exploration?
  • Algorithm UCB-V(b, ζ, c)
  • Try all options once
  • Use option k with the highest index
  • Theorem
  • When ζ < 1, the regret will be polynomial for some
    bandit problems
  • When cζ < 1/6, the regret will be polynomial for
    some bandit problems

17
Concentration bounds
  • Averages concentrate
  • Does the regret of UCB concentrate?

RISK??
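For context (not on the slide): "averages concentrate" is usually made precise by a Bernstein-type inequality, which is also the kind of bound behind variance-based indices such as UCB-V; a standard form, assuming i.i.d. payoffs with |X_i - μ| ≤ b, is:

```latex
% X_1,\dots,X_s i.i.d. with mean \mu, variance \sigma^2, |X_i - \mu| \le b
\Pr\!\left( \left| \frac{1}{s}\sum_{i=1}^{s} X_i - \mu \right| \ge \varepsilon \right)
  \;\le\; 2 \exp\!\left( - \frac{s\,\varepsilon^2}{2\sigma^2 + \tfrac{2}{3}\, b\, \varepsilon} \right).
```

The question the next slide answers is whether the regret itself, and not just the per-arm averages, enjoys this kind of exponential concentration.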
18
Logarithmic regret implies high risk
  • Theorem: Consider the pseudo-regret R_n = Σ_{k=1}^K
    T_k(n) Δ_k.
  • Then for any γ > 1 and z > γ log(n), P(R_n > z) ≤
    C z^{-γ}
  • (Gaussian tail would be P(R_n > z) ≈ C exp(-z²))
  • Illustration
  • Two arms, gap Δ_2 = μ_2 - μ_1 > 0.
  • Modes of the law of R_n at O(log(n)) and O(Δ_2 n)!

Only happens when the support of the second-best
arm's distribution overlaps with that of the
optimal arm
19
Finite horizon PAC-UCB
  • Algorithm PAC-UCB(N)
  • Try all options once
  • Use option k with the highest index
  • Theorem
  • At time N, with probability at least 1 - 1/N, the
    number of suboptimal plays is bounded by O(log(K N)).
  • Good when N is known beforehand

20
Conclusions
  • Taking into account the variance lessens
    dependence on the a priori bound b
  • Low expected regret ⇒ high risk
  • PAC-UCB
  • Finite regret, known horizon, exponential
    concentration of the regret
  • Optimal balance? Other algorithms?
  • Greater generality: look up the paper!

21
Thank you!
  • Questions?

22
References
  • Optimism in the face of uncertainty: Lai, T. L.
    and Robbins, H. (1985). Asymptotically efficient
    adaptive allocation rules. Advances in Applied
    Mathematics, 6:4-22.
  • UCB1 and more: Auer, P., Cesa-Bianchi, N., and
    Fischer, P. (2002). Finite-time analysis of the
    multiarmed bandit problem. Machine Learning,
    47(2-3):235-256.
  • Audibert, J.-Y., Munos, R., and Szepesvári, Cs.
    (2007). Tuning bandit algorithms in stochastic
    environments. ALT 2007.