1
Using upper confidence bounds to control
exploration and exploitation
  • Csaba Szepesvári
  • UofA
  • December 8, 2006
  • On-line Trading of Exploration and Exploitation
  • NIPS 2006 Workshop
  • szepesva@cs.ualberta.ca

2
Contents
  • Bandit problems
  • Upper confidence based algorithms
  • Bandits in continuous time
  • Bandits with large action spaces
  • Conclusions

3
Exploration vs. Exploitation
  • Two treatments
  • Unknown success probabilities
  • Goal
  • find the best treatment while losing the
    smallest number of patients
  • Explore or exploit?

4
Exploration vs. Exploitation: Some Applications
  • Simple processes
  • Clinical trials
  • Job shop scheduling (random jobs)
  • What ad to put on a web-page
  • More complex processes (memory)
  • Optimizing production
  • Controlling an inventory
  • Optimal investment
  • Poker

5
Bandit Problems: Optimism in the Face of Uncertainty
  • Introduced by Lai and Robbins (1985) (?)
  • Payoffs of two options are i.i.d.
  • $X_{1,1}, X_{1,2}, \dots, X_{1,t}, \dots$
  • $X_{2,1}, X_{2,2}, \dots, X_{2,t}, \dots$
  • Principle
  • Inflated value of an option = the maximum expected
    reward that still looks quite possible given the
    observations so far
  • Select the option with the best inflated value

6
Parametric Bandits (Lai & Robbins)
  • $X_{i,t} \sim p_{i,\theta_i}(\cdot)$, $\theta_i$ unknown, $t = 1, 2, \dots$
  • Uncertainty set: reasonable values of $\theta$ given
    the experience so far,
    $U_{i,t} = \{\theta : P_\theta(X_{i,1}, \dots, X_{i,T_i(t)}) \text{ is "large"}\}$
    (the threshold depends on $t$ and $T_i(t)$)
  • Inflated values: $Z_{i,t} = \max\{\, E_\theta[X] : \theta \in U_{i,t} \,\}$
  • Rule: $I_t = \arg\max_i Z_{i,t}$

7
Bounds
  • Upper bound
  • Lower bound: if an algorithm is uniformly good,
    then... (a reconstruction of the standard bound follows below)
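The bound itself was shown as an image and is not in the transcript. What this slide presumably states is the classical Lai and Robbins (1985) result: for any uniformly good allocation rule and any suboptimal option $i$,

$$\liminf_{n \to \infty} \frac{\mathbb{E}[T_i(n)]}{\ln n} \;\ge\; \frac{1}{D\!\left(p_{\theta_i} \,\|\, p_{\theta^*}\right)},$$

where $T_i(n)$ is the number of times option $i$ is chosen up to time $n$ and $D(\cdot \| \cdot)$ is the Kullback-Leibler divergence between the payoff distribution of option $i$ and that of the best option.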

8
UCB1 Algorithm (Auer et al., 2002)
  • Algorithm
  • Try all options once
  • Then use the option j with the highest index (p > 2)
  • Regret bound
  • R_n = expected loss due to not selecting the best
    option up to time step n. Then... (a sketch of the
    algorithm follows below)
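A minimal Python sketch of this kind of index policy, not the authors' implementation: it assumes an index of the form empirical mean plus $\sqrt{p \ln t / (2 s)}$ with exploration exponent p > 2. For reference, the classical UCB1 of Auer et al. (2002) uses the bonus $\sqrt{2 \ln t / s}$ and has expected regret $R_n \le \sum_{i:\Delta_i > 0} 8 \ln n / \Delta_i + (1 + \pi^2/3) \sum_j \Delta_j$, where $\Delta_i$ is the gap to the best option.

```python
import math
import random

def ucb1(arms, n_rounds, p=2.5):
    """UCB1-style index policy: `arms` is a list of no-argument callables
    returning a stochastic reward in [0, 1]; `p` is the exploration
    exponent (p > 2, as on the slide)."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)

    # Try all options once.
    for i, arm in enumerate(arms):
        sums[i] += arm()
        counts[i] = 1

    # Afterwards, always use the option with the highest index.
    for t in range(len(arms) + 1, n_rounds + 1):
        index = [sums[i] / counts[i]
                 + math.sqrt(p * math.log(t) / (2 * counts[i]))
                 for i in range(len(arms))]
        j = max(range(len(arms)), key=lambda i: index[i])
        sums[j] += arms[j]()
        counts[j] += 1
    return counts

# Example: two Bernoulli "treatments" with unknown success probabilities.
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
print(ucb1(arms, 10000))  # the better treatment should dominate the counts
```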

9
Bandits in continuous time
András György, Levente Kocsis, Ivett Szabó
10
Bandits in Continuous Time
  • Task scheduling
  • Random tasks, finite types
  • Several proc. options
  • Until processing is done, no other task can be
    processed

11
Formal framework
  • i.i.d. covariates $x_1, x_2, \dots, x_t, \dots \in X$, $|X| < \infty$
  • rewards and delays $(r_{i,t}(x), \tau_{i,t}(x))$:
    $r_{\min} \le r_{i,t}(x) \le r_{\max}$,
    $\tau_{\min} \le \tau_{i,t}(x) \le \tau_{\max}$,
    $r_i(x) = \mathbb{E}[r_{i,1}(x)]$,
    $\tau_i(x) = \mathbb{E}[\tau_{i,1}(x)]$

12
Evaluating allocation rules (policies)
  • $I_t$ = choice at time t
  • Reward at time t: $r_t = r_{I_t,t}(x_t)$
  • Delay at time t: $\tau_t = \tau_{I_t,t}(x_t)$
  • Gain of a policy (see the reconstruction below)
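The gain formula on this slide was an image. A plausible reconstruction, assuming the standard renewal-reward notion of long-run average reward per unit time for a stationary policy u, is

$$\lambda_u \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t \le N(T)} r_t \Big] \;=\; \frac{\mathbb{E}\left[ r_{u(x)}(x) \right]}{\mathbb{E}\left[ \tau_{u(x)}(x) \right]},$$

where $N(T)$ is the number of tasks completed by time $T$, $u(x)$ is the option the policy assigns to covariate $x$, and the expectations on the right are over the covariate distribution; the second equality is the renewal-reward theorem.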

13
Gain, action values and regret
  • Optimal gain
  • Action values
  • Regret
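The formulas on this slide were images as well. Under the average-reward framing above, the optimal gain is presumably $\lambda^* = \max_u \lambda_u$, and the regret over a horizon $T$ is the shortfall of the collected reward relative to the optimal gain,

$$R_T \;=\; \lambda^* T \;-\; \mathbb{E}\Big[ \sum_{t \le N(T)} r_t \Big];$$

in semi-Markov average-reward problems the action value of option $i$ at covariate $x$ is typically the gain-adjusted quantity $r_i(x) - \lambda^* \tau_i(x)$, whose maximizer recovers the optimal policy.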

14
Model-based UCB
  • Estimate $p(x)$, $r_i(x)$, $\tau_i(x)$
  • Choose the policy $u_t = \arg\max_u \{\, \lambda_u(p, r, \tau) : (p, r, \tau) \in M_t \,\}$
    (with $M_t$ a set of models that are plausible given the data)
  • Problem: the regret will depend on $\max_x 1/p(x)$!
  • Idea: avoid estimating p!

15
Algorithm
16
Regret bound
  • Theorem

17
Key proposition
  • Proposition: Assume that the following conditions
    are satisfied. Then...

18
Open problems
  • Use variance estimates
  • Use the average reward per step so far
  • Burnetas & Katehakis (MOR, 1997)
  • Inflate only the current action-value

19
...when the number of actions is large
  • Levente Kocsis
  • Rémi Munos

20
Bandits with large action-spaces
  • Problem
  • Bandit problem, $0 \le X_{i,t} \le 1$, i.i.d.
  • Number of actions is large
  • Regret bound for UCB1
  • Scales badly with the number of suboptimal
    actions!
  • Example: planning in sequential problems, where an
    "action" = a sequence of actions

21
Structure helps!
  • What if actions with similar payoffs are grouped
    together?

22
UCT: Upper Confidence based Tree search
  • Rule 1: Keep a counter and an average in each
    node
  • Rule 2: View each node as a bandit problem
    (a code sketch of both rules follows below)
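A minimal Python sketch of these two rules (the tree-growing and rollout phases of UCT are omitted); the exploration constant c and the exact index form are assumptions, not taken from the slide:

```python
import math

class Node:
    """Rule 1: each node keeps a visit counter and a running average."""
    def __init__(self, n_children=0):
        self.count = 0
        self.value = 0.0
        self.children = [Node() for _ in range(n_children)]

def uct_select(node, c=1.0):
    """Rule 2: view the node as a bandit problem and pick the child
    with the highest upper confidence index."""
    def index(child):
        if child.count == 0:
            return float("inf")  # visit unexplored children first
        bonus = c * math.sqrt(2 * math.log(max(node.count, 1)) / child.count)
        return child.value + bonus
    return max(range(len(node.children)),
               key=lambda i: index(node.children[i]))

def backup(node, reward):
    """After a simulated episode, update the node's counter and average."""
    node.count += 1
    node.value += (reward - node.value) / node.count
```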

23
Example (t = 1)
[Tree diagram: per-node visit counts/averages and UCB indices after t = 1]
24
Example (t = 2)
[Tree diagram: per-node visit counts/averages and UCB indices after t = 2]
25
Example (t = 3)
[Tree diagram: per-node visit counts/averages and UCB indices after t = 3]
26
Example (t = 4)
[Tree diagram: per-node visit counts/averages and UCB indices after t = 4]
27
What is the next time a suboptimal action is
sampled?
  • Estimated value of good actions ≈ 1
  • Estimated value of bad actions ≈ 0
  • A bad action is selected when its confidence index
    catches up with the good actions'...
  • ⇒ t = 6
  • Next?
  • ⇒ t = 15
  • ⇒ t = 30 ... (a sketch reproducing these times follows below)
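A small Python check of this calculation, assuming the standard UCB1 bonus $\sqrt{2 \ln t / s}$ and the idealized empirical means above (good ≈ 1, bad ≈ 0); with this bookkeeping the times at which the bad action's index catches up come out to 6, 15, 30, matching the slide:

```python
import math

def next_bad_pick(t, n_bad):
    """Smallest t at which the bad arm's UCB index reaches the good arm's.
    Bad arm: empirical mean 0, played n_bad times; good arm: mean 1,
    played t - n_bad times. Bonus: sqrt(2 ln t / plays)."""
    while True:
        t += 1
        n_good = t - n_bad
        bad_index = 0.0 + math.sqrt(2 * math.log(t) / n_bad)
        good_index = 1.0 + math.sqrt(2 * math.log(t) / n_good)
        if bad_index >= good_index:
            return t

t, n_bad = 2, 1                # both arms tried once
for _ in range(3):
    t = next_bad_pick(t, n_bad)
    n_bad += 1
    print(t)                   # prints 6, 15, 30
```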

28
UCT variations
  • Game tree search (minimax)
  • Transposition tables (Hashing)
  • Alternative bias sequence
  • l = steps to leaf, d = depth in tree
  • d = 0, l = L: e = 1/2;  d = L, l = 0: e = 1
  • Stopping episodes earlier and using evaluation
    functions
  • Iterative deepening

29
UCT variations
  • Clever move-selection (prior knowledge in the
    form of policies)
  • Mixing conceptually different action groupings
  • Overlapping action groups

30
Theoretical results
  • Consistency: the probability of choosing the best
    action converges to 1 (no bias)
  • Rate of convergence: the rate of convergence
    depends only on the effective size of the tree

31
Planning in MDPs: Sailing
  • Goal: search for an optimal action

http://www.sor.princeton.edu/rvdb/sail/sail.html
32
Planning in MDPs: Sailing
  • Modifications to UCT
  • Terminating episodes with prob. $1/T_s(t)$
  • Use a value function estimate $(1 + \epsilon(x))V(x)$,
    $\epsilon(x) \in [-0.1, 0.1]$
  • Evaluation
  • 1000 random initial states
  • Loss: $Q(x_i, a) - V(x_i)$, averaged
  • Competitors
  • ARTDP + Boltzmann (Barto et al., 1991)
  • PG-ID (Péret & Garcia, ECAI'04)

33
Planning in MDPs: Sailing
[Results plot: comparison with PG-ID (Péret & Garcia, ECAI'04)]
34
Results in games
  • Go
  • UCT advantage: 300 ELO points
  • MoGo (Gelly et al.)
  • Ranked #1 on CGOS since August 2006
  • Kiseido Go Server tournament winner (13x13, 9x9)
  • CrazyStone (UCT): R. Coulom
  • Viking (Valkyria-UCT): M. Persson
  • Other games
  • Amazons
  • Clobber

35
Thank you!
  • Questions?

36
References
  • Optimism in the face of uncertainty: Lai, T. L.
    and Robbins, H. (1985). Asymptotically efficient
    adaptive allocation rules. Advances in Applied
    Mathematics, 6:4–22.
  • UCB1: Auer, P., Cesa-Bianchi, N., and Fischer, P.
    (2002). Finite-time analysis of the multiarmed
    bandit problem. Machine Learning, 47(2–3):235–256.
  • Audibert, J., Munos, R., and Szepesvári, Cs.
    (2006). Use of variance estimation in the
    multi-armed bandit problem. NIPS Workshop on
    Exploration and Exploitation.
  • UCT: Kocsis, L. and Szepesvári, Cs. (2006).
    Bandit based Monte-Carlo planning. In ECML-06.
  • Go: http://cgos.boardspace.net/9x9.html,
    http://senseis.xmp.net/?MoGo
  • György, A., Kocsis, L., Szabó, I., and
    Szepesvári, Cs. (2007). Continuous time
    associative bandit problems. In IJCAI-07.