Title: Using upper confidence bounds to control exploration and exploitation
1. Using upper confidence bounds to control exploration and exploitation
- Csaba Szepesvári
- UofA
- December 8, 2006
- On-line Trading of Exploration and Exploitation
- NIPS 2006 Workshop
- szepesva_at_cs.ualberta.ca
2. Contents
- Bandit problems
- Upper confidence based algorithms
- Bandits in continuous time
- Bandits with large action spaces
- Conclusions
3. Exploration vs. Exploitation
- Two treatments
- Unknown success probabilities
- Goal: find the best treatment while losing the smallest number of patients
- Explore or exploit?
4. Exploration vs. Exploitation: Some Applications
- Simple processes
- Clinical trials
- Job shop scheduling (random jobs)
- Which ad to put on a web page
- More complex processes (memory)
- Optimizing production
- Controlling an inventory
- Optimal investment
- Poker
5. Bandit Problems: Optimism in the Face of Uncertainty
- Introduced by Lai and Robbins (1985) (?)
- Payoffs of two options are i.i.d.
- $X_{11}, X_{12}, \ldots, X_{1t}, \ldots$
- $X_{21}, X_{22}, \ldots, X_{2t}, \ldots$
- Principle
- Inflated value of an option = the maximum expected reward that still looks quite possible given the observations so far
- Select the option with the best inflated value
6. Parametric Bandits: Lai and Robbins
- $X_{it} \sim p_{i,\theta_i}(\cdot)$, $\theta_i$ unknown, $t = 1, 2, \ldots$
- Uncertainty set: reasonable values of $\theta$ given the experience so far, $U_{i,t} = \{\theta : P(X_{i,1}, \ldots, X_{i,T_i(t)} \mid \theta) \text{ is larger than a threshold depending on } (t, T_i(t))\}$, where $T_i(t)$ is the number of times option $i$ was chosen up to time $t$
- Inflated values: $Z_{i,t} = \max\{E_\theta[X] : \theta \in U_{i,t}\}$
- Rule: $I_t = \arg\max_i Z_{i,t}$ (a sketch of this index follows)
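A minimal sketch of such an inflated index for Bernoulli payoffs. The instantiation is mine, not necessarily the construction on the slide: the set $U_{i,t}$ is taken to be the means whose KL divergence from the empirical mean is at most $\log t / T_i(t)$ (a threshold chosen in the spirit of the Lai-Robbins analysis), and its upper end is found by bisection.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def inflated_value(mean, pulls, t):
    """Upper end of U = {theta : pulls * KL(mean, theta) <= log t},
    found by bisection (KL(mean, .) is increasing above `mean`)."""
    radius = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= radius:
            lo = mid
        else:
            hi = mid
    return lo

# at t = 100, an option with empirical mean 0.4 after 10 pulls:
print(inflated_value(0.4, 10, 100))   # best mean that still "looks quite possible"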
7. Bounds
- Upper bound: the index rule samples each suboptimal option only $O(\log t)$ times
- Lower bound: if an algorithm is uniformly good, then it must sample each suboptimal option $i$ at least $\sim \log t / D(p_{\theta_i} \| p_{\theta^*})$ times, so logarithmic regret is unavoidable
8. UCB1 Algorithm (Auer et al., 2002)
- Algorithm (sketched below)
- Try all options once
- Then use the option $j$ with the highest index $\bar{X}_{j,t} + \sqrt{\frac{p \ln t}{2 T_j(t)}}$ (with $p > 2$)
- Regret bound
- $R_n$: expected loss due to not selecting the best option up to time step $n$. Then $R_n = O\big(\sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i}\big)$, where $\Delta_i$ is the gap between the payoff of option $i$ and that of the best option.
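A runnable sketch of UCB1 under these rules. The index uses the slide's $p > 2$ parameterization; the two Bernoulli arms in the demo (success probabilities 0.6 and 0.5) are illustrative stand-ins.

```python
import math, random

def ucb1(arms, n, p=2.5):
    """UCB1: try every option once, then repeatedly use the option j with
    the highest index  mean_j + sqrt(p * ln t / (2 * T_j)).
    `arms` is a list of reward samplers; the bound on slide 8 needs p > 2."""
    k = len(arms)
    pulls = [0] * k
    sums = [0.0] * k
    for i in range(k):                       # try all options once
        sums[i] += arms[i]()
        pulls[i] += 1
    for t in range(k + 1, n + 1):
        j = max(range(k), key=lambda i: sums[i] / pulls[i]
                + math.sqrt(p * math.log(t) / (2 * pulls[i])))
        sums[j] += arms[j]()
        pulls[j] += 1
    return pulls

# demo: the better arm should collect almost all of the pulls
print(ucb1([lambda: float(random.random() < 0.6),
            lambda: float(random.random() < 0.5)], n=10000))
```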
9. Bandits in continuous time
- András György, Levente Kocsis, Ivett Szabó
10. Bandits in Continuous Time
- Task scheduling
- Random tasks, finite types
- Several processing options
- Until processing is done, no other task can be
processed
11. Formal framework
- i.i.d. covariates $x_1, x_2, \ldots, x_t, \ldots \in X$, $|X| < \infty$
- rewards and delays $(r_{it}(x), \tau_{it}(x))$ with $r_{\min} \le r_{it}(x) \le r_{\max}$ and $\tau_{\min} \le \tau_{it}(x) \le \tau_{\max}$
- means: $r_i(x) = E[r_{i1}(x)]$, $\tau_i(x) = E[\tau_{i1}(x)]$
12. Evaluating allocation rules (policies)
- $I_t$: choice at time $t$
- Reward at time $t$: $r_t = r_{I_t, t}(x_t)$
- Delay at time $t$: $\tau_t = \tau_{I_t, t}(x_t)$
- Gain of a policy $u$: its long-run average reward per unit time, $\lambda^u = \lim_{n \to \infty} \frac{\sum_{t=1}^n r_t}{\sum_{t=1}^n \tau_t}$
13. Gain, action values and regret
- Optimal gain: $\lambda^* = \max_u \lambda^u$
- Action values: the advantage of starting with action $i$ on covariate $x$ and following an optimal policy afterwards
- Regret: the total reward lost, relative to gain $\lambda^*$, for not following an optimal policy (a sketch of the gain computation follows)
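A sketch of how a policy's empirical gain would be computed under the definitions of slide 12, assuming (as slide 18's "average reward per step so far" suggests) that gain means total reward divided by total elapsed delay. `policy`, `reward_fn`, and `delay_fn` are hypothetical stand-ins for the allocation rule and the environment.

```python
def empirical_gain(policy, contexts, reward_fn, delay_fn):
    """Empirical gain of an allocation rule: total reward earned divided by
    total time elapsed (the sum of the incurred delays)."""
    total_reward = 0.0
    total_time = 0.0
    for t, x in enumerate(contexts):
        i = policy(t, x)                 # I_t: option chosen at time t
        total_reward += reward_fn(i, x)  # r_t = r_{I_t, t}(x_t)
        total_time += delay_fn(i, x)     # tau_t = tau_{I_t, t}(x_t)
    return total_reward / total_time
```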
14. Model-based UCB
- Estimate $p(x)$, $r_i(x)$, $\tau_i(x)$
- Choose the policy $u_t = \arg\max_u \{\lambda^u(p, r, \tau) : (p, r, \tau) \in M_t\}$, where $M_t$ is the set of models still consistent with the data
- Problem: the regret will depend on $\max_x 1/p(x)$!
- Idea: avoid estimating $p$!
15. Algorithm
16. Regret bound
17. Key proposition
- Proposition: Assume that the following conditions are satisfied. Then ...
18. Open problems
- Use variance estimates
- Use the average reward per step so far
- Burnetas and Katehakis (MOR, 1997)
- Inflate only the current action-value
19. ...when the number of actions is large
- Levente Kocsis
- Rémi Munos
20. Bandits with large action-spaces
- Problem
- Bandit problem, $0 \le X_{it} \le 1$, i.i.d.
- Number of actions is large
- Regret bound for UCB1
- Scales badly with the number of suboptimal actions! (see the instantiation below)
- Example: planning in sequential problems, where an action = a sequence of elementary actions
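To see the scaling, instantiate the UCB1 bound from slide 8 with $K - 1$ suboptimal actions; assuming, for simplicity, a common gap $\Delta$:

$$R_n = O\left(\sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i}\right) = O\left(\frac{(K-1)\ln n}{\Delta}\right),$$

i.e. the bound grows linearly with the number of suboptimal actions.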
21. Structure helps!
- What if actions with similar payoffs are grouped
together?
(figure: payoffs arranged on the interval [0, 1], with similar actions grouped together)
22. UCT: Upper Confidence based Tree search
- Rule 1: Keep a counter and an average in each node
- Rule 2: View each node as a bandit problem (see the sketch below)
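A minimal sketch of the two rules on a small, fully built tree. The `Node` class and the deterministic leaf payoffs are illustrative assumptions, not the authors' implementation (UCT proper grows the tree incrementally and estimates leaf values by Monte-Carlo rollouts).

```python
import math

class Node:
    """Rule 1: every node keeps a counter (visits) and an average (value)."""
    def __init__(self, children=None, reward=0.0):
        self.children = children or []   # empty list => leaf
        self.reward = reward             # leaf payoff (deterministic here)
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=math.sqrt(2)):
    """Rule 2: view the node as a bandit over its children (UCB1-style)."""
    return max(node.children,
               key=lambda ch: float('inf') if ch.visits == 0
               else ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

def uct_episode(root):
    """One episode: descend with UCB, then propagate the payoff back up,
    updating each visited node's counter and running average."""
    path = [root]
    while path[-1].children:
        path.append(uct_select(path[-1]))
    payoff = path[-1].reward
    for node in path:
        node.visits += 1
        node.value += (payoff - node.value) / node.visits
    return payoff

# demo: the right subtree hides the best leaf
root = Node([Node([Node(reward=0.4), Node(reward=0.3)]),
             Node([Node(reward=0.2), Node(reward=1.0)])])
for _ in range(1000):
    uct_episode(root)
print([round(ch.value, 2) for ch in root.children])
```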
23-26. Example ($t = 1, 2, 3, 4$)
(figures: four snapshots of UCT on a small example tree; each node is labeled with its visit count / payoff sum and its upper-confidence index, updated along the sampled path from one snapshot to the next)
27. When is a suboptimal action sampled next?
- Estimated value of good actions: 1
- Estimated value of bad actions: 0
- A bad action is selected when its upper confidence bound catches up with the good actions' index
- $\Rightarrow t = 6$
- Next?
- $\Rightarrow t = 15$
- $\Rightarrow t = 30$ ... so the intervals between successive samples of a bad action roughly double; a numeric check follows below
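These times can be reproduced with the standard UCB1 index $\sqrt{2 \ln t / T_i}$ (a back-of-the-envelope reconstruction, assuming this index and that all remaining pulls went to the good arm): the bad arm overtakes the good one exactly at $t = 6, 15, 30$ after $s = 1, 2, 3$ pulls.

```python
import math

def next_bad_sample(s):
    """Smallest t at which a bad arm (mean estimate 0, pulled s times) has a
    UCB1 index at least that of the good arm (mean estimate 1, pulled the
    remaining t - s times)."""
    t = s + 1
    while (math.sqrt(2 * math.log(t) / s)
           < 1 + math.sqrt(2 * math.log(t) / (t - s))):
        t += 1
    return t

for s in (1, 2, 3):
    print(next_bad_sample(s))   # prints 6, 15, 30, matching the slide
```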
28. UCT variations
- Game tree search (minimax)
- Transposition tables (hashing)
- Alternative bias sequence
- $\ell$: steps to the leaf, $d$: depth in the tree
- $d = 0, \ell = L \Rightarrow e = 1/2$; $d = L, \ell = 0 \Rightarrow e = 1$
- Stopping episodes earlier and using evaluation functions
- Iterative deepening
29. UCT variations
- Clever move-selection (prior knowledge in the form of policies)
- Mixing conceptually different action groupings
- Overlapping action groups
30. Theoretical results
- Consistency: the probability of choosing the best action converges to 1 (no bias)
- Rate of convergence: the rate of convergence depends only on the effective size of the tree
31. Planning in MDPs: Sailing
- Goal: search for an optimal action
- http://www.sor.princeton.edu/rvdb/sail/sail.html
32. Planning in MDPs: Sailing
- Modifications to UCT
- Terminating episodes with probability $1/T_s(t)$
- Use a value function estimate $(1 + \epsilon(x)) V(x)$, $\epsilon(x) \in [-0.1, 0.1]$
- Evaluation
- 1000 random initial states
- Loss: $Q(x_i, a) - V(x_i)$, averaged
- Competitors
- ARTDP + Boltzmann exploration (Barto et al., 1991)
- PG-ID (Péret and Garcia, ECAI-04)
33. Planning in MDPs: Sailing
(figure: performance comparison against PG-ID (Péret and Garcia, ECAI-04))
34. Results in games
- Go
- UCT advantage: 300 ELO points
- MoGo (Gelly et al.)
- Ranked #1 on CGOS since August 2006
- Kiseido Go Server tournament winner (13x13, 9x9)
- CrazyStone (UCT), R. Coulom
- Viking (Valkyria-UCT), M. Persson
- Other games
- Amazons
- Clobber
35. Thank you!
36. References
- Optimism in the face of uncertainty: Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
- UCB1: Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256.
- Audibert, J., Munos, R., and Szepesvári, Cs. (2006). Use of variance estimation in the multi-armed bandit problem. NIPS Workshop on Exploration and Exploitation.
- UCT: Kocsis, L. and Szepesvári, Cs. (2006). Bandit based Monte-Carlo planning. ECML-06.
- Go: http://cgos.boardspace.net/9x9.html, http://senseis.xmp.net/?MoGo
- György, A., Kocsis, L., Szabó, I., and Szepesvári, Cs. (2007). Continuous time associative bandit problems. In IJCAI-07.