Title: Using upper confidence bounds to control exploration and exploitation
1. Using upper confidence bounds to control exploration and exploitation
- Csaba Szepesvári
- UofA
- December 8, 2006
- On-line Trading of Exploration and Exploitation
- NIPS 2006 Workshop
- szepesva_at_cs.ualberta.ca
2. Contents
- Bandit problems
- Upper confidence based algorithms
- Bandits in continuous time
- Bandits with large action spaces
- Conclusions
3. Exploration vs. Exploitation
- Two treatments
- Unknown success probabilities
- Goal: find the best treatment while losing the smallest number of patients
- Explore or exploit?
4. Exploration vs. Exploitation: Some Applications
- Simple processes
- Clinical trials
- Job shop scheduling (random jobs)
- Which ad to put on a web page
- More complex processes (memory)
- Optimizing production
- Controlling an inventory
- Optimal investment
- Poker
5. Bandit Problems: Optimism in the Face of Uncertainty
- Introduced by Lai and Robbins (1985) (?)
- Payoffs of two options are i.i.d.
- $X_{11}, X_{12}, \ldots, X_{1t}, \ldots$
- $X_{21}, X_{22}, \ldots, X_{2t}, \ldots$
- Principle
- Inflated value of an option = the maximum expected reward that still looks quite possible given the observations so far
- Select the option with the best inflated value
6. Parametric Bandits: Lai and Robbins
- $X_{it} \sim p_{i,\theta_i}(\cdot)$, $\theta_i$ unknown, $t = 1, 2, \ldots$
- Uncertainty set: reasonable values of $\theta$ given the experience so far, $U_{i,t} = \{\theta : P(X_{i,1}, \ldots, X_{i,T_i(t)} \mid \theta) \text{ is larger than a threshold depending on } (t, T_i(t))\}$, where $T_i(t)$ is the number of times option $i$ was chosen up to time $t$
- Inflated values: $Z_{i,t} = \max\{E_\theta[X] : \theta \in U_{i,t}\}$
- Rule: $I_t = \arg\max_i Z_{i,t}$ (a sketch of this index follows)
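A minimal sketch of such an inflated index for Bernoulli payoffs. The instantiation is mine, not necessarily the construction on the slide: the set $U_{i,t}$ is taken to be the means whose KL divergence from the empirical mean is at most $\log t / T_i(t)$ (a threshold chosen in the spirit of the Lai-Robbins analysis), and its upper end is found by bisection.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def inflated_value(mean, pulls, t):
    """Upper end of U = {theta : pulls * KL(mean, theta) <= log t},
    found by bisection (KL(mean, .) is increasing above `mean`)."""
    radius = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= radius:
            lo = mid
        else:
            hi = mid
    return lo

# at t = 100, an option with empirical mean 0.4 after 10 pulls:
print(inflated_value(0.4, 10, 100))   # best mean that still "looks quite possible"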
7. Bounds
- Upper bound: the index rule samples each suboptimal option only $O(\log t)$ times
- Lower bound: if an algorithm is uniformly good, then it must sample each suboptimal option $i$ at least $\sim \log t / D(p_{\theta_i} \| p_{\theta^*})$ times, so logarithmic regret is unavoidable
8. UCB1 Algorithm (Auer et al., 2002)
- Algorithm (sketched below)
- Try all options once
- Then use the option $j$ with the highest index $\bar{X}_{j,t} + \sqrt{\frac{p \ln t}{2 T_j(t)}}$ (with $p > 2$)
- Regret bound
- $R_n$: expected loss due to not selecting the best option up to time step $n$. Then $R_n = O\big(\sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i}\big)$, where $\Delta_i$ is the gap between the payoff of option $i$ and that of the best option.
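A runnable sketch of UCB1 under these rules. The index uses the slide's $p > 2$ parameterization; the two Bernoulli arms in the demo (success probabilities 0.6 and 0.5) are illustrative stand-ins.

```python
import math, random

def ucb1(arms, n, p=2.5):
    """UCB1: try every option once, then repeatedly use the option j with
    the highest index  mean_j + sqrt(p * ln t / (2 * T_j)).
    `arms` is a list of reward samplers; the bound on slide 8 needs p > 2."""
    k = len(arms)
    pulls = [0] * k
    sums = [0.0] * k
    for i in range(k):                       # try all options once
        sums[i] += arms[i]()
        pulls[i] += 1
    for t in range(k + 1, n + 1):
        j = max(range(k), key=lambda i: sums[i] / pulls[i]
                + math.sqrt(p * math.log(t) / (2 * pulls[i])))
        sums[j] += arms[j]()
        pulls[j] += 1
    return pulls

# demo: the better arm should collect almost all of the pulls
print(ucb1([lambda: float(random.random() < 0.6),
            lambda: float(random.random() < 0.5)], n=10000))
```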
9. Bandits in continuous time
- András György, Levente Kocsis, Ivett Szabó
10. Bandits in Continuous Time
- Task scheduling
- Random tasks, finite types
- Several processing options
- Until processing is done, no other task can be
processed
11. Formal framework
- i.i.d. covariates $x_1, x_2, \ldots, x_t, \ldots \in X$, $|X| < \infty$
- rewards and delays $(r_{it}(x), \tau_{it}(x))$ with $r_{\min} \le r_{it}(x) \le r_{\max}$ and $\tau_{\min} \le \tau_{it}(x) \le \tau_{\max}$
- means: $r_i(x) = E[r_{i1}(x)]$, $\tau_i(x) = E[\tau_{i1}(x)]$
12. Evaluating allocation rules (policies)
- $I_t$: choice at time $t$
- Reward at time $t$: $r_t = r_{I_t, t}(x_t)$
- Delay at time $t$: $\tau_t = \tau_{I_t, t}(x_t)$
- Gain of a policy $u$: its long-run average reward per unit time, $\lambda^u = \lim_{n \to \infty} \frac{\sum_{t=1}^n r_t}{\sum_{t=1}^n \tau_t}$
13. Gain, action values and regret
- Optimal gain: $\lambda^* = \max_u \lambda^u$
- Action values: the advantage of starting with action $i$ on covariate $x$ and following an optimal policy afterwards
- Regret: the total reward lost, relative to gain $\lambda^*$, for not following an optimal policy (a sketch of the gain computation follows)
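A sketch of how a policy's empirical gain would be computed under the definitions of slide 12, assuming (as slide 18's "average reward per step so far" suggests) that gain means total reward divided by total elapsed delay. `policy`, `reward_fn`, and `delay_fn` are hypothetical stand-ins for the allocation rule and the environment.

```python
def empirical_gain(policy, contexts, reward_fn, delay_fn):
    """Empirical gain of an allocation rule: total reward earned divided by
    total time elapsed (the sum of the incurred delays)."""
    total_reward = 0.0
    total_time = 0.0
    for t, x in enumerate(contexts):
        i = policy(t, x)                 # I_t: option chosen at time t
        total_reward += reward_fn(i, x)  # r_t = r_{I_t, t}(x_t)
        total_time += delay_fn(i, x)     # tau_t = tau_{I_t, t}(x_t)
    return total_reward / total_time
```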
14. Model-based UCB
- Estimate $p(x)$, $r_i(x)$, $\tau_i(x)$
- Choose the policy $u_t = \arg\max_u \{\lambda^u(p, r, \tau) : (p, r, \tau) \in M_t\}$, where $M_t$ is the set of models still consistent with the data
- Problem: the regret will depend on $\max_x 1/p(x)$!
- Idea: avoid estimating $p$!
15. Algorithm
16. Regret bound
17. Key proposition
- Proposition: Assume that the following conditions are satisfied. Then ...
18. Open problems
- Use variance estimates
- Use the average reward per step so far
- Burnetas and Katehakis (MOR, 1997)
- Inflate only the current action-value
19. ...when the number of actions is large
- Levente Kocsis
- Rémi Munos
20. Bandits with large action-spaces
- Problem
- Bandit problem, $0 \le X_{it} \le 1$, i.i.d.
- Number of actions is large
- Regret bound for UCB1
- Scales badly with the number of suboptimal actions! (see the instantiation below)
- Example: planning in sequential problems, where an action = a sequence of elementary actions
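To see the scaling, instantiate the UCB1 bound from slide 8 with $K - 1$ suboptimal actions; assuming, for simplicity, a common gap $\Delta$:

$$R_n = O\left(\sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i}\right) = O\left(\frac{(K-1)\ln n}{\Delta}\right),$$

i.e. the bound grows linearly with the number of suboptimal actions.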
21. Structure helps!
- What if actions with similar payoffs are grouped
together?
(figure: payoffs arranged on the interval [0, 1], with similar actions grouped together)
22. UCT: Upper Confidence based Tree search
- Rule 1: Keep a counter and an average in each node
- Rule 2: View each node as a bandit problem (see the sketch below)
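A minimal sketch of the two rules on a small, fully built tree. The `Node` class and the deterministic leaf payoffs are illustrative assumptions, not the authors' implementation (UCT proper grows the tree incrementally and estimates leaf values by Monte-Carlo rollouts).

```python
import math

class Node:
    """Rule 1: every node keeps a counter (visits) and an average (value)."""
    def __init__(self, children=None, reward=0.0):
        self.children = children or []   # empty list => leaf
        self.reward = reward             # leaf payoff (deterministic here)
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=math.sqrt(2)):
    """Rule 2: view the node as a bandit over its children (UCB1-style)."""
    return max(node.children,
               key=lambda ch: float('inf') if ch.visits == 0
               else ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

def uct_episode(root):
    """One episode: descend with UCB, then propagate the payoff back up,
    updating each visited node's counter and running average."""
    path = [root]
    while path[-1].children:
        path.append(uct_select(path[-1]))
    payoff = path[-1].reward
    for node in path:
        node.visits += 1
        node.value += (payoff - node.value) / node.visits
    return payoff

# demo: the right subtree hides the best leaf
root = Node([Node([Node(reward=0.4), Node(reward=0.3)]),
             Node([Node(reward=0.2), Node(reward=1.0)])])
for _ in range(1000):
    uct_episode(root)
print([round(ch.value, 2) for ch in root.children])
```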
23-26. Example ($t = 1, 2, 3, 4$)
(figures: four snapshots of UCT on a small example tree; each node is labeled with its visit count / payoff sum and its upper-confidence index, updated along the sampled path from one snapshot to the next)
27. When is a suboptimal action sampled next?
- Estimated value of good actions: 1
- Estimated value of bad actions: 0
- A bad action is selected when its upper confidence bound catches up with the good actions' index
- $\Rightarrow t = 6$
- Next?
- $\Rightarrow t = 15$
- $\Rightarrow t = 30$ ... so the intervals between successive samples of a bad action roughly double; a numeric check follows below
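These times can be reproduced with the standard UCB1 index $\sqrt{2 \ln t / T_i}$ (a back-of-the-envelope reconstruction, assuming this index and that all remaining pulls went to the good arm): the bad arm overtakes the good one exactly at $t = 6, 15, 30$ after $s = 1, 2, 3$ pulls.

```python
import math

def next_bad_sample(s):
    """Smallest t at which a bad arm (mean estimate 0, pulled s times) has a
    UCB1 index at least that of the good arm (mean estimate 1, pulled the
    remaining t - s times)."""
    t = s + 1
    while (math.sqrt(2 * math.log(t) / s)
           < 1 + math.sqrt(2 * math.log(t) / (t - s))):
        t += 1
    return t

for s in (1, 2, 3):
    print(next_bad_sample(s))   # prints 6, 15, 30, matching the slide
```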
28. UCT variations
- Game tree search (minimax)
- Transposition tables (hashing)
- Alternative bias sequence
- $\ell$: steps to the leaf, $d$: depth in the tree
- $d = 0, \ell = L \Rightarrow e = 1/2$; $d = L, \ell = 0 \Rightarrow e = 1$
- Stopping episodes earlier and using evaluation functions
- Iterative deepening
29. UCT variations
- Clever move-selection (prior knowledge in the form of policies)
- Mixing conceptually different action groupings
- Overlapping action groups
30. Theoretical results
- Consistency: the probability of choosing the best action converges to 1 (no bias)
- Rate of convergence: the rate of convergence depends only on the effective size of the tree
31. Planning in MDPs: Sailing
- Goal: search for an optimal action
- http://www.sor.princeton.edu/rvdb/sail/sail.html
32. Planning in MDPs: Sailing
- Modifications to UCT
- Terminating episodes with probability $1/T_s(t)$
- Use a value function estimate $(1 + \epsilon(x)) V(x)$, $\epsilon(x) \in [-0.1, 0.1]$
- Evaluation
- 1000 random initial states
- Loss: $Q(x_i, a) - V(x_i)$, averaged
- Competitors
- ARTDP + Boltzmann exploration (Barto et al., 1991)
- PG-ID (Péret and Garcia, ECAI-04)
33. Planning in MDPs: Sailing
(figure: performance comparison against PG-ID (Péret and Garcia, ECAI-04))
34. Results in games
- Go
- UCT advantage: 300 ELO points
- MoGo (Gelly et al.)
- Ranked #1 on CGOS since August 2006
- Kiseido Go Server tournament winner (13x13, 9x9)
- CrazyStone (UCT), R. Coulom
- Viking (Valkyria-UCT), M. Persson
- Other games
- Amazons
- Clobber
35. Thank you!
36. References
- Optimism in the face of uncertainty: Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
- UCB1: Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256.
- Audibert, J., Munos, R., and Szepesvári, Cs. (2006). Use of variance estimation in the multi-armed bandit problem. NIPS Workshop on Exploration and Exploitation.
- UCT: Kocsis, L. and Szepesvári, Cs. (2006). Bandit based Monte-Carlo planning. ECML-06.
- Go: http://cgos.boardspace.net/9x9.html, http://senseis.xmp.net/?MoGo
- György, A., Kocsis, L., Szabó, I., and Szepesvári, Cs. (2007). Continuous time associative bandit problems. In IJCAI-07.