Multi-armed Bandit Problems with Dependent Arms



1
Multi-armed Bandit Problems with Dependent Arms
  • Sandeep Pandey (spandey@cs.cmu.edu)
  • Deepayan Chakrabarti (deepay@yahoo-inc.com)
  • Deepak Agarwal (dagarwal@yahoo-inc.com)

2
Background: Bandits
[Figure: bandit arms]
  • Pull arms sequentially so as to maximize the
    total expected reward
  • Show ads on a webpage to maximize clicks
  • Product recommendation to maximize sales
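The following is a minimal sketch of one traditional bandit policy (a UCB1-style rule), included only to illustrate "pull arms sequentially so as to maximize the total expected reward"; the Bernoulli environment and its reward probabilities are hypothetical stand-ins, not from the slides.

```python
import math
import random

def ucb1(n_arms, pull, horizon):
    """UCB1: try every arm once, then repeatedly pull the arm with the
    highest empirical mean plus confidence bonus."""
    successes = [0] * n_arms
    pulls = [0] * n_arms
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                       # initialization: one pull per arm
        else:
            arm = max(range(n_arms),
                      key=lambda i: successes[i] / pulls[i]
                                    + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = pull(arm)                    # observe a 0/1 reward (e.g., a click)
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward

# Hypothetical environment: Bernoulli arms with fixed reward probabilities.
mus = [0.30, 0.28, 1e-6, 0.31]
print(ucb1(len(mus), lambda i: int(random.random() < mus[i]), horizon=10000))
```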

3
Dependent Arms
  • Reward probabilities µi are generally assumed to
    be independent of each other
  • What if they are dependent?
  • E.g., ads on similar topics, using similar
    text/phrases, should have similar rewards

[Figure: four example ads with their reward probabilities: "Skiing,
snowboarding" (µ1 ≈ 0.3), "Skiing, snowshoes" (µ2 ≈ 0.28), "Get Vonage!"
(µ3 ≈ 10^-6), "Snowshoe rental" (µ4 ≈ 0.31)]
4
Dependent Arms
  • Reward probabilities µi are generally assumed to
    be independent of each other
  • What if they are dependent?
  • E.g., ads on similar topics, using similar
    text/phrases, should have similar rewards
  • A click on one ad ⇒ other similar ads may
    generate clicks as well
  • Can we increase the total reward by exploiting this
    dependency?

5
Cluster Model of Dependence
[Figure: arms grouped into Cluster 1 and Cluster 2]
  • µi ~ f(p), where p is the parameter of arm i's cluster
  • Successes si ~ Bin(ni, µi)
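A minimal generative sketch of this cluster model; the choice of a Beta distribution for f and all parameter values are assumptions for illustration only, since the slides leave f unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster(cluster_params, n_arms, n_pulls):
    """Cluster model of dependence: every arm i in the cluster draws its
    reward probability mu_i from the cluster-level distribution f(p), and
    its successes are Binomial(n_i, mu_i).  Here f is assumed to be a Beta
    distribution, which the slides do not specify."""
    a, b = cluster_params
    mus = rng.beta(a, b, size=n_arms)          # mu_i ~ f(p_cluster)
    successes = rng.binomial(n_pulls, mus)     # s_i ~ Bin(n_i, mu_i)
    return mus, successes

# Two clusters with different (illustrative) parameters p1 and p2.
print(sample_cluster((3, 7), n_arms=4, n_pulls=100))    # cluster 1
print(sample_cluster((1, 99), n_arms=4, n_pulls=100))   # cluster 2
```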
6
Cluster Model of Dependence
µi ~ f(p1) for arms in cluster 1; µi ~ f(p2) for arms in cluster 2
  • Total reward
  • Discounted: Σ_{t=0}^{∞} a^t · E[R(t)], where a is the
    discounting factor
  • Undiscounted: Σ_{t=0}^{T} E[R(t)]

7
Discounted Reward
The optimal policy can be computed using
per-cluster MDPs only.
[Figure: per-cluster belief-state MDPs. The MDP for cluster 1 is over
states (x1, x2) with actions "Pull Arm 1" / "Pull Arm 2"; the MDP for
cluster 2 is over states (x3, x4) with actions "Pull Arm 3" / "Pull Arm 4"]
  • Optimal Policy
  • Compute an (index, arm) pair for each cluster
  • Pick the cluster with the largest index, and
    pull the corresponding arm

8
Discounted Reward
The optimal policy can be computed using
per-cluster MDPs only.
[Figure: the same per-cluster MDPs as on the previous slide]
  • Reduces the problem to smaller state spaces
  • Reduces to Gittins' index theorem [Gittins 1979] for
    independent bandits
  • Approximation bounds on the index for k-step
    lookahead (see the sketch below)

  • Optimal Policy
  • Compute an (index, arm) pair for each cluster
  • Pick the cluster with the largest index, and
    pull the corresponding arm
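The slides do not spell out how the per-cluster index is computed; the following is a minimal sketch of a k-step lookahead index on one cluster's belief-state MDP, under the simplifying assumption that each arm's reward probability has an independent Beta(s+1, n-s+1) posterior (the paper's MDP also couples arms through the cluster prior f). The discount factor and the counts below are illustrative.

```python
from functools import lru_cache

GAMMA = 0.9   # discount factor a (illustrative value)

def cluster_index(arms, k):
    """k-step lookahead index of one cluster.
    `arms` is a tuple of (successes, pulls) pairs for the arms in the cluster.
    Returns (index, arm_to_pull): the lookahead value of the cluster's
    belief-state MDP and the maximizing arm at the root."""

    @lru_cache(maxsize=None)
    def q(state, i, depth):
        """Expected discounted reward of pulling arm i now, then acting
        greedily for depth-1 more steps."""
        s, n = state[i]
        p = (s + 1) / (n + 2)                                # posterior mean of mu_i
        win = state[:i] + ((s + 1, n + 1),) + state[i + 1:]  # belief after a success
        lose = state[:i] + ((s, n + 1),) + state[i + 1:]     # belief after a failure
        return (p * (1 + GAMMA * v(win, depth - 1))
                + (1 - p) * GAMMA * v(lose, depth - 1))

    def v(state, depth):
        if depth == 0:
            return 0.0
        return max(q(state, i, depth) for i in range(len(state)))

    arm = max(range(len(arms)), key=lambda i: q(arms, i, k))
    return q(arms, arm, k), arm

# Optimal policy (sketch): compute an (index, arm) pair for each cluster,
# pick the cluster with the largest index, and pull the corresponding arm.
clusters = [((3, 10), (1, 4)), ((0, 1), (2, 9))]
pairs = [cluster_index(c, k=3) for c in clusters]
best = max(range(len(clusters)), key=lambda c: pairs[c][0])
print("pull arm", pairs[best][1], "of cluster", best)
```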

9
Cluster Model of Dependence
µi ~ f(p1) for arms in cluster 1; µi ~ f(p2) for arms in cluster 2
  • Total reward
  • Discounted: Σ_{t=0}^{∞} a^t · E[R(t)], where a is the
    discounting factor
  • Undiscounted: Σ_{t=0}^{T} E[R(t)]
10
Undiscounted Reward
[Figure: the arms of each cluster grouped into "cluster arm 1" and
"cluster arm 2"]
All arms in a cluster are similar ⇒ they
can be grouped into one hypothetical
"cluster arm"
11
Undiscounted Reward
  • Two-Level Policy
  • In each iteration:
  • Pick a cluster arm using a traditional bandit
    policy
  • Pick an arm within that cluster using a
    traditional bandit policy (a sketch follows below)

Each cluster arm must have some estimated
reward probability
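A minimal sketch of the Two-Level Policy, using a UCB1-style rule at both levels and the MEAN estimate (sum of successes over sum of pulls, defined on slide 13) as the cluster arm's reward probability; the Bernoulli environment, arm names, and all numbers are hypothetical.

```python
import math
import random

def ucb_pick(successes, pulls, t):
    """UCB1-style choice over parallel lists of success/pull counters."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i                                    # try everything once
    return max(range(len(pulls)),
               key=lambda i: successes[i] / pulls[i]
                             + math.sqrt(2 * math.log(t) / pulls[i]))

def two_level_policy(clusters, pull, horizon):
    """Level 1: pick a cluster arm with a bandit policy, where the cluster
    arm's reward probability is the MEAN estimate sum(s_i)/sum(n_i) over its
    arms.  Level 2: pick an arm inside the chosen cluster with a bandit policy."""
    s = [[0] * len(arms) for arms in clusters]
    n = [[0] * len(arms) for arms in clusters]
    total = 0
    for t in range(1, horizon + 1):
        cluster_s = [sum(row) for row in s]             # cluster-arm successes
        cluster_n = [sum(row) for row in n]             # cluster-arm pulls
        c = ucb_pick(cluster_s, cluster_n, t)           # level 1: cluster arm
        i = ucb_pick(s[c], n[c], t)                     # level 2: arm in cluster c
        r = pull(clusters[c][i])                        # observe a 0/1 reward
        s[c][i] += r
        n[c][i] += 1
        total += r
    return total

# Hypothetical environment: Bernoulli arms grouped into two clusters of ads.
mus = {"ski1": 0.30, "ski2": 0.28, "snowshoe": 0.31, "vonage": 1e-6}
clusters = [["ski1", "ski2", "snowshoe"], ["vonage"]]
print(two_level_policy(clusters, lambda a: int(random.random() < mus[a]), 20000))
```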
12
Issues
  • What is the reward probability of a cluster
    arm?
  • How do cluster characteristics affect performance?

13
Reward probability of a cluster arm
  • What is the reward probability r of a cluster
    arm?
  • MEAN: r = Σ si / Σ ni, i.e., the average success
    rate, summing over all arms in the cluster
    [Kocsis 2006, Pandey 2007]
  • Initially, r ≈ µavg, the average µ of the arms in the
    cluster
  • Finally, r ≈ µmax, the largest µ among the arms in the
    cluster
  • Drift in the reward probability of the cluster
    arm

14
Reward probability drift causes problems
[Figure: two clusters of arms; the best (optimal) arm, with reward
probability µopt, lies in cluster 2 (the opt cluster)]
  • Drift ⇒ non-optimal clusters might temporarily
    look better ⇒ the optimal arm is explored only O(log
    T) times

15
Reward probability of a cluster arm
  • What is the reward probability r of a cluster
    arm?
  • MEAN: r = Σ si / Σ ni
  • MAX: r = max(E[µi])
  • PMAX: r = E[max(µi)]
    (max and expectation taken over all arms i in the cluster)
  • Both MAX and PMAX aim to estimate µmax and thus
    reduce drift
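A minimal sketch of the three cluster-arm estimates, assuming a Beta(1,1) prior per arm so that E[µi] is a posterior mean and E[max µi] can be approximated by Monte Carlo sampling; these modeling choices are illustrative, the slides only define the quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_arm_estimates(successes, pulls, n_samples=10_000):
    """MEAN, MAX and PMAX estimates of a cluster arm's reward probability,
    from the per-arm success/pull counts of the arms inside the cluster."""
    s = np.asarray(successes, dtype=float)
    n = np.asarray(pulls, dtype=float)

    mean = s.sum() / max(n.sum(), 1.0)              # MEAN: sum(s_i) / sum(n_i)
    post_mean = (s + 1) / (n + 2)                   # E[mu_i] under a Beta(1,1) prior
    mx = post_mean.max()                            # MAX:  max_i E[mu_i]
    draws = rng.beta(s + 1, n - s + 1, size=(n_samples, len(s)))
    pmax = draws.max(axis=1).mean()                 # PMAX: E[max_i mu_i], Monte Carlo
    return mean, mx, pmax

print(cluster_arm_estimates(successes=[3, 1, 0], pulls=[10, 4, 2]))
```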
16
Reward probability of a cluster arm
  • MEAN: r = Σ si / Σ ni
  • MAX: r = max(E[µi])
  • PMAX: r = E[max(µi)]
  • Both MAX and PMAX aim to estimate µmax and thus
    reduce drift

         Bias in estimating µmax    Variance of estimator
  MAX    High (biased)              Low
  PMAX   Unbiased                   High
17
Comparison of schemes
  • 10 clusters, with 11.3 arms/cluster on average

MAX performs best
18
Issues
  • What is the reward probability of a cluster
    arm?
  • How do cluster characteristics affect performance?

19
Effects of cluster characteristics
  • We analytically study the effects of cluster
    characteristics on the crossover time
  • Crossover time: the time when the expected reward
    probability of the optimal cluster becomes the
    highest among all cluster arms

20
Effects of cluster characteristics
  • The crossover time Tc for MEAN depends on:
  • Cluster separation Δ = µopt - (max µ outside the opt
    cluster): Δ increases ⇒ Tc decreases
  • Cluster size Aopt: Aopt increases ⇒ Tc
    increases
  • Cohesiveness of the opt cluster, 1 - avg(µopt - µi):
    cohesiveness increases ⇒ Tc decreases
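A minimal sketch of how the crossover time Tc could be measured from a trace of per-cluster-arm estimates over time; the trace format and the rule that the lead must be kept once gained are assumptions for illustration.

```python
def crossover_time(estimates, opt_cluster):
    """First time step after which the optimal cluster arm's estimated reward
    probability is, and stays, the highest among all cluster arms.
    `estimates[t][c]` is cluster arm c's estimate (e.g., MEAN) at time t."""
    t_c = None
    for t, est in enumerate(estimates):
        if est[opt_cluster] >= max(est):
            if t_c is None:
                t_c = t                  # candidate crossover time
        else:
            t_c = None                   # lead was lost; reset the candidate
    return t_c

# Toy trace with two cluster arms; cluster 1 is optimal and crosses over at t=2.
trace = [[0.30, 0.20], [0.29, 0.27], [0.25, 0.28], [0.24, 0.31]]
print(crossover_time(trace, opt_cluster=1))   # -> 2
```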

21
Experiments (effect of separation)
Δ increases ⇒ Tc decreases ⇒ higher reward
22
Experiments (effect of size)
Aopt increases ⇒ Tc increases ⇒ lower reward
23
Experiments (effect of cohesiveness)
Cohesiveness increases ⇒ Tc decreases ⇒ higher
reward
24
Related Work
  • Typical multi-armed bandit problems
  • Do not consider dependencies
  • Very few arms
  • Bandits with side information
  • Cannot handle dependencies among arms
  • Active learning
  • Emphasis on the number of examples required to achieve
    a given prediction accuracy

25
Conclusions
  • We analyze bandits where dependencies are
    encapsulated within clusters
  • Discounted Reward ⇒ the optimal policy is an
    index scheme over the clusters
  • Undiscounted Reward
  • Two-level Policy with MEAN, MAX, and PMAX
  • Analysis of the effect of cluster characteristics
    on performance, for MEAN

26
Discounted Reward
[Figure: four arms (1-4) and their belief states (x1, x2, x3, x4)]
  • Create a belief-state MDP
  • Each state contains the estimated reward
    probabilities for all arms
  • Solve for the optimal policy
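A minimal sketch of such a belief-state MDP over all arms, with a finite-horizon value recursion; treating each arm's belief as an independent Beta(s+1, n-s+1) posterior is an assumption for illustration. The per-cluster decomposition of slides 7-8 avoids working with this joint state.

```python
from functools import lru_cache

GAMMA = 0.9   # discount factor a (illustrative value)

@lru_cache(maxsize=None)
def value(state, depth):
    """Finite-horizon value of the joint belief-state MDP.  `state` holds one
    (successes, pulls) pair per arm; pulling arm i yields reward 1 with its
    posterior-mean probability and moves the belief to the updated counts."""
    if depth == 0:
        return 0.0
    best = 0.0
    for i, (s, n) in enumerate(state):
        p = (s + 1) / (n + 2)                               # posterior mean of mu_i
        win = state[:i] + ((s + 1, n + 1),) + state[i + 1:]
        lose = state[:i] + ((s, n + 1),) + state[i + 1:]
        best = max(best, p * (1 + GAMMA * value(win, depth - 1))
                         + (1 - p) * GAMMA * value(lose, depth - 1))
    return best

# Four arms, all starting with no observations; the joint state space grows
# quickly with the number of arms, which motivates the per-cluster MDPs.
print(value(((0, 0), (0, 0), (0, 0), (0, 0)), depth=4))
```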

27
Background: Bandits
[Figure: bandit arms]
Regret = optimal payoff - actual payoff
28
Reward probability of a cluster arm
  • What is the reward probability of a cluster
    arm?
  • Eventually, every cluster arm's estimate must converge to
    µmax, the reward probability of the best arm within that cluster
  • since a bandit policy is used within each cluster
  • However, drift causes problems

29
Experiments
  • Simulation based on one week's worth of data from
    a large-scale ad-matching application
  • 10 clusters, with 11.3 arms/cluster on average

30
Comparison of schemes
  • 10 clusters, with 11.3 arms/cluster on average
  • Cluster separation Δ = 0.08
  • Cluster size Aopt = 31
  • Cohesiveness = 0.75

MAX performs best
31
Reward probability drift causes problems
[Figure: two clusters of arms; the best (optimal) arm, with reward
probability µopt, lies in cluster 2 (the opt cluster)]
  • Intuitively, to reduce regret, we must
  • Quickly converge to the optimal cluster arm
  • and then to the best arm within that cluster