Transcript and Presenter's Notes

Title: Exploration Scavenging


1
Exploration Scavenging
  • Alex Strehl
  • Joint work with John Langford and Jenn Wortman.
  • Yahoo! Research and the University of Pennsylvania

2
Offline Policy Learning Problem
First month's policy
Second month's policy
Can we find a better policy given the click logs from
each policy?
3
Formalization
Input (query) x ∈ X chosen from an input distribution.
Choose an action (ad) a.
Receive payoff (click) r ∈ {0, 1}, an unknown and noisy
function of x and a.
Offline policy evaluation (or maximization): given data
from an old policy, how do we evaluate a new
deterministic policy h?
Potential applications to medical treatments,
robotics, etc.
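A minimal sketch of this setting in Python (the type names and example log entries are illustrative, not from the slides): each logged round pairs an input with the displayed action and the observed click, and a policy is any deterministic map from inputs to actions.

from typing import Callable, Hashable, List, Tuple

# One logged round: (x, a, r) = (input/query, displayed action/ad, observed 0/1 click).
Input = Hashable
Action = Hashable
LoggedRound = Tuple[Input, Action, float]
# A deterministic policy h maps each input to an action.
Policy = Callable[[Input], Action]

# Hypothetical click log produced by the old policy.
log: List[LoggedRound] = [
    ("query-1", "ad-A", 1.0),  # clicked
    ("query-2", "ad-B", 0.0),  # not clicked
    ("query-1", "ad-B", 0.0),  # not clicked
]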
4
Importance-weighting approach
Basic observation: only the reward of the displayed ad
is observed, so we can't use supervised learning.
One approach to consider is importance sampling, which
relies on the logging policy being explicitly randomized
(Auer et al. 1997; Precup, Sutton, Singh 2000). The key
observation is that reweighting each observed reward by the
inverse probability the logging policy assigned to the
displayed action gives an unbiased estimate of a new policy's value.
Unfortunate problem: the logging policies are not
randomized. What can we do?
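For contrast, here is a minimal sketch of the importance-sampling estimator referred to above, assuming the logging policy's probability of each displayed action had been recorded, which is exactly what is unavailable here; the names and data layout are illustrative.

def ips_estimate(log, h):
    # log: list of (x, a, r, p), where p is the probability the logging policy
    # assigned to showing action a on input x; h is a deterministic new policy.
    total = 0.0
    for x, a, r, p in log:
        if h(x) == a:          # keep only rounds where h agrees with the logged action
            total += r / p     # inverse-propensity reweighting
    return total / len(log)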
5
Outline
  • Part 1: Theory
  • Policy Estimation and Main Result.
  • Choosing among Multiple Policies.
  • Impossibility Result.
  • Part 2: Application to Web advertising
  • Dealing with slates of advertisements.
  • Proof-of-concept experiments on real data.

6
Estimator
  • From data (x1, a1, r1), (x2, a2, r2), ..., (xm, am, rm) logged by the old policy,
  • we form the estimator of a new policy h's value by reweighting each observed reward rt by the empirical frequency of the displayed action at, counting only rounds on which h(xt) = at (a sketch follows below).

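A minimal sketch of an estimator in this spirit, assuming, as the appendix slide suggests, that the unknown logging probabilities are replaced by the empirical frequencies of the actions in the log; the exact expression on the original slide is not reproduced here.

from collections import Counter

def scavenging_estimate(log, h):
    # log: list of (x, a, r); h: deterministic new policy mapping x to an action.
    m = len(log)
    freq = Counter(a for _, a, _ in log)       # empirical count of each displayed action
    total = 0.0
    for x, a, r in log:
        if h(x) == a:
            total += r / (freq[a] / m)         # reweight by the empirical action frequency
    return total / m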
7
Multiple Policies
The current policy changes over time. Denote the set of
historical policies π1, ..., πK. Redefining the action
space to be this set of policies, we evaluate new policies
that, on each input, follow one of the historical policies.
8
Impossibility Example
The theorem is false if actions are allowed to depend on
the input. Two inputs, 0 and 1. Two actions, 0 and 1.
The old policy π(x) = x is deterministic. We cannot
evaluate the new policy h(x) = 1 - x: the action 1 - x is
never displayed on input x, so its reward is never observed.
9
Overcoming Determinism
The old policy cycles through the actions
a1, a2, a3, a1, a2, a3, ... Since the decision at each
time is deterministic, importance sampling does not
apply. Since the decision is independent of the input,
our result holds. The result fundamentally depends on the
relationship between input and reward staying fixed over time.
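To illustrate with the cycling example: every action's empirical frequency is 1/3, so the frequency-based sketch above is well defined even though each individual decision was deterministic (the toy data below are invented).

# Nine rounds in which the old policy cycles a1, a2, a3; clicks occur only on a2 here.
cyclic_log = [
    (t, ("a1", "a2", "a3")[t % 3], float(t % 3 == 1))
    for t in range(9)
]
# Evaluate the new policy that always shows a2.
print(scavenging_estimate(cyclic_log, h=lambda x: "a2"))  # 1.0 on this toy log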
10
Outline
  • Part 1: Theory
  • Policy Estimation and Main Result.
  • Choosing among Multiple Policies.
  • Part 2: Application to Web advertising
  • Dealing with slates of advertisements.
  • Proof-of-concept experiments on real data.

11
Internet Advertising Application
Input xt: web page. Action at: slate of advertisements
shown. Reward rt: clicks (or revenue).
Due to the large number of actions, the accuracy of our
estimator is very poor. How do we deal with slates of actions?
12
Attention Decay Coefficients
We make the standard factoring assumption
(Borgs et al. 2007; Lahaie & Pennock 2007):
Prob(click | page p, ad a, position i) = Ci · Prob(click | p, a).
We have developed a way to estimate the attention decay
coefficients Ci that is particularly robust to
low-probability events. We use exploration scavenging to
evaluate policies that reorder the slate chosen by the
current system.
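As a rough illustration of the factoring assumption only (and not the paper's robust estimation method), one could estimate the coefficients Ci by averaging clicks per position and normalizing by the top position. This naive version ignores the fact that better ads tend to occupy higher positions, which is the kind of bias a robust estimator must handle; all names below are illustrative.

from collections import defaultdict

def naive_position_coefficients(slate_log):
    # slate_log: list of (page, slate, clicks), where slate is a list of ads and
    # clicks is a parallel list of 0/1 click indicators, one per position.
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _page, slate, clicks in slate_log:
        for i, c in enumerate(clicks):
            shown[i] += 1
            clicked[i] += c
    ctr = {i: clicked[i] / shown[i] for i in shown}   # per-position click-through rate
    return {i: ctr[i] / ctr[0] for i in ctr}          # normalize so the top position has C = 1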
13
Empirical Results
  • Estimating the Attention Decay Coefficients

[Plot: estimated attention decay coefficients Cp versus position p]
14
Evaluation on Yahoo!'s data set.
15
Conclusion
This is the first provably sound policy evaluation and
selection method that uses offline data with many features
and a deterministic logging policy. With an additional
assumption, it also provides value estimates for
slate-reordering policies in online advertising.
16
  • Thanks for Listening

17
Overcoming Determinism
Given data (x1, a1, r1), (x2, a2, r2), ..., (xm, am, rm),
where ai is chosen by a time-dependent and
history-dependent function fi of xi, the logged choices
are equivalent to draws from a stochastic function F that
randomizes over the fi. As long as each fi doesn't
depend on xi, estimating F by the empirical estimates of
Pr(action) is legitimate. The result fundamentally depends
on the relationship between input and reward staying fixed over time.