Title: Exploration Scavenging
1 Exploration Scavenging
- Alex Strehl
- Joint work with John Langford and Jenn Wortman.
- Yahoo! Research and University of Pennsylvania
2 Offline Policy Learning Problem
First month's policy
Second month's policy
Can we find a better policy given click logs from
each policy?
3 Formalization
Input (query) x ∈ X chosen from an input distribution.
Choose an action (ad) a.
Receive payoff (click) r ∈ {0, 1}, an unknown and noisy function of x and a.
Offline policy evaluation (or maximization): given data from an old policy, how do we evaluate a new deterministic policy h?
Potential applications to medical treatments, robotics, etc.
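A minimal sketch of this setting (the names and the simulator below are illustrative, not from the talk): only the reward of the action actually taken ends up in the log.

```python
from typing import Callable, List, Tuple

# One logged interaction: input x (query), action a (ad shown), reward r (click).
LogRecord = Tuple[int, int, float]
Policy = Callable[[int], int]   # a deterministic policy mapping inputs to actions

def collect_log(policy: Policy,
                draw_input: Callable[[], int],
                draw_reward: Callable[[int, int], float],
                m: int) -> List[LogRecord]:
    """Simulate the logging process: for each input, only the reward of the
    single action the logging policy chose is observed, which is what makes
    offline evaluation of a different policy hard."""
    log = []
    for _ in range(m):
        x = draw_input()
        a = policy(x)
        log.append((x, a, draw_reward(x, a)))
    return log
```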
4 Importance-weighting approach
Basic observation: Only the reward of the displayed ad is observed, so we can't use supervised learning. One approach to consider is importance sampling. It relies on the logging policy being explicitly randomized (Auer et al. 1997; Precup, Sutton, Singh 2000). The key observation is that E[ r · 1{h(x) = a} / p(a | x) ] equals the true value of h, where p(a | x) is the probability with which the logging policy chose action a on input x.
Unfortunate problem: The logging policies are not randomized. What can we do?
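A rough sketch of that key observation in code, assuming the logging probabilities p(a | x) are known (logging_prob and the other names are illustrative):

```python
from typing import Callable, List, Tuple

def importance_sampling_value(log: List[Tuple[int, int, float]],
                              h: Callable[[int], int],
                              logging_prob: Callable[[int, int], float]) -> float:
    """Standard importance-sampling estimate of the value of deterministic
    policy h: reweight each logged reward by 1 / Pr[logging policy chose a on x].
    Unbiased only when the logging policy is explicitly randomized and these
    probabilities are known and positive for every action h might take."""
    total = 0.0
    for x, a, r in log:
        if h(x) == a:                        # only matching actions contribute
            total += r / logging_prob(x, a)
    return total / len(log)
```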
5 Outline
- Part 1: Theory
  - Policy Estimation and Main Result.
  - Choosing among Multiple Policies.
  - Impossibility Result.
- Part 2: Application to Web advertising
  - Dealing with slates of advertisements.
  - Proof-of-concept experiments on real data.
6 Estimator
- From data (x1, a1, r1), (x2, a2, r2), ..., (xm, am, rm),
- we form the estimator
  V̂(h) = (1/m) Σ_t r_t · 1{h(x_t) = a_t} / p̂(a_t),  where  p̂(a) = (1/m) Σ_t 1{a_t = a}
  is the empirical frequency with which action a appears in the log.
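A sketch of the estimator in code, with the unknown logging probabilities replaced by each action's empirical frequency in the log, as in the formula above:

```python
from collections import Counter
from typing import Callable, List, Tuple

def scavenging_value(log: List[Tuple[int, int, float]],
                     h: Callable[[int], int]) -> float:
    """Exploration-scavenging estimate of the value of deterministic policy h.
    The importance weight uses the empirical fraction of log entries in which
    each action was displayed; this is sound when the logging policy's choices
    do not depend on the input (see the impossibility example)."""
    m = len(log)
    counts = Counter(a for _, a, _ in log)   # how often each action was shown
    total = 0.0
    for x, a, r in log:
        if h(x) == a:
            total += r / (counts[a] / m)     # r divided by empirical Pr(a)
    return total / m
```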
7 Multiple Policies
The current policy changes over time. Denote the set of historical policies {π1, ..., πK}. Redefining the action space to be this set of policies, we evaluate policies of the form h(x) = π_{g(x)}(x), i.e., policies that choose, for each input, which historical policy to follow.
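A sketch of the multiple-policy case under one reading of this slide: each log entry is tagged with the index of the historical policy that produced it (that tagging is an assumption of the sketch), and the evaluated policy picks which historical policy to follow on each input.

```python
from collections import Counter
from typing import Callable, List, Tuple

def scavenging_value_over_policies(log: List[Tuple[int, int, float]],
                                   selector: Callable[[int], int]) -> float:
    """Evaluate a policy of the form h(x) = pi_{selector(x)}(x), where the
    "action" is now which historical policy to follow.
    log: list of (x, k, r), with k the index of the historical policy that
    was in control when (x, r) was logged."""
    m = len(log)
    counts = Counter(k for _, k, _ in log)   # how often each policy was in control
    total = 0.0
    for x, k, r in log:
        if selector(x) == k:
            total += r / (counts[k] / m)
    return total / m
```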
8 Impossibility Example
The theorem is false if the logged actions are allowed to depend on the input. 2 inputs, 0 and 1. 2 actions, 0 and 1. The old policy π(x) = x is deterministic. We cannot evaluate the new policy h(x) = 1 − x, because the action it would take is never displayed for any input.
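A tiny numerical illustration of the failure (the click model here is hypothetical, chosen only to make the point):

```python
import random

def reward(x: int, a: int) -> float:
    # Hypothetical click model: clicks are more likely when the ad matches the query.
    return 1.0 if random.random() < (0.2 + 0.6 * (x == a)) else 0.0

inputs = [random.randint(0, 1) for _ in range(10_000)]
log = [(x, x, reward(x, x)) for x in inputs]   # old policy pi(x) = x

# The new policy h(x) = 1 - x never agrees with the logged action, so the log
# contains zero samples of the (x, a) pairs h would produce; no estimator can
# recover h's value from this data.
relevant = [(x, a, r) for (x, a, r) in log if a == 1 - x]
print(len(relevant))                           # prints 0
```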
9 Overcoming Determinism
The old policy cycles through the actions a1, a2, a3, a1, a2, a3, ... Since the decision at each time step is deterministic, importance sampling does not apply. Since the decision is independent of the input, our result holds. The result fundamentally depends on a fixed relationship between input and reward.
10 Outline
- Part 1: Theory
  - Policy Estimation and Main Result.
  - Choosing among Multiple Policies.
- Part 2: Application to Web advertising
  - Dealing with slates of advertisements.
  - Proof-of-concept experiments on real data.
11 Internet Advertising Application
Input x_t: web page. Action a_t: slate of advertisements shown. Reward r_t: clicks (or revenue). Due to the large number of actions, the accuracy of our estimator is very poor. How do we deal with slates of actions?
12 Attention Decay Coefficients
We make the standard factoring assumption (Borgs et al. 2007; Lahaie & Pennock 2007): Prob[click | p, a, position i] = C_i · Prob[click | p, a]. We have developed a particular way to estimate the attention decay coefficients C_i that is particularly robust to low-probability events. We use exploration scavenging to evaluate policies that reorder the slate chosen by the current system.
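As an illustration of what the coefficients mean, a naive per-position click-through-rate ratio is sketched below; this is not the robust estimator mentioned in the talk, just a sketch under the factoring assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def naive_position_coefficients(
        slate_log: List[Tuple[object, Sequence[int], Sequence[int]]]) -> Dict[int, float]:
    """Estimate C_i as the click-through rate at position i divided by the
    click-through rate at the top position, under the factoring assumption
    Prob[click | p, a, position i] = C_i * Prob[click | p, a].
    slate_log: list of (page, slate, clicks) with clicks[i] in {0, 1}."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _page, slate, clicks in slate_log:
        for i in range(len(slate)):
            shown[i] += 1
            clicked[i] += clicks[i]
    ctr = {i: clicked[i] / shown[i] for i in shown}
    base = ctr.get(0) or 1.0                 # normalize so the top position has C_0 = 1
    return {i: ctr[i] / base for i in ctr}
```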
13 Empirical Results
- Estimating the Attention Decay Coefficients
[Figure: estimated attention decay coefficients C_p plotted against position p]
14 Evaluation on Yahoo!'s data set.
15 Conclusion
This is the first provably sound policy evaluation and selection method using offline data with many features and a deterministic logging policy. With an additional assumption, it provides estimates of reordering policies in online advertising.
17 Overcoming Determinism
Given data (x1, a1, r1), (x2, a2, r2), ..., (xm, am, rm), where a_i is chosen by a time-dependent and history-dependent function f_i of x_i, the logging process is equivalent to a stochastic function F that randomizes over the f_i. As long as each f_i doesn't depend on x_i, estimating F by empirical estimates of Pr(action) is legitimate. The result fundamentally depends on a fixed relationship between input and reward.
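A small sketch of this equivalence: a cycling, input-independent policy behaves, over the whole log, like a stochastic policy F whose randomization probabilities are the empirical action frequencies.

```python
from collections import Counter

# A policy that cycles a1, a2, a3, ... deterministically in time but ignores the
# input looks, over the whole log, like a stochastic policy F with Pr(action)
# equal to each action's empirical frequency.
actions = [0, 1, 2]
log_actions = [actions[t % len(actions)] for t in range(9_999)]
counts = Counter(log_actions)
empirical_F = {a: counts[a] / len(log_actions) for a in actions}
print(empirical_F)   # each weight is 1/3; these frequencies stand in for the
                     # unknown randomization probabilities in the estimator
```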