Title: Learning to Rank (part 1)
Learning to Rank (part 1)
- NESCAI 2008 Tutorial
- Yisong Yue
- Cornell University
Booming Search Industry
Goals for this Tutorial
- Basics of information retrieval
- What machine learning contributes
- New challenges to address
- New insights on developing ML algorithms
(Soft) Prerequisites
- Basic knowledge of ML algorithms
- Support Vector Machines
- Neural Nets
- Decision Trees
- Boosting
- Etc
- Will introduce IR concepts as needed
Outline (Part 1)
- Conventional IR Methods (no learning)
- 1970s to 1990s
- Ordinal Regression
- 1994 onwards
- Optimizing Rank-Based Measures
- 2005 to present
Outline (Part 2)
- Effectively collecting training data
- E.g., interpreting clickthrough data
- Beyond independent relevance
- E.g., diversity
- Summary Discussion
Disclaimer
- This talk is very ML-centric
- Use IR methods to generate features
- Learn good ranking functions on feature space
- Focus on optimizing cleanly formulated objectives
- Outperform traditional IR methods
- Information Retrieval
- Broader than the scope of this talk
- Deals with more sophisticated modeling questions
- Will see more interplay between IR and ML in Part 2
Brief Overview of IR
- Predated the internet
- "As We May Think" by Vannevar Bush (1945)
- Active research topic by the 1960s
- Vector Space Model (1970s)
- Probabilistic Models (1980s)
- Introduction to Information Retrieval (2008)
- C. Manning, P. Raghavan, & H. Schütze
Basic Approach to IR
- Given query q and set of docs d1, …, dn
- Find documents relevant to q
- Typically expressed as a ranking on d1, …, dn
- Similarity measure sim(a,b) → R
- Sort by sim(q,di)
- Optimal if the relevances of documents are independent [Robertson, 1977]
Vector Space Model
- Represent documents as vectors
- One dimension for each word
- Queries as short documents
- Similarity Measures
- Cosine similarity: normalized dot product
Cosine Similarity Example
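The worked example on this slide did not survive extraction. Below is a minimal sketch of cosine-similarity ranking over raw term counts (no tf-idf weighting; the query and documents are illustrative):

```python
from collections import Counter
import math

def cosine_sim(a_tokens, b_tokens):
    """Cosine similarity between two bags of words (normalized dot product)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm > 0 else 0.0

query = "learning to rank".split()
docs = ["machine learning methods to rank documents".split(),
        "a history of the search industry".split()]
# Sort documents by similarity to the query, highest first
ranking = sorted(docs, key=lambda d: cosine_sim(query, d), reverse=True)
```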
Other Methods
- TF-IDF
- Salton & Buckley, 1988
- Okapi BM25
- Robertson et al., 1995
- Language Models
- Ponte & Croft, 1998
- Zhai & Lafferty, 2001
Machine Learning
- IR uses fixed models to define similarity scores
- Many opportunities to learn models
- Appropriate training data
- Appropriate learning formulation
- Will mostly use SVM formulations as examples
- General insights are applicable to other techniques
Training Data
- Supervised learning problem
- Document/query pairs
- Embedded in high dimensional feature space
- Labeled by relevance of doc to query
- Traditionally 0/1
- Recently ordinal classes of relevance (0, 1, 2, 3, …)
Feature Space
- Used to learn a similarity/compatibility function
- Based on existing IR methods
- Can use raw values
- Or transformations of raw values
- Based on raw words
- Capture co-occurrence of words
Training Instances
Learning Problem
- Given training instances
- (xq,d, yq,d) for q = 1..N, d = 1..Nq
- Learn a ranking function
- f(xq,1, …, xq,Nq) → Ranking
- Typically decomposed into per-doc scores
- f(x) → R (doc/query compatibility)
- Sort by scores for all instances of a given q
How to Train?
- Classification / Regression
- Learn f(x) → R in conventional ways
- Sort by f(x) for all docs for a query
- Typically does not work well
- 2 Major Problems
- Labels have ordering
- Additional structure compared to multiclass problems
- Severe class imbalance
- Most documents are not relevant
(Figure: example instances labeled Not Relevant, Somewhat Relevant, Very Relevant)
Conventional multiclass learning does not incorporate the ordinal structure of class labels
Ordinal Regression
- Assume class labels are ordered
- True since class labels indicate level of relevance
- Learn hypothesis function f(x) → R
- Such that the ordering of f(x) agrees with label ordering
- Ex: given instances (x, 1), (y, 1), (z, 2)
- f(x) < f(z)
- f(y) < f(z)
- Don't care about f(x) vs f(y)
Ordinal Regression
- Compare with classification
- Similar to multiclass prediction
- But classes have ordinal structure
- Compare with regression
- Doesn't necessarily care about the value of f(x)
- Only cares that ordering is preserved
Ordinal Regression Approaches
- Learn multiple thresholds
- Learn multiple classifiers
- Optimize pairwise preferences
Option 1: Multiple Thresholds
- Maintain T thresholds (b1, …, bT)
- b1 < b2 < … < bT
- Learn model parameters and thresholds (b1, …, bT)
- Goal
- Model predicts a score on input example
- Minimize threshold violations of predictions
Ordinal SVM Example
[Chu & Keerthi, 2005]
Ordinal SVM Formulation
[Chu & Keerthi, 2005]
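The objective and constraints on this slide were lost in extraction. A standard threshold-based ordinal SVM in the spirit of Chu & Keerthi (2005), sketched with one slack per side and sentinel thresholds (b0 = −∞), looks roughly like:

```latex
\min_{w,\,b,\,\xi,\,\xi^*}\;\; \frac{1}{2}\|w\|^2 + C \sum_i \bigl(\xi_i + \xi_i^*\bigr)
\qquad \text{s.t.}\quad
\begin{aligned}
  & w^\top x_i \;\le\; b_{y_i} - 1 + \xi_i
    &&\text{(score below the upper threshold of class } y_i\text{)}\\
  & w^\top x_i \;\ge\; b_{y_i - 1} + 1 - \xi_i^*
    &&\text{(score above the lower threshold)}\\
  & b_1 \le b_2 \le \cdots \le b_T, \qquad \xi_i,\,\xi_i^* \ge 0
\end{aligned}
```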
Learning Multiple Thresholds
- Gaussian Processes
- Chu & Ghahramani, 2005
- Decision Trees
- Kramer et al., 2001
- Neural Nets
- RankProp [Caruana et al., 1996]
- SVMs & Perceptrons
- PRank [Crammer & Singer, 2001]
- Chu & Keerthi, 2005
Option 2: Voting Classifiers
- Use T different training sets
- Classifier 1 predicts 0 vs 1,2,…,T
- Classifier 2 predicts 0,1 vs 2,3,…,T
- …
- Classifier T predicts 0,1,…,T−1 vs T
- Final prediction is a combination
- E.g., sum of predictions
- Recent work
- McRank [Li et al., 2007]
- Qin et al., 2007
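A minimal sketch of this reduction, using scikit-learn's LogisticRegression as a stand-in for an arbitrary base classifier (the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_voting_classifiers(X, y, T):
    """Train T binary classifiers; classifier t predicts whether label > t."""
    return [LogisticRegression().fit(X, (y > t).astype(int)) for t in range(T)]

def predict_ordinal(models, X):
    """Sum the binary votes; the result is an integer score in 0..T."""
    return sum(m.predict(X) for m in models)

# Toy data: 2-d features, ordinal labels 0, 1, 2 driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])   # labels in {0, 1, 2}
models = fit_voting_classifiers(X, y, T=2)
scores = predict_ordinal(models, X)          # sort docs by this score
```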
- Problem: severe class imbalance
- A classifier can achieve near-perfect accuracy by always predicting 0
Option 3: Pairwise Preferences
- Most popular approach for IR applications
- Learn model to minimize pairwise disagreements
- Fraction of pairwise agreements = ROC-Area
Optimizing Pairwise Preferences
- Consider instances (x1, y1) and (x2, y2)
- Label order has y1 > y2
- Create new training instance
- (x', +1) where x' = (x1 − x2)
- Repeat for all instance pairs with a label order preference
Optimizing Pairwise Preferences
- Result: a new training set!
- Often represented implicitly
- Has only positive examples
- Mispredicting means that a lower-ordered instance received a higher score than a higher-ordered instance
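A minimal sketch of the pairwise transformation (real implementations usually keep these pairs implicit rather than materializing all O(n²) difference vectors):

```python
import numpy as np

def pairwise_transform(X, y):
    """For every pair with y[i] > y[j], emit one positive example x_i - x_j."""
    diffs = [X[i] - X[j]
             for i in range(len(y))
             for j in range(len(y))
             if y[i] > y[j]]
    return np.array(diffs), np.ones(len(diffs))

# Toy example: 3 docs with ordinal labels
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([2, 1, 0])
X_pairs, y_pairs = pairwise_transform(X, y)   # 3 pairs: (0,1), (0,2), (1,2)
```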
Pairwise SVM Formulation
[Herbrich et al., 1999]
Training can be reduced to O(n log n) time [Joachims, 2005]
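The equations on this slide were lost in extraction; the standard pairwise (RankSVM-style) formulation is:

```latex
\min_{w,\,\xi}\;\; \frac{1}{2}\|w\|^2 + C \sum_{(i,j):\, y_i > y_j} \xi_{ij}
\qquad \text{s.t.}\quad
w^\top x_i \;\ge\; w^\top x_j + 1 - \xi_{ij},
\quad \xi_{ij} \ge 0
\quad \text{for all pairs with } y_i > y_j
```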
Optimizing Pairwise Preferences
- Neural Nets
- RankNet [Burges et al., 2005]
- Boosting & Hedge-Style Methods
- Cohen et al., 1998
- RankBoost [Freund et al., 2003]
- Long & Servedio, 2007
- SVMs
- Herbrich et al., 1999
- SVM-perf [Joachims, 2005]
- Cao et al., 2006
Rank-Based Measures
- Pairwise preferences are not quite right
- Assigns equal penalty for errors no matter where they occur in the ranking
- People (mostly) care about the top of the ranking
- The IR community uses rank-based measures which capture this property
Rank-Based Measures
- Binary relevance
- Precision_at_K (P_at_K)
- Mean Average Precision (MAP)
- Mean Reciprocal Rank (MRR)
- Multiple levels of relevance
- Normalized Discounted Cumulative Gain (NDCG)
Precision_at_K
- Set a rank threshold K
- Compute the fraction of relevant docs in the top K
- Ignores documents ranked lower than K
- Ex:
- Prec_at_3 of 2/3
- Prec_at_4 of 2/4
- Prec_at_5 of 3/5
Mean Average Precision
- Consider the rank position of each relevant doc
- K1, K2, …, KR
- Compute Precision_at_K for each K1, K2, …, KR
- Average Precision = average of those P_at_K values
- Ex: AvgPrec = (1/R) · (P_at_K1 + … + P_at_KR)
- MAP is Average Precision averaged across multiple queries/rankings
Mean Reciprocal Rank
- Consider the rank position, K, of the first relevant doc
- Reciprocal Rank score = 1/K
- MRR is the mean RR across multiple queries
NDCG
- Normalized Discounted Cumulative Gain
- Multiple levels of relevance
- DCG: each rank position i contributes a relevance gain discounted by rank
- A common form: DCG = Σi (2^rel_i − 1) / log2(i + 1)
- NDCG is normalized DCG
- the best possible ranking has score NDCG = 1
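A compact reference implementation of these measures. Note that several gain/discount variants of DCG are in use; the (2^rel − 1) / log2(i + 1) form below is one common choice:

```python
import math

def precision_at_k(rels, k):
    """rels: 0/1 relevance labels in ranked order."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Average of P_at_K over the rank positions K of the relevant docs."""
    hits, precs = 0, []
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precs.append(hits / i)
    return sum(precs) / len(precs) if precs else 0.0

def reciprocal_rank(rels):
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

def dcg(gains):
    """gains: graded relevance labels in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

rels = [0, 1, 1, 0, 1]
print(precision_at_k(rels, 3), average_precision(rels), reciprocal_rank(rels))
print(ndcg([0, 2, 1, 0, 3]))
```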
Optimizing Rank-Based Measures
- Let's directly optimize these measures
- As opposed to some proxy (pairwise prefs)
- But…
- Objective function no longer decomposes
- Pairwise prefs decomposed into each pair
- Objective function is flat or discontinuous
Discontinuity Example
- NDCG computed using rank positions
- Ranking via retrieval scores
- Slight changes to model parameters
- Slight changes to retrieval scores
- No change to ranking
- No change to NDCG
NDCG is discontinuous w.r.t. model parameters!

                 D1    D2    D3
Retrieval Score  0.9   0.6   0.3
Rank             1     2     3
Relevance        0     1     0
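A tiny numeric check of this flatness, reusing the ndcg helper sketched earlier:

```python
# Relevance labels from the table above (D1, D2, D3)
rels = [0, 1, 0]

def ndcg_of_scores(scores, rels):
    # Sort docs by descending score, then evaluate NDCG on that ordering
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ndcg([rels[i] for i in order])

print(ndcg_of_scores([0.9, 0.6, 0.3], rels))    # baseline ranking
print(ndcg_of_scores([0.89, 0.61, 0.3], rels))  # perturbed scores: same ranking,
                                                # same NDCG, so the gradient is zero
```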
(Figure: Yue & Burges, 2007)
Optimizing Rank-Based Measures
- Relaxed upper bound
- Structural SVMs for hinge loss relaxation
- SVM-map [Yue et al., 2007]
- Chapelle et al., 2007
- Boosting for exponential loss relaxation
- Zheng et al., 2007
- AdaRank [Xu et al., 2007]
- Smooth approximations for gradient descent
- LambdaRank [Burges et al., 2006]
- SoftRank GP [Snelson & Guiver, 2007]
Structural SVMs
- Let x denote the set of documents/query examples for a query
- Let y denote a (weak) ranking
- Same objective function
- Constraints are defined for each incorrect labeling y' over the set of documents x
- After learning w, a prediction is made by sorting on wᵀxi
[Tsochantaridis et al., 2007]
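The formulation on this slide was lost in extraction; the standard structural SVM with margin rescaling (Tsochantaridis et al.), with joint feature map Ψ and loss Δ, is:

```latex
\min_{w,\,\xi \ge 0}\;\; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\qquad \text{s.t.}\quad
\forall i,\;\forall y' \ne y_i:\;
w^\top \Psi(x_i, y_i) \;\ge\; w^\top \Psi(x_i, y') + \Delta(y_i, y') - \xi_i
```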
Structural SVMs for MAP
- Same quadratic objective, subject to one constraint per incorrect labeling y'
- where rankings are encoded by pairwise variables yij ∈ {−1, +1} (see the reconstruction below)
- The sum of slacks upper bounds the MAP loss
[Yue et al., 2007]
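A reconstruction of the lost formulas, following Yue et al. (2007): a ranking y is encoded by pairwise variables y_ij over (relevant, non-relevant) document pairs; up to normalization details, the feature map and constraints are:

```latex
\Psi(x, y) \;=\; \frac{1}{|\mathcal{R}|\,|\bar{\mathcal{R}}|}
  \sum_{i \in \mathcal{R}} \sum_{j \in \bar{\mathcal{R}}} y_{ij}\,(x_i - x_j),
  \qquad y_{ij} \in \{-1, +1\}
```

```latex
\forall y' \ne y:\quad
w^\top \Psi(x, y) \;\ge\; w^\top \Psi(x, y') + \Delta_{\mathrm{AP}}(y, y') - \xi,
\qquad \Delta_{\mathrm{AP}}(y, y') = 1 - \mathrm{AP}(y')
```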
Too Many Constraints!
- For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in front, e.g., …
- An incorrect labeling would be any other ranking, e.g., …
- Such a ranking might have Average Precision of about 0.8, with Δ(y, y') ≈ 0.2
- Intractable number of rankings, thus an intractable number of constraints!
Structural SVM Training
- STEP 1: Solve the SVM objective function using only the current working set of constraints
- STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints
- STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant, add that constraint to the working set
- Repeat STEPS 1-3 until no additional constraints are added; return the most recent model trained in STEP 1
STEPS 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al., 2005]
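A schematic of this cutting-plane loop. Here solve_qp, most_violated, and slack_of are hypothetical helpers standing in for the QP solver, the per-measure argmax, and the current slack value:

```python
def cutting_plane_train(examples, solve_qp, most_violated, slack_of, epsilon=1e-3):
    """Schematic working-set loop for structural SVM training."""
    working_set = []
    while True:
        model = solve_qp(working_set)                         # STEP 1
        added = False
        for ex in examples:
            constraint, violation = most_violated(model, ex)  # STEP 2
            # STEP 3: add only if violated beyond the current slack + tolerance
            if violation > slack_of(model, ex) + epsilon:
                working_set.append(constraint)
                added = True
        if not added:       # no new constraints: working set is sufficient
            return model
```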
Illustrative Example
- Structural SVM Approach
- Repeatedly finds the next most violated constraint
- until the set of constraints is a good approximation
- Original SVM Problem
- Exponentially many constraints
- Most are dominated by a small set of important constraints
Finding the Most Violated Constraint
- Required for structural SVM training
- Depends on the structure of the loss function
- Depends on the structure of the joint discriminant
- Efficient algorithms exist despite the intractable number of constraints
- More than one approach
- Yue et al., 2007
- Chapelle et al., 2007
Gradient Descent
- Objective function is discontinuous
- Difficult to define a smooth global approximation
- Upper-bound relaxations (e.g., SVMs, Boosting) are sometimes too loose
- We only need the gradient!
- But the objective is discontinuous
- …so the gradient is undefined
- Solution: a smooth approximation of the gradient
- Local approximation
LambdaRank
- Assume an implicit objective function C
- Goal: compute dC/dsi
- si = f(xi) denotes the score of document xi
- Given the gradient on document scores
- Use the chain rule to compute the gradient on the model parameters (of f)
[Burges et al., 2006]
Burges, 2007
- Intuition
- Rank-based measures emphasize the top of the ranking
- Higher-ranked docs should have larger derivatives (Red Arrows)
- Optimizing pairwise preferences emphasizes the bottom of the ranking (Black Arrows)
LambdaRank for NDCG
- The pairwise derivative for a pair (i, j), and the total derivative of output si, are sketched below
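The formulas were lost in extraction. In the form popularized by later write-ups of LambdaRank and LambdaMART (sign conventions vary across papers), the lambdas are roughly:

```latex
\lambda_{ij} \;=\; \frac{-\,\bigl|\Delta \mathrm{NDCG}_{ij}\bigr|}{1 + e^{\,s_i - s_j}},
\qquad
\frac{\partial C}{\partial s_i} \;=\; \lambda_i \;=\;
  \sum_{j:\, y_i > y_j} \lambda_{ij} \;-\; \sum_{j:\, y_j > y_i} \lambda_{ij}
```

where |ΔNDCG_ij| is the change in NDCG from swapping documents i and j in the current ranking.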
Properties of LambdaRank
- There exists a cost function C if…¹
- Amounts to the Hessian of C being symmetric
- If the Hessian is also positive semi-definite, then C is convex
¹ Subject to additional assumptions; see Burges et al., 2006
Summary (Part 1)
- Machine learning is a powerful tool for designing information retrieval models
- Requires a clean formulation of the objective
- Advances
- Ordinal regression
- Dealing with severe class imbalance
- Optimizing rank-based measures via relaxations
- Gradient descent on non-smooth objective functions