Title: Predicting Diverse Subsets Using Structural SVMs
1. Predicting Diverse Subsets Using Structural SVMs
- ICML 2008
- Yisong Yue (Cornell University)
- In collaboration with Thorsten Joachims (Cornell University)
2. Query: "Jaguar"
[Screenshot of search results from 11/27/2007: top of first page, bottom of first page, and result 18]
3. Need for Diversity (in IR)
- Ambiguous queries
  - Users with different information needs issuing the same textual query (e.g., "Jaguar")
  - At least one relevant result for each information need
- Learning queries (Swaminathan et al., 2008)
  - User interested in a specific detail or the entire breadth of knowledge available
  - Results with high information diversity
4. Learning to Rank
- Current methods
  - Real-valued retrieval functions f(q,d)
  - Sort by f(q,d_i) to obtain a ranking
- Benefits
  - Know how to perform learning
  - Can optimize for rank-based performance measures
  - Outperforms traditional IR models
- Drawbacks
  - Cannot account for diversity
  - During prediction, considers each document independently
5. Optimizing Diversity
- Interest in information retrieval
  - Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Swaminathan et al., 2008
- Requires inter-document dependencies
  - Impossible given current learning-to-rank methods
- No consensus on how to measure diversity
6. Contribution
- Formulate as predicting diverse subsets
- Use training data with explicitly labeled subtopics (TREC 6-8 Interactive Track)
- Use a loss function to encode subtopic loss
- Perform training using structural SVMs (Tsochantaridis et al., 2005)
7. Representing Diversity
- Current datasets use manually determined subtopic labels
  - E.g., "Use of robots in the world today"
    - Nanorobots
    - Space mission robots
    - Underwater robots
  - Manual partitioning of the total information regarding a query
  - Relatively reliable
8. Example
- Choose K documents with maximal information coverage
- For K = 3, the optimal set is {D1, D2, D10}
9. Maximizing Subtopic Coverage
- Goal: select K documents which collectively cover as many subtopics as possible
- Exact selection requires examining all (n choose K) subsets
- Essentially a set cover problem
  - Greedy selection gives a (1 - 1/e)-approximation bound (see the sketch below)
  - Special case of Max Coverage (Khuller et al., 1997)
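To make the greedy selection concrete, here is a minimal, self-contained Python sketch; the documents and subtopic labels below are illustrative only, not taken from the TREC data.

# Greedy K-document selection for subtopic coverage (sketch).
# Each document is represented by the set of subtopics it covers;
# the documents below are made up for illustration.
def greedy_subtopic_coverage(doc_subtopics, K):
    """Pick K documents that greedily maximize the number of covered subtopics."""
    selected, covered = [], set()
    remaining = dict(doc_subtopics)
    for _ in range(min(K, len(remaining))):
        # Choose the document adding the most not-yet-covered subtopics.
        best_doc = max(remaining, key=lambda d: len(remaining[d] - covered))
        selected.append(best_doc)
        covered |= remaining.pop(best_doc)
    return selected, covered

docs = {
    "D1": {"t1"},
    "D2": {"t1", "t2", "t3"},
    "D3": {"t1", "t3"},
    "D4": {"t4"},
}
print(greedy_subtopic_coverage(docs, K=2))
# (['D2', 'D4'], {'t1', 't2', 't3', 't4'})  -- set print order may vary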
10. Weighted Word Coverage
- More distinct words = more information
- Weight word importance
  - Does not depend on human labels
- Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  - Greedy selection also yields a (1 - 1/e) bound
  - Need to find a good weighting function (learning problem)
12. Example

Word benefits:
  V1: 1   V2: 2   V3: 3   V4: 4   V5: 5

Document word counts:
        V1  V2  V3  V4  V5
  D1            X   X   X
  D2        X       X   X
  D3    X   X   X   X

Marginal benefit:
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2   --    2    3   D3
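The following short Python sketch reproduces the marginal-benefit table above; the per-document word sets are the assignments implied by the benefit sums in the table.

# Greedy weighted word coverage, reproducing the worked example above.
# Word benefits come from the first table; the document word sets are
# those implied by the marginal-benefit sums.
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}

def greedy_weighted_coverage(docs, benefit, K):
    selected, covered = [], set()
    candidates = dict(docs)
    for it in range(K):
        # Marginal benefit of each candidate: total weight of its not-yet-covered words.
        gains = {d: sum(benefit[v] for v in words - covered)
                 for d, words in candidates.items()}
        best = max(gains, key=gains.get)
        print(f"Iter {it + 1}: {gains} -> pick {best}")
        selected.append(best)
        covered |= candidates.pop(best)
    return selected

greedy_weighted_coverage(docs, benefit, K=2)
# Iter 1: {'D1': 12, 'D2': 11, 'D3': 10} -> pick D1
# Iter 2: {'D2': 2, 'D3': 3} -> pick D3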
13. Related Work Comparison
- Essential Pages (Swaminathan et al., 2008)
  - Uses a fixed function of word benefit
  - Depends on word frequency in the candidate set
- Our method directly learns the word benefit function
  - Feature space based on word frequency
  - Optimizes for subtopic loss
- First learning method for optimizing subtopic loss (to our knowledge)
14. Linear Discriminant
- x = (x1, x2, ..., xn): candidate documents
- y: subset of x
- V(y): union of words from the documents in y
- Discriminant function (written out below)
  - φ(v,x): frequency features (e.g., word appears in at least 10%, 20%, etc. of the candidate documents)
  - Benefit of covering word v is then w^T φ(v,x)
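Written out, a discriminant consistent with these bullets sums the learned benefit of every distinct word covered by the subset; this mirrors the joint feature map form used by structural SVMs (the paper's exact normalization may differ slightly):

\[
F(\mathbf{x},\mathbf{y};\mathbf{w}) \;=\; \mathbf{w}^{\top}\Psi(\mathbf{x},\mathbf{y}),
\qquad
\Psi(\mathbf{x},\mathbf{y}) \;=\; \sum_{v \in V(\mathbf{y})} \phi(v,\mathbf{x}).
\]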
15. Linear Discriminant
- Does NOT reward redundancy
  - Benefit of each word is only counted once
- Greedy prediction has a (1 - 1/e)-approximation bound
- Linear (joint feature space)
  - Allows for SVM optimization
16. More Sophisticated Discriminant
- Documents cover words to different degrees
  - A document with 5 copies of "Thorsten" might cover it better than another document with only 2 copies
- Use multiple word sets V1(y), V2(y), ..., VL(y)
  - Each Vi(y) contains only words satisfying certain importance criteria
17. More Sophisticated Discriminant
- Separate φ_i for each importance level i
- Joint feature map Ψ is the concatenation of the per-level feature maps Ψ_i (see below)
- Greedy prediction still has a (1 - 1/e)-approximation bound
- Still uses a linear feature space
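A sketch of the level-wise form implied by these bullets, assuming L importance levels (how each V_i is thresholded is a modeling choice):

\[
\Psi(\mathbf{x},\mathbf{y}) = \big(\Psi_1(\mathbf{x},\mathbf{y}),\ldots,\Psi_L(\mathbf{x},\mathbf{y})\big),
\qquad
\Psi_i(\mathbf{x},\mathbf{y}) = \sum_{v \in V_i(\mathbf{y})} \phi_i(v,\mathbf{x}).
\]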
18. Conventional SVMs
- Input: x (high-dimensional point)
- Target: y (either +1 or -1)
- Prediction: sign(w^T x)
- Training: minimize a regularized objective subject to margin constraints (shown below)
  - The sum of slacks upper bounds the accuracy loss
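For reference, one standard form of the soft-margin training problem referenced above; since ξ_i ≥ 1 whenever example i is misclassified, the sum of slacks upper bounds the training error:

\[
\min_{\mathbf{w},\,\boldsymbol{\xi} \ge 0}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\,\mathbf{w}^{\top}\mathbf{x}_i \;\ge\; 1 - \xi_i \quad \forall i.
\]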
19. Adapting to Predicting Subsets
- Input: x (candidate set of documents)
- Target: y (subset of x of size K)
- Same objective function
- One constraint for each incorrect labeling y'
  - Score of the correct labeling must be at least as large as that of each incorrect y', plus its loss (shown below)
  - Weighted subtopic loss (defined later)
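With joint feature map Ψ and loss Δ as above, this is the margin-rescaling structural SVM of Tsochantaridis et al., 2005:

\[
\min_{\mathbf{w},\,\boldsymbol{\xi} \ge 0}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
\mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) \;\ge\; \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) + \Delta(\mathbf{y}^{(i)},\mathbf{y}) - \xi_i
\quad \forall i,\ \forall \mathbf{y} \ne \mathbf{y}^{(i)}.
\]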
20. Illustrative Example
- Original SVM problem
  - Exponentially many constraints
  - Most are dominated by a small set of important constraints
- Structural SVM approach
  - Repeatedly find the next most violated constraint, until the set of constraints is a good approximation (see the sketch below)
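A minimal Python sketch of this cutting-plane idea; joint_feature, loss, find_most_violated, and solve_qp_over_working_set are hypothetical placeholders standing in for the feature map Ψ, the loss Δ, loss-augmented inference, and the QP solver:

# Cutting-plane training loop (sketch, one slack per example).
# Hypothetical helpers (not defined here):
#   joint_feature(x, y)             -> Psi(x, y) as a numpy vector
#   loss(y_true, y)                 -> Delta(y_true, y)
#   find_most_violated(w, x, y)     -> argmax_y' [Delta(y, y') + w . Psi(x, y')]
#   solve_qp_over_working_set(W, C) -> (w, xi) optimized over constraints in W
import numpy as np

def cutting_plane_train(data, dim, C=1.0, eps=1e-3):
    w = np.zeros(dim)
    xi = {i: 0.0 for i in range(len(data))}       # one slack per example
    working_set = []                               # [(i, delta_psi, delta_loss)]
    while True:
        new_constraints = 0
        for i, (x, y_true) in enumerate(data):
            y_hat = find_most_violated(w, x, y_true)
            delta_psi = joint_feature(x, y_true) - joint_feature(x, y_hat)
            # Constraint: w . delta_psi >= loss(y_true, y_hat) - xi_i
            if loss(y_true, y_hat) - w.dot(delta_psi) > xi[i] + eps:
                working_set.append((i, delta_psi, loss(y_true, y_hat)))
                new_constraints += 1
        if new_constraints == 0:
            break                                  # working set approximates the full problem
        w, xi = solve_qp_over_working_set(working_set, C)
    return w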
24. Weighted Subtopic Loss
- Example
  - x1 covers t1
  - x2 covers t1, t2, t3
  - x3 covers t1, t3
- Motivation
  - Higher penalty for not covering popular subtopics
  - Mitigates effects of label noise in tail subtopics

  Subtopic   Docs   Loss
  t1         3      1/2
  t2         1      1/6
  t3         2      1/3
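One weighting consistent with the table above: each subtopic is weighted by its share of the document-subtopic coverage pairs, and the loss of a predicted subset is the total weight of the subtopics it fails to cover:

\[
w(t) \;=\; \frac{\#\{\text{docs covering } t\}}{\sum_{t'} \#\{\text{docs covering } t'\}},
\qquad
\Delta(\mathbf{y}^{(i)},\mathbf{y}) \;=\; \sum_{t \,\text{not covered by}\, \mathbf{y}} w(t).
\]

Here w(t1) = 3/6 = 1/2, w(t2) = 1/6, and w(t3) = 2/6 = 1/3, matching the table.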
25. TREC Experiments
- TREC 6-8 Interactive Track queries
  - Documents labeled into subtopics
  - 17 queries used
  - Considered only relevant docs
    - Decouples the relevance problem from the diversity problem
  - 45 docs/query, 20 subtopics/query, 300 words/doc
26. TREC Experiments
- 12/4/1 train/validation/test split
  - Approx. 500 documents in the training set
  - Permuted until all 17 queries were tested once
- Set K = 5 (some queries have very few documents)
- SVM-div uses term-frequency thresholds to define importance levels
- SVM-div2 additionally uses TFIDF thresholds
27. TREC Results

  Method             Loss
  Random             0.469
  Okapi              0.472
  Unweighted Model   0.471
  Essential Pages    0.434
  SVM-div            0.349
  SVM-div2           0.382

  Comparison                 W / T / L
  SVM-div vs. Ess. Pages     14 / 0 / 3
  SVM-div2 vs. Ess. Pages    13 / 0 / 4
  SVM-div vs. SVM-div2        9 / 6 / 2
28. Can expect further benefit from having more training data
29. Consistently outperforms Essential Pages
30. Summary
- Formulated diversified retrieval as predicting diverse subsets
- Efficient training and prediction algorithms
- Used weighted word coverage as a proxy for information coverage
- Encode diversity criteria using the loss function
  - Weighted subtopic loss
- http://projects.yisongyue.com/svmdiv/
31. Extra Slides
34. Essential Pages (Swaminathan et al., 2008)
- x = (x1, x2, ..., xn): set of candidate documents for a query
- y: a subset of x of size K (our prediction)
- Score combines:
  - the benefit of covering word v with document x_i
  - the importance of covering word v
- Intuition
  - Frequent words cannot encode information diversity
  - Infrequent words do not provide significant information
37. Finding Most Violated Constraint
- A constraint is violated when the loss-augmented score of an incorrect labeling exceeds the score of the correct labeling by more than the slack
- Finding the most violated constraint reduces to a loss-augmented prediction problem (written out below)
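In the notation of the earlier slides, the constraint for example i and labeling y is violated when

\[
\Delta(\mathbf{y}^{(i)},\mathbf{y}) + \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) - \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) \;>\; \xi_i,
\]

so the most violated constraint is found by loss-augmented prediction:

\[
\hat{\mathbf{y}} \;=\; \operatorname*{argmax}_{\mathbf{y}}\; \Big[\, \Delta(\mathbf{y}^{(i)},\mathbf{y}) + \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) \,\Big].
\]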
38. Finding Most Violated Constraint
- Encode each subtopic as an additional word to be covered
- Use greedy prediction to find an approximately most violated constraint
39. Approximate Constraint Generation
- Theoretical guarantees no longer hold
  - Might not find an epsilon-close approximation to the feasible region boundary
- Performs well in practice

40. Approximate constraint generation seems to perform well
42. Synthetic Dataset
- TREC dataset is very small
- Synthetic dataset lets us vary the retrieval size K
- 100 queries
  - 100 docs/query, 25 subtopics/query, 300 words/doc
- 15/10/75 train/valid/test split