Predicting Diverse Subsets Using Structural SVMs - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Predicting Diverse Subsets Using Structural SVMs


1
Predicting Diverse Subsets Using Structural SVMs
  • ICML 2008
  • Yisong Yue
  • Cornell University
  • In collaboration with
  • Thorsten Joachims (Cornell University)

2
Query: "Jaguar"
Screenshot: top of first page, bottom of first page, result #18
Results from 11/27/2007
3
Need for Diversity (in IR)
  • Ambiguous queries
  • Users with different information needs issuing
    the same textual query (e.g., "Jaguar")
  • At least one relevant result for each information
    need
  • Learning queries
  • User interested in a specific detail or the entire
    breadth of knowledge available
  • [Swaminathan et al., 2008]
  • Results with high information diversity

4
Learning to Rank
  • Current methods
  • Real-valued retrieval functions f(q,d)
  • Sort by f(q,di) to obtain a ranking (see the
    sketch below)
  • Benefits
  • Know how to perform learning
  • Can optimize for rank-based performance measures
  • Outperforms traditional IR models
  • Drawbacks
  • Cannot account for diversity
  • During prediction, considers each document
    independently
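A minimal sketch of the prediction step described above, using a placeholder scoring function: each document is scored independently by f(q, d) and the ranking is just a sort, which is why inter-document diversity cannot be expressed.

```python
def rank(query, docs, f):
    """Rank documents by a real-valued retrieval function f(q, d).

    Each document is scored independently of the others, so the
    resulting ranking cannot account for inter-document redundancy.
    """
    return sorted(docs, key=lambda d: f(query, d), reverse=True)

# Toy usage with a placeholder scoring function (term overlap):
docs = ["jaguar car review", "jaguar habitat", "jaguar car prices"]
f = lambda q, d: len(set(q.split()) & set(d.split()))
print(rank("jaguar car", docs))
# ['jaguar car review', 'jaguar car prices', 'jaguar habitat']
```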

5
Optimizing Diversity
  • Interest in information retrieval
  • [Carbonell & Goldstein, 1998; Zhai et al., 2003;
    Zhang et al., 2005; Chen & Karger, 2006;
    Swaminathan et al., 2008]
  • Requires inter-document dependencies
  • Impossible given current learning to rank methods
  • No consensus on how to measure diversity.

6
Contribution
  • Formulate as predicting diverse subsets
  • Use training data with explicitly labeled
    subtopics (TREC 6-8 Interactive Track)
  • Use loss function to encode subtopic loss
  • Perform training using structural SVMs
  • [Tsochantaridis et al., 2005]

7
Representing Diversity
  • Current datasets use manually determined subtopic
    labels
  • E.g., "Use of robots in the world today"
  • Nanorobots
  • Space mission robots
  • Underwater robots
  • Manual partitioning of the total information
    regarding a query
  • Relatively reliable

8
Example
  • Choose K documents with maximal information
    coverage.
  • For K = 3, the optimal set is {D1, D2, D10}

9
Maximizing Subtopic Coverage
  • Goal: select K documents which collectively cover
    as many subtopics as possible.
  • Exhaustive selection must consider (n choose K)
    subsets.
  • Essentially a set cover problem.
  • Greedy selection gives a (1-1/e)-approximation
    bound (see the sketch below).
  • Special case of Max Coverage (Khuller et al., 1997)
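A minimal sketch of the greedy subtopic-coverage selection referenced above; the document-to-subtopic mapping and the function name are illustrative, not from the paper.

```python
def greedy_subtopic_cover(doc_subtopics, K):
    """Greedily pick K documents maximizing subtopic coverage.

    doc_subtopics: dict mapping doc id -> set of subtopic labels.
    Greedy selection achieves a (1 - 1/e)-approximation of the
    optimal coverage (max coverage guarantee).
    """
    selected, covered = [], set()
    remaining = dict(doc_subtopics)
    for _ in range(min(K, len(remaining))):
        # Pick the document that adds the most uncovered subtopics.
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Toy usage with hypothetical documents and subtopics:
docs = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
print(greedy_subtopic_cover(docs, K=2))
# (['x2', 'x1'], {'t1', 't2', 't3'})  -- set print order may vary
```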

10
Weighted Word Coverage
  • More distinct words ⇒ more information
  • Weight word importance
  • Does not depend on human labels
  • Goal: select K documents which collectively cover
    as many distinct (weighted) words as possible
  • Greedy selection also yields (1-1/e) bound.
  • Need to find good weighting function (learning
    problem).

11
Example
Word benefits:
  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5
Document word counts:
       V1  V2  V3  V4  V5
  D1            X   X   X
  D2        X       X   X
  D3    X   X   X   X
Marginal benefit:
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2
12
Example
Word benefits:
  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5
Document word counts:
       V1  V2  V3  V4  V5
  D1            X   X   X
  D2        X       X   X
  D3    X   X   X   X
Marginal benefit (reproduced by the sketch below):
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2   --    2    3   D3
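A short sketch that reproduces the greedy marginal-benefit computation in the table above, using the word assignments reconstructed from the example; the function name is illustrative.

```python
# Word benefits and (reconstructed) document word sets from the example above.
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}

def greedy_weighted_cover(docs, benefit, K):
    """Greedily select K documents maximizing total benefit of covered words."""
    selected, covered = [], set()
    remaining = dict(docs)
    for _ in range(K):
        # Marginal benefit: sum of benefits of newly covered words.
        gains = {d: sum(benefit[v] for v in words - covered)
                 for d, words in remaining.items()}
        best = max(gains, key=gains.get)
        print(f"Iter {len(selected) + 1}: gains={gains} -> pick {best}")
        selected.append(best)
        covered |= remaining.pop(best)
    return selected

greedy_weighted_cover(docs, benefit, K=2)
# Iter 1: gains={'D1': 12, 'D2': 11, 'D3': 10} -> pick D1
# Iter 2: gains={'D2': 2, 'D3': 3} -> pick D3
```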
13
Related Work Comparison
  • Essential Pages [Swaminathan et al., 2008]
  • Uses a fixed word-benefit function
  • Depends on word frequency in the candidate set
  • Our method directly learns the word-benefit function
  • Feature space based on word frequency
  • Optimizes for subtopic loss
  • First learning method for optimizing subtopic
    loss
  • (to our knowledge)

14
Linear Discriminant
  • x = (x1, x2, ..., xn): candidate documents
  • y: a subset of x
  • V(y): union of words from the documents in y
  • Discriminant function (see the sketch below)
  • φ(v,x): frequency features (e.g., word appears in
    at least 10%, 20%, ... of the candidate documents)
  • Benefit of covering word v is then w^T φ(v,x)
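A plausible rendering of the discriminant function the bullets refer to, following the joint feature-map formulation used later in the talk (the exact notation on the original slide may differ):

```latex
% Total learned benefit of the words covered by the predicted subset y
F(x, y; w) = w^\top \Psi(x, y),
\qquad
\Psi(x, y) = \sum_{v \in V(y)} \phi(v, x)
```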

15
Linear Discriminant
  • Does NOT reward redundancy
  • Benefit of each word only counted once
  • Greedy has (1-1/e)-approximation bound
  • Linear in the joint feature space
  • Allows for SVM optimization

16
More Sophisticated Discriminant
  • Documents cover words to different degrees
  • A document with 5 copies of "Thorsten" might
    cover it better than another document with only 2
    copies.
  • Use multiple word sets, V1(y), V2(y), ..., VL(y)
  • Each Vi(y) contains only words satisfying certain
    importance criteria.

17
More Sophisticated Discriminant
  • Separate φi for each importance level i.
  • Joint feature map Ψ is the vector composition of
    all φi (see the sketch below)
  • Greedy has (1-1/e)-approximation bound.
  • Still uses linear feature space.
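A sketch of the level-wise feature map described above, assuming the per-level maps are simply stacked into one joint vector:

```latex
\Psi(x, y) =
\begin{pmatrix}
  \sum_{v \in V_1(y)} \phi_1(v, x) \\
  \vdots \\
  \sum_{v \in V_L(y)} \phi_L(v, x)
\end{pmatrix},
\qquad
F(x, y; w) = w^\top \Psi(x, y)
```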

18
Conventional SVMs
  • Input: x (high-dimensional point)
  • Target: y (either +1 or -1)
  • Prediction: sign(w^T x)
  • Training: minimize a regularized objective
  • subject to margin constraints (see the sketch below)
  • The sum of slacks upper bounds the
    accuracy loss
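The standard soft-margin SVM training problem these bullets refer to, reconstructed here because the slide's formula did not survive extraction (the exact scaling of the slack term may differ):

```latex
\min_{w,\ \xi \ge 0} \ \frac{1}{2}\lVert w \rVert^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
\forall i:\ \ y_i\, w^\top x_i \ \ge\ 1 - \xi_i
```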

19
Adapting to Predicting Subsets
  • Input: x (candidate set of documents)
  • Target: y (subset of x of size K)
  • Same objective function
  • One constraint for each incorrect labeling y'
  • Score of the correct y at least as large as that of
    any incorrect y' plus its loss (see the sketch below)
  • Weighted subtopic loss (defined later)
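The margin constraints this slide describes, written in the joint feature-map notation (reconstructed; the original slide showed them as an image):

```latex
\forall i,\ \forall y' \ne y_i:\quad
w^\top \Psi(x_i, y_i) \ \ge\ w^\top \Psi(x_i, y') + \Delta(y_i, y') - \xi_i
```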

20
Illustrative Example
  • Original SVM problem
  • Exponentially many constraints
  • Most are dominated by a small set of important
    constraints
  • Structural SVM approach
  • Repeatedly finds the next most violated
    constraint...
  • ...until the set of constraints is a good
    approximation (see the sketch below).
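A minimal sketch of the cutting-plane loop outlined on this slide, following the general structural SVM recipe [Tsochantaridis et al., 2005]; the helper names (most_violated, solve_qp) are illustrative placeholders, not the authors' code.

```python
def cutting_plane_train(examples, most_violated, solve_qp,
                        epsilon=1e-3, max_iters=100):
    """Cutting-plane training loop for a structural SVM (sketch).

    examples:      list of (x, y) training pairs.
    most_violated: (w, xi, x, y) -> (y_hat, violation); loss-augmented
                   inference returning the most violated labeling and
                   by how much its constraint is violated under (w, xi).
    solve_qp:      working_set -> (w, xi); solves the QP restricted to
                   the current working set of constraints.
    """
    working_set = []
    w, xi = solve_qp(working_set)            # start with no constraints
    for _ in range(max_iters):
        added = False
        for x, y in examples:
            y_hat, violation = most_violated(w, xi, x, y)
            if violation > epsilon:          # violated by more than epsilon
                working_set.append((x, y, y_hat))
                added = True
        if not added:                        # working set is a good approximation
            break
        w, xi = solve_qp(working_set)        # re-solve with new constraints
    return w
```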

24
Weighted Subtopic Loss
  • Example
  • x1 covers t1
  • x2 covers t1,t2,t3
  • x3 covers t1,t3
  • Motivation
  • Higher penalty for not covering popular subtopics
  • Mitigates effects of label noise in tail
    subtopics

Subtopic   # Docs   Loss
  t1         3      1/2
  t2         1      1/6
  t3         2      1/3
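One consistent way to write the weighted subtopic loss illustrated by the table above, with each subtopic weighted by how many candidate documents cover it (matching the weights 3/6, 1/6, 2/6 in the example); the exact normalization on the original slide may differ:

```latex
\Delta(y_i, y) \;=\; \sum_{t \in T_i} w_t \,\mathbf{1}\big[\, t \text{ not covered by } y \,\big],
\qquad
w_t \;=\; \frac{\#\{\text{docs in } x_i \text{ covering } t\}}{\sum_{t' \in T_i} \#\{\text{docs in } x_i \text{ covering } t'\}}
```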
25
TREC Experiments
  • TREC 6-8 Interactive Track Queries
  • Documents labeled into subtopics.
  • 17 queries used
  • Considered only relevant docs
  • Decouples the relevance problem from the diversity
    problem
  • 45 docs/query, 20 subtopics/query, 300 words/doc

26
TREC Experiments
  • 12/4/1 train/valid/test split
  • Approx 500 documents in training set
  • Permuted until all 17 queries were tested once
  • Set K = 5 (some queries have very few documents)
  • SVM-div uses term frequency thresholds to
    define importance levels
  • SVM-div2 in addition uses TFIDF thresholds

27
TREC Results
Method              Loss
Random              0.469
Okapi               0.472
Unweighted Model    0.471
Essential Pages     0.434
SVM-div             0.349
SVM-div2            0.382

Methods                       W / T / L
SVM-div vs. Ess. Pages        14 / 0 / 3
SVM-div2 vs. Ess. Pages       13 / 0 / 4
SVM-div vs. SVM-div2           9 / 6 / 2
28
Can expect further benefit from having more
training data.
29
Consistently outperforms Essential Pages
30
Summary
  • Formulated diversified retrieval as predicting
    diverse subsets
  • Efficient training and prediction algorithms
  • Used weighted word coverage as a proxy for
    information coverage
  • Encoded diversity criteria using a loss function
  • Weighted subtopic loss

http://projects.yisongyue.com/svmdiv/
31
Extra Slides
32
Essential Pages
33
Essential Pages
x = (x1, x2, ..., xn): set of candidate documents for
a query; y: a subset of x of size K (our prediction).
The score combines the benefit of covering word v with
document xi and the importance of covering word v.
[Swaminathan et al., 2008]
34
Essential Pages
x = (x1, x2, ..., xn): set of candidate documents for
a query; y: a subset of x of size K (our prediction).
The score combines the benefit of covering word v with
document xi and the importance of covering word v.
  • Intuition
  • Frequent words cannot encode information
    diversity.
  • Infrequent words do not provide significant
    information

[Swaminathan et al., 2008]
36
Finding Most Violated Constraint
37
Finding Most Violated Constraint
  • A constraint is violated when its slack-adjusted
    margin requirement fails (see the sketch below)
  • Finding the most violated constraint reduces to
    loss-augmented prediction
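Reconstructed from the standard structural SVM formulation [Tsochantaridis et al., 2005], since the slide's formulas were images:

```latex
% A constraint for (x_i, y_i) is violated when
\Delta(y_i, y) + w^\top \Psi(x_i, y) - w^\top \Psi(x_i, y_i) \;>\; \xi_i
% so finding the most violated constraint reduces to
\hat{y} \;=\; \arg\max_{y} \big[\, \Delta(y_i, y) + w^\top \Psi(x_i, y) \,\big]
```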

38
Finding Most Violated Constraint
  • Encode each subtopic as an additional word to
    be covered.
  • Use greedy prediction to find approximate most
    violated constraint.

39
Approximate Constraint Generation
  • Theoretical guarantees no longer hold.
  • Might not find an epsilon-close approximation to
    the feasible region boundary.
  • Performs well in practice.

40
Approximate constraint generation seems to perform
well.
41
Synthetic Dataset
42
Synthetic Dataset
  • TREC dataset is very small
  • Synthetic dataset lets us vary the retrieval size K
  • 100 queries
  • 100 docs/query, 25 subtopics/query, 300 words/doc
  • 15/10/75 train/valid/test split